CN117201871A - Audio and video processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN117201871A
Authority
CN
China
Prior art keywords
delay
audio
data
prediction
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311198558.0A
Other languages
Chinese (zh)
Inventor
曾凡志
Current Assignee
Douyin Vision Co Ltd
Original Assignee
Douyin Vision Co Ltd
Priority date
Filing date
Publication date
Application filed by Douyin Vision Co Ltd filed Critical Douyin Vision Co Ltd
Priority to CN202311198558.0A priority Critical patent/CN117201871A/en
Publication of CN117201871A publication Critical patent/CN117201871A/en
Pending legal-status Critical Current

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The embodiment of the disclosure provides an audio and video processing method, an audio and video processing device, electronic equipment and a storage medium. The method comprises the following steps: acquiring data packet associated information corresponding to played audio and video data in a plurality of historical time slices in the process of playing the audio and video; determining feature data to be used corresponding to each historical time slice based on the data packet associated information; determining a target prediction delay corresponding to the current time slice based on the feature data to be used and a predetermined delay prediction model; and processing the audio and video data packets to be played, which are cached in the jitter buffer, based on the target prediction delay. According to the technical scheme, the prediction accuracy and the prediction efficiency of the jitter delay are improved, the algorithm iteration period is shortened, the playing quality of the audio and video is further optimized, and the viewing experience of the user is improved.

Description

Audio and video processing method and device, electronic equipment and storage medium
Technical Field
The embodiment of the disclosure relates to the technical field of audio and video processing, in particular to an audio and video processing method, an audio and video processing device, electronic equipment and a storage medium.
Background
In the audio and video playing process, the delay index and the stalling (blocking) index are important indexes for measuring real-time communication quality. As the last link in the data packet transmission chain before data packets are delivered to the decoding module, the jitter buffer is an important module in real-time audio and video processing. The jitter buffer can handle conditions such as data packet loss, out-of-order arrival and delayed arrival, smoothly output data packets/frames to the decoding module, resist the influence of various weak network environments on playing/rendering, reduce blocking and improve the viewing experience of users. In practical application, reasonably predicting the jitter delay of the jitter buffer is an important link in improving user experience.
In the related art, when predicting the jitter delay of the jitter buffer, a developer usually analyzes historical jitter delay based on a manual statistical algorithm, and then carries out iterative optimization through processes such as summarizing, algorithm tuning, version release and on-line A/B experiments, so as to finally obtain the predicted delay.
However, the whole prediction process of the jitter delay is very cumbersome and time-consuming, which greatly affects the algorithm iteration speed. In addition, manual analysis is often biased, and the predicted jitter delay may be excessively large or small, which reduces the prediction efficiency and accuracy of the jitter delay.
Disclosure of Invention
The embodiment of the disclosure provides an audio and video processing method, an audio and video processing device, electronic equipment and a storage medium, so that the effects of improving the prediction accuracy and the prediction efficiency of jitter delay are achieved, the algorithm iteration period is reduced, the video playing quality of audio and video is further optimized, and the watching experience of a user is improved.
In a first aspect, an embodiment of the present disclosure provides an audio/video processing method, including:
acquiring data packet associated information corresponding to played audio and video data in a plurality of historical time slices in the process of playing the audio and video;
determining feature data to be used corresponding to each historical time slice based on the data packet association information;
determining a target prediction delay corresponding to the current time slice based on the feature data to be used and a predetermined delay prediction model;
and processing the audio and video data packet to be played, which is cached in the jitter buffer area, based on the target prediction delay.
In a second aspect, an embodiment of the present disclosure further provides an audio/video processing apparatus, including:
the associated information acquisition module is used for acquiring corresponding data packet associated information of played audio and video data in a plurality of historical time slices in the process of playing the audio and video;
The characteristic data determining module is used for determining characteristic data to be used corresponding to each historical time slice based on the data packet association information;
the delay prediction module is used for determining a target prediction delay corresponding to the current time slice based on the feature data to be used and a predetermined delay prediction model;
and the data packet processing module is used for processing the audio and video data packet to be played, which is cached in the jitter buffer area, based on the target prediction delay.
In a third aspect, embodiments of the present disclosure further provide an electronic device, including:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the audio video processing method as described in any of the embodiments of the present disclosure.
In a fourth aspect, the presently disclosed embodiments also provide a storage medium containing computer-executable instructions that, when executed by a computer processor, are used to perform an audio-video processing method as described in any of the presently disclosed embodiments.
According to the technical scheme of the embodiment of the disclosure, the data packet associated information corresponding to the played audio and video data in the plurality of historical time slices is obtained in the process of playing the audio and video; the feature data to be used corresponding to each historical time slice is then determined based on the data packet associated information; the target prediction delay corresponding to the current time slice is determined based on the feature data to be used and a predetermined delay prediction model; and finally the audio and video data packets to be played, which are cached in the jitter buffer, are processed based on the target prediction delay. This solves the problems that the jitter delay prediction process is very cumbersome, the period consumed is long, the algorithm iteration speed is affected, and the prediction efficiency and accuracy of the jitter delay are reduced; it achieves the effect of improving the prediction accuracy and prediction efficiency of the jitter delay, shortens the algorithm iteration period, further optimizes the playing quality of the audio and video, and improves the viewing experience of the user.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
FIG. 1 is a schematic diagram of the related art in the case where the target prediction delay is too small;
FIG. 2 is a schematic diagram of the related art in the case where the target prediction delay is too large;
FIG. 3 is a schematic diagram of a target prediction delay determined in accordance with an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a target prediction delay determination process provided in accordance with an embodiment of the present disclosure;
fig. 5 is a flowchart of an audio/video processing method according to an embodiment of the disclosure;
fig. 6 is a flowchart of an audio/video processing method according to an embodiment of the disclosure;
fig. 7 is a flowchart of an audio/video processing method according to an embodiment of the disclosure;
fig. 8 is a schematic structural diagram of an audio/video processing device according to an embodiment of the disclosure;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are open-ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a" and "a plurality of" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
It will be appreciated that, prior to using the technical solutions disclosed in the embodiments of the present disclosure, the user should be informed, in an appropriate manner and in accordance with the relevant legal regulations, of the type, scope of use, usage scenario, etc. of the personal information involved in the present disclosure, and the user's authorization should be obtained.
For example, in response to receiving an active request from a user, a prompt is sent to the user to explicitly inform the user that the operation it is requesting to perform will require the acquisition and use of the user's personal information. Thus, the user can autonomously choose, according to the prompt information, whether to provide personal information to software or hardware such as an electronic device, an application program, a server or a storage medium that executes the operations of the technical scheme of the present disclosure.
As an alternative but non-limiting implementation, in response to receiving an active request from a user, the manner in which the prompt information is sent to the user may be, for example, a popup, in which the prompt information may be presented in a text manner. In addition, a selection control for the user to select to provide personal information to the electronic device in a 'consent' or 'disagreement' manner can be carried in the popup window.
It will be appreciated that the above-described notification and user authorization process is merely illustrative and not limiting of the implementations of the present disclosure, and that other ways of satisfying relevant legal regulations may be applied to the implementations of the present disclosure.
It will be appreciated that the data (including but not limited to the data itself, the acquisition or use of the data) involved in the present technical solution should comply with the corresponding legal regulations and the requirements of the relevant regulations.
Before the present technical solution is introduced, an application scenario may be illustrated. The technical scheme of the embodiment of the disclosure can be applied to a scene of predicting the delay of any time slice in audio and video. In the related art, as the final link before the transmission link delivers data to the decoder, the jitter buffer is an important module in real-time audio and video. The jitter buffer can handle conditions such as data packet loss, out-of-order arrival and delayed arrival, smoothly output data packets/frames to the decoding module, resist the influence of various weak network environments on playing/rendering, reduce blocking and improve the viewing experience of users. Generally, determining a target prediction delay for any time slice in the audio and video, so as to process the audio and video data buffered in the jitter buffer based on the target prediction delay, is an important way to improve the audio and video playing effect. In the related art, the target delay is usually predicted by means such as probability histogram jitter estimation, peak value calculation jitter estimation or Kalman jitter estimation. However, in the case where the predicted target prediction delay is too small, some time slices may have an actual delay greater than the target prediction delay while others have an actual delay less than the target prediction delay, as shown in fig. 1. In fig. 1, the dashed line is the predicted target delay corresponding to each time slice, each rectangular bar represents a time slice, and the height of the rectangular bar represents the actual delay corresponding to that time slice.
For a time slice with the actual delay greater than the target prediction delay, that is, a time slice with the height of the rectangular bar higher than the position of the dotted line, the actual delay of the time slice and the target prediction delay are differentiated, the determined difference value can be expressed as the part of the rectangular bar above the dotted line, and as a blocking loss, the situation that the audio and video data played in the time slice can be blocked can be indicated. For a time slice with the actual delay less than the target prediction delay, that is, a time slice with the height of the rectangular bar lower than the position of the dotted line, the target prediction delay of the time slice is differentiated from the actual delay, and the determined difference value can be represented as a blank part below the dotted line as delay loss, so that the situation that delay waste exists in the audio and video data played in the time slice can be indicated. Alternatively, in the case where the predicted target prediction delay is too large, there may be a case where the actual delay corresponding to each time slice is smaller than the target prediction delay, as shown in fig. 2. The dashed line in fig. 2 is the predicted target delay corresponding to each time slice, and the height of the rectangular bar indicates the actual delay corresponding to the time slice, as shown in fig. 2, where the actual delay corresponding to each time slice is smaller than the predicted target delay, which can indicate that the predicted target delay is unreasonable and delay waste exists.
Based on the above, by adopting the technical scheme provided by the embodiment of the disclosure, the data packet associated information corresponding to the played audio and video data in a large number of historical time slices can be obtained, and further, the actual delay corresponding to the historical time slices can be determined according to the data packet associated information. Further, model input data can be built based on the determined actual time delay, and the built model input data is input into a time delay prediction model to obtain a target prediction time delay corresponding to the current time slice. Therefore, the audio and video data in the jitter buffer can be adjusted by the target prediction delay. As shown in fig. 3, the dashed line in fig. 3 is the target prediction delay determined based on the technical solution provided in the embodiments of the present disclosure. As can be seen from fig. 3, the dotted line is closely attached to most of the rectangular bars, which can indicate that the target prediction delay is reasonable. Therefore, the effect of accurately estimating jitter is achieved, and further, the experience of a user when watching audio and video can be improved.
Exemplarily, fig. 4 shows a flow chart of a method for determining the target prediction delay. As can be seen from fig. 4, in the case that the plurality of historical time slices corresponding to the current time slice are X historical time slices, namely historical time slice 1, historical time slice 2, …, historical time slice X, each historical time slice corresponds to one piece of feature data to be used, and a feature data sequence [x1, x2, …, xX] can be constructed from the feature data to be used corresponding to each historical time slice, wherein each element in the sequence represents the feature data to be used corresponding to the respective historical time slice. Further, [x1, x2, …, xX] may be processed based on manually designed algorithms such as probability histogram, periodic peak prediction and Kalman prediction, or based on machine learning algorithms such as support vector machines, decision trees, long short-term memory (Long Short-Term Memory, LSTM) models, convolutional neural network (Convolutional Neural Network, CNN) models, deep neural network (Deep Neural Network, DNN) models and reinforcement learning, to obtain the target prediction delay ŷ(X+1) corresponding to the current time slice X+1. Thereafter, ŷ(X+1) is compared with the actual delay y(X+1) corresponding to the current time slice as follows:
delayloss = ŷ(X+1) − y(X+1), when ŷ(X+1) ≥ y(X+1)
stallloss = y(X+1) − ŷ(X+1), when ŷ(X+1) < y(X+1)
Wherein, delayloss indicates that the target prediction delay is too large and delay is wasted; stallloss indicates that the target prediction delay is too small and stalling (blocking) occurs. Based on the technical scheme provided by the embodiment of the disclosure, the target prediction delay ŷ(X+1) can be made as close as possible to the actual delay y(X+1), thereby improving the prediction accuracy of the jitter delay.
Before introducing the solution of the embodiment of the present disclosure, it should be further noted that the delay prediction model constructed based on the embodiment of the present disclosure may be deployed in a server or a client. The server side may be a targeted service program that provides services and resources for clients, and the device running the service program is the server. Correspondingly, the client is a program corresponding to the server side that provides local services for the user. Meanwhile, the client and the server may communicate based on various transfer protocols, such as the HyperText Transfer Protocol (HTTP). The delay prediction model in the embodiment of the disclosure may be integrated into application software supporting functions such as audio and video processing, and the software can be installed in an electronic device; optionally, the electronic device may be a mobile device, a PC terminal, etc. The application software may be software for processing data such as audio, video or audio and video; the specific application software is not described in detail here, as long as the processing of such data can be realized. The application software may also be a specially developed application program integrated in corresponding software or in a corresponding page, so that a user can process the related data through the page integrated in the PC side.
Fig. 5 is a flow chart of an audio/video processing method provided by an embodiment of the present disclosure, where the embodiment of the present disclosure is suitable for predicting a target prediction delay in an audio/video playing process, so as to adjust a jitter buffer based on the predicted target prediction delay.
As shown in fig. 5, the method includes:
s110, acquiring data packet associated information corresponding to the played audio and video data in a plurality of historical time slices in the audio and video playing process.
Wherein, audio and video can be understood as a multimedia data stream consisting of audio and video together. Generally, for the same audio and video, in the process of playing it, a certain correlation exists between the video pictures and the audio content, and the two are played simultaneously. In this embodiment, for an audio and video, corresponding time slices may be determined according to the length of the prediction period or a preset dividing rule, and each time slice corresponds to one or more frames; for example, when the prediction period is one second, each time slice is 1 s, and a plurality of audio and video frames may be contained in that 1 s. On this basis, when the target delay of the audio and video is predicted, the period to be predicted is the current time slice of the audio and video, and the preceding periods are historical time slices of the audio and video relative to the period to be predicted.
For example, when the prediction period is one second, in the audio and video playing process, a plurality of periods before the current moment are taken as historical time slices; optionally, five historical time slices exist in the 5 s before the current moment. In this embodiment, the server or the client may predict the target delay in the current period. It should be noted that, in the actual application process, the length of the time slice may be set or adjusted based on the actual situation, which is not described in detail in the embodiments of the disclosure.
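As a hypothetical illustration of how played data packets might be grouped into 1-second time slices (the function name and the timestamp representation are assumptions, not part of the embodiment):

```python
# Illustrative sketch: group played packets into fixed-length time
# slices by their receive timestamps (seconds). Names are hypothetical.
def split_into_slices(packets, slice_len=1.0):
    """packets: list of (send_ts, recv_ts) pairs.
    Returns a dict mapping slice index -> packets in that slice."""
    slices = {}
    for send_ts, recv_ts in packets:
        idx = int(recv_ts // slice_len)  # which 1-second slice this packet falls in
        slices.setdefault(idx, []).append((send_ts, recv_ts))
    return slices

packets = [(0.0, 0.1), (0.5, 0.9), (1.2, 1.4), (2.0, 2.5)]
slices = split_into_slices(packets)  # three slices: indexes 0, 1, 2
```

Each resulting slice then carries the packet associated information used in the following steps.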
In this embodiment, after determining a plurality of historical time slices, packet associated information corresponding to the played audio/video data in each historical time slice may also be determined. The played audio and video data may be understood as data corresponding to the audio and video frames that have been played. The played audio-video data may be composed of a plurality of played data packets. The played data packet may include audio and video data that is played. Those skilled in the art will appreciate that a data packet is a medium of efficient transmission of data in a computer network comprised of multiple layers of protocols. Data packets are typically organized units of data that encapsulate text, images, executables, and other data, etc., that can be transmitted over a network in a reliable and efficient manner. The packet association information may be understood as information characterizing the transmission of the packet. In general, when a data packet is sent to a corresponding device, a marking process may be performed on the data packet, so that a sending time of the data packet is marked on the data packet, and the marked sending time of the data packet may be used as a sending timestamp of the data packet. Correspondingly, when the corresponding equipment end receives the data packet, the data packet can be subjected to marking processing, so that the data packet receiving time is marked on the data packet, and the marked data packet receiving time can be used as the data packet receiving time stamp of the data packet. In this embodiment, the packet transmission time stamp and the packet reception time stamp may be used as information included in the packet association information. It should be noted that, the packet association information may further include other information associated with the packet, and optionally, the packet association information may include a data type or a packet transmission protocol in the packet.
In practical application, in the audio/video playing process, historical playing data generated in the audio/video playing process can be obtained, and further, delay corresponding to a current time slice can be predicted according to the obtained historical playing data. Specifically, during the audio/video playing process, a plurality of historical time slices before the current time slice can be determined on the basis of the determined current time slice. Further, for each historical time slice, the data packet associated information corresponding to the historical time slice can be determined according to the audio/video data played in the historical time slice.
Optionally, acquiring packet associated information corresponding to played audio and video data in a plurality of historical time slices includes: determining a plurality of historical time slices of a preset number before the current time slice; and for each historical time slice, acquiring data packet associated information corresponding to a plurality of played data packets in the current historical time slice.
The current time slice is understood as a time slice requiring delay prediction. The preset number may be a value preset for defining the number of acquisitions of the historical time slice. The preset number may be any value, alternatively, may be 50, 100, 150, or the like.
It should be noted that the historical time slices before the current time slice may be obtained by starting from the time slice adjacent to and before the current time slice and sequentially taking consecutively arranged time slices until the preset number is reached, and the obtained time slices may be used as the historical time slices. Alternatively, a preset number of time slices can be screened out from a plurality of time slices before the current time slice according to a preset screening rule, and the screened time slices can be used as the historical time slices. Alternatively, the historical time slices before the current time slice may be determined in other ways, which are not specifically limited by the embodiments of the present disclosure.
In practical application, during the audio and video playing process, the current time slice may be determined first, that is, the time slice that needs delay prediction is determined. Further, a preset number of time slices before the current time slice may be determined based on the current time slice, and the acquired time slices may be used as historical time slices. Furthermore, because the audio and video frames corresponding to the historical time slices have already been played, the audio and video data packets contained in the played audio and video data have completed the playing process through the steps of data packet sending and data packet receiving. Thus, for each historical time slice, a plurality of played data packets constituting the audio and video data may be determined based on the played audio and video data within the current historical time slice. Furthermore, the data packet associated information corresponding to the current historical time slice can be determined according to the association information marked in advance on each played data packet. It should be noted that the data packet associated information corresponding to a historical time slice may include the data packet associated information corresponding to all played data packets in that historical time slice. The advantage of this arrangement is that, in the audio and video playing process, the played audio and video data in the historical time slices before the current time slice is used as the prediction basis for the current time slice, thereby realizing the effect of adjusting and optimizing the jitter buffer during online operation of the audio and video.
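The acquisition of the preset number of most recent historical time slices can be sketched as a sliding window (an illustrative sketch; the class name, the preset count value and the timestamp format are assumptions, not part of the embodiment):

```python
from collections import deque

# Hypothetical sliding window over finished time slices: only the most
# recent preset number of slices is retained as prediction basis.
class SliceWindow:
    def __init__(self, preset_count=100):
        # Oldest slices are evicted automatically once the window is full.
        self.slices = deque(maxlen=preset_count)

    def add_slice(self, packet_infos):
        """packet_infos: list of (send_ts, recv_ts) pairs for the
        played packets of one finished time slice."""
        self.slices.append(list(packet_infos))

    def history(self):
        """Packet associated information of the historical time slices."""
        return list(self.slices)

window = SliceWindow(preset_count=3)
window.add_slice([(0.00, 0.04), (0.02, 0.05)])
window.add_slice([(1.00, 1.06)])
window.add_slice([(2.00, 2.03), (2.02, 2.09)])
window.add_slice([(3.00, 3.05)])  # oldest slice is evicted here
```

Using `deque(maxlen=...)` keeps the window at the preset number without manual bookkeeping.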
S120, determining feature data to be used corresponding to each historical time slice based on the data packet association information.
In this embodiment, for each historical time slice, after determining the packet association information corresponding to the current historical time slice, the feature data to be used corresponding to the current historical time slice may be determined based on the packet association information. The feature data to be used is understood to be data which characterize the delay jitter of the data packets in the time slices. As will be appreciated by those skilled in the art, jitter, also known as variation in delay, refers to the variation in delay exhibited by different packets in the same traffic stream. Typically, packets leave the sender at regular intervals, however, the regular intervals are destroyed by the different delays experienced by the packets as they pass through the network, thereby creating jitter. The feature data to be used may be data representing a delay variation of the data packet corresponding to each played data packet included in the historical time slice.
In practical application, for each historical time slice, after obtaining the data packet association information corresponding to the current historical time slice, the data packet delay feature data corresponding to the played data packet can be determined according to the data packet association information of each played data packet included in the current historical time slice, and further, the delay feature data corresponding to the current historical time slice can be determined according to the data packet delay feature data corresponding to the played data packet, so that the feature data to be used corresponding to the current historical time slice can be determined according to the delay feature data corresponding to the current historical time slice. Specifically, for each historical time slice, when determining the data packet delay characteristic data corresponding to each played data packet included in the current historical time slice, the data packet delay characteristic data corresponding to the played data packet may be determined according to the data packet sending time stamp and the data packet receiving time stamp included in the data packet associated information corresponding to each played data packet, and the data packet delay characteristic data corresponding to the played data packet may be determined according to the difference between the data packet receiving time stamp and the data packet sending time stamp. Further, the delay characteristic data corresponding to each historical time slice can be determined according to the delay characteristic data of the data packet corresponding to each played data packet in each historical time slice. Furthermore, the feature data to be used corresponding to each historical time slice can be determined according to the time delay feature data corresponding to each historical time slice and a predetermined feature setting rule. 
The feature setting rule may be any feature selection rule; optionally, it may be a percentile selection rule, a most-out-of-order packet selection criterion, a maximum-delay-jitter packet selection criterion, or the like.
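As a concrete illustration of the per-packet delay computation described above, the following minimal Python sketch derives the data packet delay feature data as the difference between the receiving and sending timestamps; the packet representation and function name are assumptions for illustration, not from this disclosure:

```python
def packet_delays(packets):
    """Per-packet delay feature: receive timestamp minus send timestamp (ms).

    `packets` is a hypothetical list of (send_ts_ms, recv_ts_ms) pairs for
    the played data packets of one historical time slice.
    """
    return [recv_ts - send_ts for send_ts, recv_ts in packets]

# One historical time slice with three played data packets.
slice_packets = [(0, 40), (20, 75), (40, 95)]
delays = packet_delays(slice_packets)  # [40, 55, 55]
```

A feature setting rule (for example, a percentile rule) would then be applied to such per-slice delay lists to produce the feature data to be used.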
S130, determining target prediction delay corresponding to the current time slice based on the feature data to be used and a predetermined delay prediction model.
In this embodiment, after obtaining the feature data to be used corresponding to each historical time slice, the feature data to be used corresponding to each historical time slice may be input into a predetermined delay prediction model, so as to obtain a target prediction delay corresponding to the current time slice.
The delay prediction model may be a pre-trained neural network model used to predict the jitter delay of any time slice. In general, neural network models constructed based on deep learning algorithms have good data processing performance; common examples include the convolutional neural network (Convolutional Neural Network, CNN) model, the deep neural network (Deep Neural Network, DNN) model, the recurrent neural network (Recurrent Neural Network, RNN) model, the Long Short-Term Memory (LSTM) model, and the like. Because the feature data to be used corresponding to the historical time slices form a time series with temporal correlation, a sequence model that handles time series well, such as an RNN or LSTM, can be adopted. In this embodiment, since the computational concurrency of the RNN is better than that of the LSTM, the delay prediction model may preferably be an RNN. Specifically, an RNN is a recurrent neural network that takes sequence data as input, recurses in the evolution direction of the sequence, and whose nodes are connected in a chain. Illustratively, the RNN may include an input layer, a hidden layer, and an output layer: X is a vector representing the values of the input layer; S is a vector representing the values of the hidden layer; O is a vector representing the values of the output layer; U is the weight matrix from the input layer to the hidden layer; V is the weight matrix from the hidden layer to the output layer; and W is the weight matrix applied to the previous hidden-layer value for the current input. In the practical application process, after the network receives the input vector X(t) at time t, the value of the hidden layer is S(t) and the value of the output layer is O(t). The hidden-layer value S(t) of the RNN depends not only on the input X(t) at the current time but also on the hidden-layer value S(t-1) at the previous time.
It should be noted that the delay prediction model may be other neural network models capable of processing time series data, which is not specifically limited in the embodiments of the present disclosure.
In this embodiment, after determining the feature data to be used corresponding to each historical time slice, a model input sequence may be constructed based on the feature data to be used corresponding to each historical time slice; the constructed model input sequence may then be input into the delay prediction model to obtain the target prediction delay corresponding to the current time slice. The target prediction delay may be data characterizing the delay jitter characteristic of the current time slice, that is, delay prediction data that fits, as closely as possible, the actual delay corresponding to the current time slice. For example, with continued reference to fig. 4, in the case where the plurality of historical time slices corresponding to the current time slice are X historical time slices, namely historical time slice 1, historical time slice 2, …, and historical time slice X, where each historical time slice corresponds to one item of feature data to be used, a data feature sequence [x1, x2, …, xX] can be constructed from the feature data to be used corresponding to the respective historical time slices, where each element in the sequence represents the feature data to be used corresponding to the corresponding historical time slice. Further, [x1, x2, …, xX] is input into the delay prediction model to obtain the target prediction delay ŷ(X+1) corresponding to the current time slice X+1.
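The recurrence described above, in which the hidden state depends on both the current input and the previous hidden state, can be sketched with a minimal scalar RNN forward pass. The weights, feature values, and function name below are illustrative assumptions, not parameters from this disclosure:

```python
import math

def rnn_predict(sequence, u, v, w):
    """Scalar RNN forward pass: S(t) = tanh(u*X(t) + w*S(t-1)), O(t) = v*S(t).

    The output after consuming the whole feature sequence is taken as the
    predicted delay for the next time slice.
    """
    s = 0.0  # hidden state S(0)
    o = 0.0
    for x in sequence:  # one feature value per historical time slice
        s = math.tanh(u * x + w * s)
        o = v * s
    return o

# Hypothetical normalised jitter features for X = 4 historical time slices.
features = [0.2, 0.35, 0.3, 0.5]
predicted = rnn_predict(features, u=1.0, v=1.0, w=0.5)
```

In a real deployment the scalars would be weight matrices learned during training; the sketch only shows the chained dependence of each hidden state on its predecessor.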
And S140, processing the audio and video data packet to be played, which is cached in the jitter buffer area, based on the target prediction delay.
The jitter buffer (jitter buffer) is an important module in the audio/video processing flow. In short-video development, setting up a jitter buffer can effectively mitigate problems such as packet loss, out-of-order arrival, and delayed arrival. The jitter buffer can smoothly output data packets or audio/video frames to the decoding module, resisting the impact on playing or rendering under various weak-network conditions, reducing the frequency of stalls in the audio/video content, and improving the user's viewing experience. In general, there are at least two common jitter buffer arrangements: a static jitter buffer implemented in system hardware, and a dynamic jitter buffer implemented in system software. Either arrangement adapts to changes in the network by adjusting the amount of buffering. The audio/video data packet to be played can be understood as an audio/video data packet which is cached and not yet played.
In practical application, after the target prediction delay corresponding to the current time slice is obtained, the data packet transmission parameters of the audio/video data packets to be played included in the jitter buffer can be adjusted according to the target prediction delay and the playing duration corresponding to the audio/video data packets to be played cached in the jitter buffer, so that the adjusted jitter buffer can smoothly output the audio/video data packets to the decoding module, reducing the frequency of stalls and of wasted delay.
Optionally, processing the audio/video data packet to be played, which is buffered in the jitter buffer, based on the target prediction delay includes: if the playing duration corresponding to the audio/video data packets to be played cached in the jitter buffer is greater than the target prediction delay, increasing the delivery rate of the audio/video data packets to be played; if the playing duration corresponding to the audio/video data packets to be played stored in the jitter buffer is less than a preset multiple of the target prediction delay, reducing the delivery rate of the audio/video data packets to be played.
The playing time length can be understood as the audio/video playing time length corresponding to all the audio/video data packets to be played, which are cached in the jitter buffer. The delivery rate is understood to be the transmission speed at which data packets are transmitted in the network. The preset multiple may be any number, and optionally, may be 0.75. The transmission efficiency is understood as the speed at which data packets are transmitted from the source to the destination during any time period.
In practical application, the number of audio/video data packets to be played cached in the jitter buffer can be determined, and the playing duration corresponding to the jitter buffer can then be determined according to that number. The determined playing duration may then be compared with the target prediction delay corresponding to the current time slice. In the case that the playing duration is greater than the target prediction delay, the delivery rate of the audio/video data packets to be played cached in the jitter buffer can be increased, so that the audio/video data packets to be played are transmitted out of the jitter buffer more quickly. The advantage of this arrangement is that, by increasing the delivery rate of the audio/video data packets to be played, the length of the jitter buffer and the amount of buffered data can be reduced, thereby reducing the end-to-end delay of the audio/video data packets.
In the case that the playing duration is less than the product of the target prediction delay and the preset multiple, the delivery rate of the audio/video data packets to be played can be reduced, so that the buffering time of the audio/video data packets to be played in the jitter buffer is increased. The advantage of this arrangement is that, by reducing the delivery rate of the data packets to be played, the reserved length of the jitter buffer can be increased, and stalls can thereby be reduced at the cost of increased delay.
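The two adjustment rules above can be condensed into a small decision function. The 0.75 multiple is the optional value given in the text, while the function and action names are assumptions for illustration:

```python
def adjust_playout(playout_ms, predicted_delay_ms, low_ratio=0.75):
    """Decide how the jitter buffer should pace delivery of buffered packets.

    playout_ms: total playable duration of the buffered to-be-played packets.
    predicted_delay_ms: target prediction delay for the current time slice.
    low_ratio: preset multiple of the target prediction delay.
    """
    if playout_ms > predicted_delay_ms:
        return "speed_up"   # drain the buffer faster to cut end-to-end delay
    if playout_ms < low_ratio * predicted_delay_ms:
        return "slow_down"  # let the buffer grow to absorb jitter
    return "keep"           # buffered duration is within the target band
```

For example, with a target prediction delay of 200 ms, a buffered playing duration of 300 ms would trigger faster delivery, while 100 ms (below 0.75 × 200 = 150 ms) would trigger slower delivery.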
In practical application, while the target prediction delay corresponding to the current time slice is being predicted, the audio/video continues playing, and the audio/video data in the current time slice thus become played audio/video data. The data packet association information corresponding to the played audio/video data in the current time slice can then be obtained, so that the actual delay corresponding to the current time slice can be determined based on that data packet association information. Furthermore, training samples can be constructed based on the actual delay corresponding to the current time slice and the feature data to be used corresponding to the historical time slices, so as to train the delay prediction model. In this way, the algorithm iteration period can be effectively shortened, the process of manually labeling data is avoided, and the audio/video processing efficiency is improved.
Based on this, after determining the target prediction delay, further comprising: determining the actual delay corresponding to the current time slice; and taking the feature data to be used corresponding to the historical time slices and the actual delay corresponding to the current time slices as training samples to update model parameters in the delay prediction model.
In this embodiment, the actual delay may be understood as an actual jitter delay determined based on the audio-video data packets included in the current time slice. It should be noted that, the determining manner of the actual delay corresponding to the current time slice is the same as the determining manner of the feature data to be used corresponding to the historical time slice, and the embodiments of the present disclosure are not described in detail herein.
In practical application, after determining the target prediction delay corresponding to the current time slice, a training sample of the delay prediction model can be constructed based on the current time slice. Specifically, the delay to be processed may be determined according to a packet transmission timestamp and a packet reception timestamp of each audio/video packet in the current time slice, so as to determine a set of delays to be processed based on each delay to be processed. Furthermore, a delay set corresponding to the current time slice may be determined according to each delay to be processed in the delay set to be processed. Further, an actual delay corresponding to the current time slice may be determined from the delay set. Further, the feature data to be used corresponding to the plurality of historical time slices and the actual delay corresponding to the current time slice can be combined together to construct the training sample. Further, model parameters of the delay predictive model may be updated based on the training samples.
According to the technical scheme of this embodiment, data packet association information corresponding to the played audio/video data in a plurality of historical time slices is obtained during audio/video playing; feature data to be used corresponding to each historical time slice is determined based on the data packet association information; the target prediction delay corresponding to the current time slice is determined based on the feature data to be used and a predetermined delay prediction model; and finally the audio/video data packets to be played cached in the jitter buffer are processed based on the target prediction delay. This solves the problems that the jitter delay prediction process is complicated and time-consuming, which affects the algorithm iteration speed and reduces the prediction efficiency and accuracy of the jitter delay. The effect of improving the prediction accuracy and prediction efficiency of jitter delay is thereby achieved, the algorithm iteration period is shortened, the playing quality of the audio/video is further optimized, and the user's viewing experience is improved.
Fig. 6 is a flowchart of an audio/video processing method according to an embodiment of the disclosure. According to the technical scheme of the embodiment, after the data packet association information corresponding to each played data packet in each historical time slice is determined, for each historical time slice, a first delay corresponding to each played data packet in the current historical time slice can be determined according to the data packet association information of each played data packet in the current historical time slice, a first delay set corresponding to the current historical time slice is determined based on each first delay, further, a delay set to be applied corresponding to each historical time slice can be determined according to each first delay set, and then feature data to be used corresponding to each historical time slice can be determined according to each delay set to be used. Reference is made to the description of this example for a specific implementation. The technical features that are the same as or similar to those of the foregoing embodiments are not described herein.
As shown in fig. 6, the method of this embodiment may specifically include:
S210, acquiring data packet association information corresponding to the played audio/video data in a plurality of historical time slices in the audio/video playing process.
S220, for each historical time slice, determining at least one first delay according to the data packet sending time stamp and the data packet receiving time stamp of each played data packet in the current historical time slice, so as to determine a first delay set based on the at least one first delay.
In this embodiment, for each historical time slice, the current historical time slice may include a plurality of played data packets. For each played data packet included in the current historical time slice, in order to determine the transmission duration of the current played data packet, after acquiring the data packet association information corresponding to the current played data packet, the difference between the data packet receiving timestamp and the data packet sending timestamp included in that association information is computed, and this time difference can be used as the first delay. Accordingly, after the first delay corresponding to each played data packet is determined in the above manner, the first delays can be collected together into a set, which may be used as the first delay set.
In practical application, after acquiring the data packet association information corresponding to the played data packets in the historical time slices, for each historical time slice, the data packet sending timestamp and the data packet receiving timestamp corresponding to each played data packet can be determined according to the data packet association information corresponding to each played data packet in the current historical time slice. Further, for each played packet in the current historical time slice, a difference between a packet receiving timestamp and a packet sending timestamp corresponding to the current played packet may be determined, and the difference may be used as the first delay corresponding to the current played packet. Further, after the first delay corresponding to each played data packet in the current historical time slice is obtained, each first delay can be collected together, and then the first delay collection corresponding to the current historical time slice can be obtained. Further, a first set of delays corresponding to each historical time slice may be determined.
S230, determining a delay set to be used corresponding to each historical time slice based on at least one first delay in each first delay set.
In this embodiment, the to-be-used delay set may be understood as a delay set obtained by processing at least one first delay included in the first delay set corresponding to the historical time slice.
In practical application, after determining the first delay set corresponding to each historical time slice, in order to determine the transmission jitter set corresponding to each historical time slice, the delay jitter set corresponding to the historical time slice, that is, the delay set to be used, may be determined according to the at least one first delay included in the first delay set corresponding to that historical time slice. Specifically, for each historical time slice, the played data packet with the fastest transmission in the current historical time slice, that is, the played data packet with the smallest first delay value, may be determined. The time offset of each played data packet in the current historical time slice relative to that fastest packet, that is, the difference between the first delay corresponding to each played data packet and the minimum value in the first delay set, can then be determined. In this way, a delay difference corresponding to each played data packet is obtained, and this delay difference may be used as the transmission jitter corresponding to that played data packet. The delay differences corresponding to the played data packets are then collected together to obtain the delay set to be used corresponding to the current historical time slice.
Optionally, determining a set of delays to be used corresponding to each historical time slice based on at least one first delay in each first set of delays includes: and for each first delay set, acquiring the minimum value of at least one first delay in the current first delay set, and respectively determining the difference value between each first delay and the minimum value to obtain a delay set to be used corresponding to the current first delay set.
In this embodiment, the minimum value of at least one first delay in the first delay set may be the first delay corresponding to the played packet with the highest transmission speed among the plurality of played packets corresponding to the first delay set.
In practical application, for each first delay set, a minimum value of at least one first delay in the current first delay set may be obtained, and further, for at least one first delay in the current first delay set, a difference value between each first delay and the minimum value may be determined, where the difference value may be used as a transmission jitter corresponding to a corresponding played packet. And after determining the transmission jitter corresponding to each played data packet, obtaining a delay set to be used corresponding to the current first delay set. The delay set to be used may include a difference between a first delay corresponding to each played packet and a minimum value in the first delay set. The advantages of this arrangement are that: the transmission jitter of each played data packet in the current historical time slice is effectively determined, and further, the transmission jitter set corresponding to the historical time slice can be accurately constructed.
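The jitter computation just described, in which each first delay is offset by the minimum of its set, can be sketched as follows; the function name is an assumption for illustration:

```python
def delay_set_to_use(first_delays):
    """Transmission jitter per played packet: each first delay minus the
    minimum first delay of the slice (the fastest packet's delay)."""
    fastest = min(first_delays)
    return [d - fastest for d in first_delays]

# First delay set for one historical time slice (illustrative values, ms).
jitters = delay_set_to_use([40, 55, 70, 45])  # [0, 15, 30, 5]
```

The fastest packet always maps to zero jitter, so the resulting set directly measures how much slower every other packet was.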
And S240, determining feature data to be used corresponding to each historical time slice based on each delay set to be used and a preset percentile.
The preset percentile may be understood as a percentile set in advance for screening each delay to be used in the delay set to be used. It will be appreciated by those skilled in the art that the percentile is a term in statistics. Specifically, if a set of data is ordered from small to large and the corresponding cumulative percentiles are calculated, the data value corresponding to a given percentage is referred to as the percentile of that percentage. That is, for a set of N observations arranged in order of numerical size, the value at the p% position is referred to as the p-th percentile. In this embodiment, the preset percentile may be any percentile; optionally, it may be PCT95 and/or PCT99. For example, assuming that the preset percentile is PCT95 and the number of played data packets included in the historical time slice is 100, that is, the number of delays to be used included in the delay set to be used is 100, the 100 delays to be used are arranged from small to large by value, and the value at the 95th position is the delay to be used corresponding to PCT95.
In practical application, for each delay set to be used, the delays to be used in the current delay set to be used can be arranged from small to large according to the value, and then the delays to be used meeting the requirements can be selected from the arranged delay sets to be used according to the preset percentile. Further, the feature data to be used corresponding to the current historical time slice can be determined according to the selected delay to be used.
It should be noted that the preset percentile may be one or more. In the case of a plurality of preset percentiles, a desired delay to be used may be selected from the set of aligned delays to be used on a per percentile basis, respectively, that is, the number of values selected from the set of delays to be used may be one or more. Illustratively, assuming that the preset percentiles are PCT95 and PCT99, the number of played data packets included in the historical time slice is 100, i.e., the number of delays to be used included in the set of delays to be used is 100. And arranging the 100 delays to be used according to the value from small to large, wherein the value of the 95 th bit and the value of the 99 th bit are the feature data to be used corresponding to the historical time slice.
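The nearest-rank percentile selection in the PCT95/PCT99 examples above can be sketched as follows; the function name and the 1-based rank convention are assumptions chosen to match the "95th value of 100" example:

```python
import math

def percentile_features(delays_to_use, percentiles=(95, 99)):
    """Select the p-th percentile values from a delay set to be used,
    using the 1-based nearest-rank definition."""
    ordered = sorted(delays_to_use)
    n = len(ordered)
    feats = []
    for p in percentiles:
        rank = max(1, math.ceil(p * n / 100))  # 1-based nearest rank
        feats.append(ordered[rank - 1])
    return feats

# With 100 values 1..100, PCT95 picks the 95th value and PCT99 the 99th.
features = percentile_features(list(range(1, 101)))  # [95, 99]
```

With several preset percentiles, the function returns one feature value per percentile, which matches the case where multiple values are selected from each delay set to be used.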
It should be noted that, when determining the feature data to be used, information such as packet loss may also be used as feature data. Specifically, the played data packet with the largest degree of out-of-order arrival, or the played data packet with the largest jitter, may be determined according to the delays to be used included in the current delay set to be used, and the delays to be used corresponding to those played data packets may then be taken as feature data to be used.
S250, determining target prediction delay corresponding to the current time slice based on the feature data to be used and a predetermined delay prediction model.
And S260, processing the audio and video data packet to be played, which is cached in the jitter buffer area, based on the target prediction delay.
According to the technical scheme of this embodiment, data packet association information corresponding to the played audio/video data in a plurality of historical time slices is obtained during audio/video playing; for each historical time slice, a first delay is determined for each played data packet according to its data packet sending timestamp and data packet receiving timestamp, so that a first delay set is determined based on the first delays; a delay set to be used corresponding to each historical time slice is determined based on the at least one first delay in each first delay set; feature data to be used corresponding to each historical time slice is determined based on each delay set to be used and a preset percentile; the target prediction delay corresponding to the current time slice is determined based on the feature data to be used and a predetermined delay prediction model; and finally the audio/video data packets to be played cached in the jitter buffer are processed based on the target prediction delay. The effect of effectively reducing the jitter feature dimensionality is thereby achieved, so that jitter delay can be predicted accurately even when terminal computing power is limited.
Fig. 7 is a flowchart of an audio/video processing method according to an embodiment of the disclosure. According to the technical scheme of the embodiment, on the basis of the embodiment, before the feature data to be used is processed based on the delay prediction model, the training sample data comprising a sample delay set formed by a plurality of historical data packets and actual delay corresponding to the prediction time can be obtained, and further, the delay prediction model can be trained based on the plurality of training sample data, so that the delay prediction model is obtained. Reference is made to the description of this example for a specific implementation. The technical features that are the same as or similar to those of the foregoing embodiments are not described herein.
As shown in fig. 7, the method of this embodiment may specifically include:
S310, training to obtain a delay prediction model.
In this embodiment, training sample data for training the delay prediction model may be constructed, and further, the delay prediction model may be trained based on the training sample data to obtain a trained delay prediction model.
Optionally, training to obtain a delay prediction model includes: acquiring a plurality of training sample data; and training a delay prediction model to be trained based on the plurality of training sample data to obtain the delay prediction model.
The training sample data comprises a sample delay set formed by a plurality of historical data packets and actual delay corresponding to the predicted time. The historical data packet may be understood as a played audio/video data packet included in the historical time. The sample delay set may be understood as a set constructed based on feature data to be used corresponding to the historical time, or may be understood as a set constructed based on jitter delay data corresponding to the historical data packet. The predicted time instant may be understood as the time instant at which the delay prediction is to be performed. The actual delay may be an actual jitter delay determined based on the audio-video data packets included in the predicted time instant.
In practical applications, a plurality of training sample data need to be built in advance before the delay prediction model to be trained is trained, so that the model can be trained based on the training sample data. In order to improve the accuracy of the model, the training sample data can be constructed to be as plentiful and as rich as possible. Specifically, the historical delays corresponding to the historical data packets at a plurality of historical moments can be obtained; a sample delay set can then be constructed from the historical delays corresponding to the plurality of historical data packets, and the actual delay corresponding to the predicted moment can be determined, so that training sample data can be constructed based on the sample delay set and the actual delay.
Further, after obtaining a plurality of training sample data, the training sample data can be input into a delay prediction model to be trained to obtain a prediction delay corresponding to the prediction time, and further, the delay prediction model to be trained can be trained based on the prediction delay and the actual delay to obtain the delay prediction model.
Optionally, training the delay prediction model to be trained based on the plurality of training sample data to obtain the delay prediction model includes: for each piece of training sample data, inputting the feature data in the sample delay set of the current training sample data into the delay prediction model, and predicting the model prediction delay corresponding to the current training sample data; determining a target loss value based on the model prediction delay and the actual delay in the current training sample data; and determining the delay prediction model based on the target loss value.
The feature data may be understood as delay feature data determined based on each historical delay included in the sample delay set and a preset feature screening rule. Model predictive latency can be understood as the result of latency prediction output after feature data is input to the latency predictive model. The target loss value may be understood as the difference between the model predicted delay and the actual delay.
In practical application, for each training sample data, feature data in a sample delay set in the current training sample data can be input into a delay prediction model, so that model prediction delay corresponding to a prediction time in the current training sample data can be obtained. Further, the model predicted delay may be compared to the actual delay in the current training sample data to determine a target loss value. The mode of determining the target loss value when the model predicted delay is larger than the actual delay is different from the mode of determining the target loss value when the model predicted delay is smaller than the actual delay, and the two modes can be described separately.
Optionally, in the case that the model prediction delay is greater than the actual delay, determining the target loss value based on the model prediction delay and the actual delay in the current training sample data includes: if the model predicted delay is greater than the actual delay and the model predicted delay is greater than a preset delay threshold, determining a target loss value based on the first objective function.
The preset delay threshold may be understood as a predetermined maximum value of the delay. The preset delay threshold may be any value; optionally, it may be 400 milliseconds or the like. The purpose of setting the preset delay threshold to 400 milliseconds is as follows: in general, once the delay exceeds 400 milliseconds, the user experience deteriorates; therefore, when setting the objective function used to determine the target loss value, stalls should be avoided as far as possible while the delay is below 400 milliseconds, and once the delay exceeds 400 milliseconds, the balance between delay and stalls needs to be considered comprehensively.
In this embodiment, in the case where the model predicted delay is greater than the actual delay, the difference between the model predicted delay and the actual delay is greater than zero, which may be a delay loss. The first objective function may be understood as a function preset for determining the objective loss value in case the delay loss is larger than zero and the model predictive delay is larger than a preset delay threshold.
In practical application, under the condition that the model prediction delay is larger than the actual delay and the model prediction delay is larger than the preset delay threshold, the difference between the model prediction delay and the preset delay threshold can be determined to obtain a first value to be processed, then the first value to be processed can be multiplied by a preset delay coefficient to obtain a second value to be processed, then the second value to be processed can be added with the delay loss value, and accordingly the value obtained after the addition can be used as a target loss value. Wherein the delay loss value may be determined by determining a difference between the model predicted delay and the actual delay. The advantages of this arrangement are that: under the condition that the model prediction delay is larger than the actual delay and the model prediction delay is larger than a preset delay threshold, the obtained target loss value can realize the effect of balancing delay and jamming.
By way of example, the first objective function may be expressed by the following formulas:

Loss = delayloss + k*(ŷ_(x+1) - delay_thresh)

delayloss = ŷ_(x+1) - y_(x+1)

where Loss may represent the target loss value; delayloss may represent the delay loss value; k may represent the preset delay coefficient, optionally k = 1.25; ŷ_(x+1) may represent the model prediction delay; y_(x+1) may represent the actual delay; and delay_thresh may represent the preset delay threshold.
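By way of example, the first objective function may be sketched in Python as follows (the function and variable names are illustrative and not part of the disclosure; the coefficient and threshold defaults use the optional values given above):

```python
def loss_delay_exceeds(pred_delay_ms, actual_delay_ms,
                       delay_thresh_ms=400.0, k=1.25):
    """First objective function: applies when the model prediction delay
    exceeds both the actual delay and the preset delay threshold."""
    delay_loss = pred_delay_ms - actual_delay_ms      # > 0 in this branch
    return delay_loss + k * (pred_delay_ms - delay_thresh_ms)
```

For instance, a prediction of 500 ms against an actual delay of 450 ms contributes a delay loss of 50 plus a penalty of 1.25 times the 100 ms excess over the threshold.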
Optionally, when the model prediction delay is smaller than the actual delay, determining the target loss value based on the model prediction delay and the actual delay in the current training sample data includes: if the model prediction delay is smaller than the actual delay, determining the difference between the model prediction delay and the actual delay to obtain a first value; and determining the target loss value based on the first value and a preset stall delay threshold.
In this embodiment, when the model prediction delay is smaller than the actual delay, the difference obtained by subtracting the model prediction delay from the actual delay is greater than zero and may serve as the stall loss value, that is, the first value. The preset stall delay threshold may be understood as a predetermined maximum stall delay. The preset stall delay threshold may be any value, optionally 200 ms or 250 ms. In practice, the preset stall delay threshold may be determined based on the audio/video playback requirement: when the requirement is biased toward smoothness, it may be set to a relatively high value; when the requirement is biased toward low latency, it may be set to a relatively low value.
In practical application, when the model prediction delay is smaller than the actual delay, the difference between the two may be determined to obtain the first value. The first value is then compared with the preset stall delay threshold, and the target loss value is determined based on the corresponding objective function depending on whether the first value is less than or equal to, or greater than, the preset stall delay threshold. The advantage of this arrangement is that, when the model prediction delay is smaller than the actual delay, the corresponding target loss value can be determined accurately.
Optionally, determining the target loss value based on the first value and the preset stall delay threshold includes: if the first value is less than or equal to the preset stall delay threshold, determining the target loss value based on a second objective function; and if the first value is greater than the preset stall delay threshold, determining the target loss value based on a third objective function.
The second objective function may be understood as a function for determining the target loss value when the first value is less than or equal to the preset stall delay threshold. The third objective function may be understood as a function for determining the target loss value when the first value is greater than the preset stall delay threshold.
In practical application, when the first value is less than or equal to the preset stall delay threshold, the first value may be multiplied by a first preset stall coefficient, and the product may be taken as the target loss value.
By way of example, the second objective function may be expressed based on the following formula:
Loss=c*stallloss
where Loss may represent the target loss value; c may represent the first preset stall coefficient, optionally c = 7.5; and stallloss may represent the first value.
In practical application, when the first value is greater than the preset stall delay threshold, the first preset stall coefficient may be multiplied by the preset stall delay threshold to obtain a third value to be processed; the difference between the first value and the preset stall delay threshold may be determined to obtain a fourth value to be processed; the fourth value to be processed may be multiplied by a second preset stall coefficient to obtain a fifth value to be processed; and the third and fifth values to be processed may be added, the sum being taken as the target loss value.
By way of example, the third objective function may be expressed based on the following formula:
Loss=c*stall_thresh+d*(stallloss-stall_thresh)
where Loss may represent the target loss value; c may represent the first preset stall coefficient, optionally c = 7.5; stall_thresh may represent the preset stall delay threshold; d may represent the second preset stall coefficient, optionally d = 2.0; and stallloss may represent the first value.
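By way of example, the second and third objective functions may be sketched together in Python as follows (names are illustrative; the defaults use the optional coefficients c = 7.5 and d = 2.0 given above):

```python
def loss_stall(pred_delay_ms, actual_delay_ms,
               stall_thresh_ms=200.0, c=7.5, d=2.0):
    """Stall-side loss: applies when the model prediction delay is smaller
    than the actual delay."""
    stall_loss = actual_delay_ms - pred_delay_ms      # the first value, > 0
    if stall_loss <= stall_thresh_ms:
        # second objective function: linear penalty below the threshold
        return c * stall_loss
    # third objective function: penalize the excess above the threshold
    # with a smaller slope, so very large stall losses do not dominate
    return c * stall_thresh_ms + d * (stall_loss - stall_thresh_ms)
```

For instance, a 150 ms stall loss stays in the second branch, while a 300 ms stall loss crosses the 200 ms threshold and is split across the two slopes.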
It should be noted that determining the target loss value in the above manner has the following advantage: the loss adapts to different conditions, which improves the training effect of the delay prediction model and enables the trained delay prediction model to predict the target delay under different conditions.
Further, after the target loss value is determined, the model parameters in the delay prediction model may be corrected according to the target loss value. Specifically, convergence of the loss function of the delay prediction model may be taken as the training target, for example whether the training error is smaller than a preset error, whether the error trend has stabilized, or whether the current iteration count equals a preset count. If a convergence condition is reached, for example the training error of the loss function is smaller than the preset error or the error trend has stabilized, training of the delay prediction model is complete and the iterative training may be stopped. If the convergence condition is not yet met, further training sample data may be obtained to continue training the delay prediction model until the training error of the loss function falls within the preset range. Once the training error of the loss function converges, the trained model may be used as the final delay prediction model; that is, after the feature data to be used corresponding to a plurality of historical time slices are input into the delay prediction model, the target prediction delay corresponding to the current time slice can be obtained accurately.
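By way of example, the convergence-driven training loop described above may be sketched as follows. The `model_step` callable is a hypothetical interface, since the disclosure does not fix the model architecture or update rule:

```python
def train_until_converged(model_step, samples, max_iters=500, err_eps=1e-6):
    """Iterate until the change in total loss falls below a preset error
    (the error trend has stabilized) or the iteration budget is exhausted.
    model_step performs one parameter update over the training samples and
    returns the current total loss (hypothetical interface)."""
    prev = float("inf")
    cur = prev
    for _ in range(max_iters):
        cur = model_step(samples)
        if abs(prev - cur) < err_eps:  # error trend has stabilized
            break
        prev = cur
    return cur
```

Any loss that shrinks steadily, for example one halved per update, will terminate this loop well before the iteration budget.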
It should be noted that online operation data, that is, audio and video data played in real time, may be obtained in real time to predict the target delay; after the prediction is completed and the actual delay is obtained, this online operation data may be used as training samples to update the model.
S320, acquiring data packet associated information corresponding to the played audio and video data in a plurality of historical time slices in the audio and video playing process.
S330, determining feature data to be used corresponding to each historical time slice based on the data packet association information.
S340, determining target prediction delay corresponding to the current time slice based on the feature data to be used and a predetermined delay prediction model.
S350, processing the audio and video data packet to be played, which is cached in the jitter buffer area, based on the target prediction delay.
According to the technical scheme of this embodiment, a delay prediction model is first obtained through training. Then, during audio/video playback, the data packet associated information corresponding to the played audio/video data in a plurality of historical time slices is obtained; the feature data to be used corresponding to each historical time slice is determined based on that information; the target prediction delay corresponding to the current time slice is determined based on the feature data to be used and the predetermined delay prediction model; and finally the audio/video data packets to be played that are cached in the jitter buffer are processed based on the target prediction delay. By obtaining a large amount of online reported data and training the delay prediction model on it, the model adaptively learns the algorithm characteristics, which shortens the algorithm iteration period and improves the prediction accuracy and efficiency of jitter delay.
Fig. 8 is a schematic structural diagram of an audio/video processing apparatus according to an embodiment of the present disclosure, as shown in fig. 8, where the apparatus includes: the association information acquisition module 410, the feature data determination module 420, the delay prediction module 430, and the packet processing module 440.
The association information obtaining module 410 is configured to obtain, during the process of playing the audio and video, corresponding packet association information of the played audio and video data in a plurality of historical time slices; a feature data determining module 420, configured to determine feature data to be used corresponding to each historical time slice based on the packet association information; the delay prediction module 430 is configured to determine a target prediction delay corresponding to the current time slice based on the feature data to be used and a predetermined delay prediction model; and the data packet processing module 440 is configured to process the audio/video data packet to be played, which is cached in the jitter buffer, based on the target prediction delay.
Based on the above aspects, the association information obtaining module 410 includes: a time slice determining unit and an associated information acquiring unit. A time slice determining unit for determining a plurality of historical time slices of a preset number before the current time slice; the related information acquisition unit is used for acquiring data packet related information corresponding to a plurality of played data packets in the current historical time slice for each historical time slice; the played audio and video data is composed of a plurality of played data packets, and the data packet associated information comprises a data packet sending time stamp and a data packet receiving time stamp.
On the basis of the above aspects, the feature data determining module 420 includes: the device comprises a delay determining unit, a delay set determining unit and a characteristic data determining unit. The delay determining unit is used for determining at least one first delay according to the data packet sending time stamp and the data packet receiving time stamp of each played data packet in the current historical time slice for each historical time slice so as to determine a first delay set based on the at least one first delay; a delay set determining unit, configured to determine a delay set to be used corresponding to each historical time slice based on at least one first delay in each first delay set; and the characteristic data determining unit is used for determining the characteristic data to be used corresponding to each historical time slice based on each delay set to be used and the preset percentile.
On the basis of the above technical solutions, the delay set determining unit is specifically configured to obtain, for each first delay set, a minimum value of at least one first delay in a current first delay set, and determine a difference value between each first delay and the minimum value, so as to obtain a delay set to be used corresponding to the current first delay set.
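By way of example, the minimum-subtraction step performed by the delay set determining unit and the percentile summarization performed by the feature data determining unit may be sketched as follows (the function name and the percentile choices are illustrative, not fixed by the disclosure):

```python
def features_from_delays(first_delays, percentiles=(50, 90)):
    """Subtract the minimum first delay from every first delay in the set
    to obtain the delay set to be used, then summarize that set at preset
    percentiles to form the feature data to be used."""
    d = sorted(first_delays)
    base = d[0]                               # minimum of the first delay set
    to_use = [x - base for x in d]            # delay set to be used
    feats = []
    for p in percentiles:                     # linear-interpolated percentile
        idx = (len(to_use) - 1) * p / 100.0
        lo = int(idx)
        hi = min(lo + 1, len(to_use) - 1)
        feats.append(to_use[lo] + (idx - lo) * (to_use[hi] - to_use[lo]))
    return feats
```

Subtracting the per-slice minimum removes the fixed propagation component of the delay, so the percentiles describe only the jitter within the time slice.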
Based on the above aspects, the packet processing module 440 includes: a transport rate increasing unit and a transport rate decreasing unit.
The transmission rate increasing unit is used for increasing the transmission rate of the audio/video data packet to be played if the playing time length corresponding to the audio/video data packet to be played, which is cached in the jitter buffer, is longer than the target prediction delay;
and the transmission rate reducing unit is configured to reduce the transmission rate of the audio/video data packets to be played if the playing duration corresponding to the audio/video data packets to be played stored in the jitter buffer is smaller than a preset multiple of the target prediction delay.
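By way of example, the control logic of the two units above may be sketched as follows. The value of the preset multiple and the return labels are illustrative; the disclosure does not fix them:

```python
def adjust_playback(buffered_ms, target_delay_ms, low_multiple=0.5):
    """Decide how to pace the jitter buffer: drain it faster when the
    buffered playing duration exceeds the target prediction delay, slow
    down when it falls below a preset multiple of that delay."""
    if buffered_ms > target_delay_ms:
        return "speed_up"      # drain the jitter buffer faster
    if buffered_ms < low_multiple * target_delay_ms:
        return "slow_down"     # accumulate more data to absorb jitter
    return "hold"              # buffered duration is within the target band
```

Keeping the buffered duration near the predicted jitter delay trades a small amount of latency for protection against stalls when packet arrival is bursty.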
On the basis of the technical schemes, the device further comprises: and the delay prediction model training module is used for training to obtain the delay prediction model.
The delay prediction model training module comprises: the sample data acquisition sub-module and the delay prediction model training sub-module;
the sample data acquisition sub-module is used for acquiring a plurality of training sample data; the training sample data comprises a sample delay set formed by a plurality of historical data packets and an actual delay corresponding to a predicted time; and the delay prediction model training submodule is used for training the delay prediction model based on the plurality of training sample data to obtain the delay prediction model.
On the basis of the above technical solutions, the delay prediction model training submodule includes: the model prediction delay determining unit, the loss value determining unit and the delay prediction model determining unit;
the model prediction delay determining unit is used for inputting characteristic data in a delay set in the current training sample data into the delay prediction model for each training sample data, and predicting to obtain model prediction delay corresponding to the current training sample data; a loss value determining unit, configured to determine a target loss value based on the model prediction delay and an actual delay in the current training sample data; a delay prediction model determining unit for determining the delay prediction model based on the target loss value
On the basis of the above technical solutions, the loss value determining unit is specifically configured to determine, based on a first objective function, a target loss value if the model prediction delay is greater than the actual delay and the model prediction delay is greater than a preset delay threshold.
On the basis of the above technical solutions, the loss value determining unit includes: a delay difference determining subunit and a loss value determining subunit. The delay difference determining subunit is configured to, if the model prediction delay is smaller than the actual delay, determine the difference between the model prediction delay and the actual delay to obtain a first value; and the loss value determining subunit is configured to determine the target loss value based on the first value and a preset stall delay threshold.
On the basis of the above technical solutions, the loss value determining subunit is specifically configured to determine the target loss value based on a second objective function if the first value is less than or equal to the preset stall delay threshold, and to determine the target loss value based on a third objective function if the first value is greater than the preset stall delay threshold.
On the basis of the technical schemes, the device further comprises: the actual delay determining module and the model parameter updating module.
The actual delay determining module is configured to determine the actual delay corresponding to the current time slice after the target delay is determined; and the model parameter updating module is configured to update the model parameters in the delay prediction model by using the feature data to be used corresponding to the plurality of historical time slices and the actual delay corresponding to the current time slice as training samples.
According to the technical scheme of this embodiment, the data packet associated information corresponding to the played audio/video data in a plurality of historical time slices is obtained during audio/video playback; the feature data to be used corresponding to each historical time slice is determined based on that information; the target prediction delay corresponding to the current time slice is determined based on the feature data to be used and a predetermined delay prediction model; and finally the audio/video data packets to be played that are cached in the jitter buffer are processed based on the target prediction delay. This solves the problems that the jitter delay prediction process is cumbersome and has a long consumption period, which slows algorithm iteration and reduces the prediction efficiency and accuracy of jitter delay. The scheme thereby improves the prediction accuracy and efficiency of jitter delay, shortens the algorithm iteration period, optimizes audio/video playback quality, and improves the user's viewing experience.
The audio and video processing device provided by the embodiment of the disclosure can execute the audio and video processing method provided by any embodiment of the disclosure, and has the corresponding functional modules and beneficial effects of the execution method.
It should be noted that each unit and module included in the above apparatus are only divided according to the functional logic, but not limited to the above division, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for convenience of distinguishing from each other, and are not used to limit the protection scope of the embodiments of the present disclosure.
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure. Referring now to fig. 9, a schematic diagram of an electronic device (e.g., a terminal device or server in fig. 9) 500 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 9 is merely an example, and should not impose any limitations on the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 9, the electronic device 500 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 501, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the electronic apparatus 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
In general, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 507 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 508 including, for example, magnetic tape, hard disk, etc.; and communication means 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 9 shows an electronic device 500 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or from the storage means 508, or from the ROM 502. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 501.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
The electronic device provided by the embodiment of the present disclosure and the audio/video processing method provided by the foregoing embodiment belong to the same inventive concept, and technical details not described in detail in the present embodiment may be referred to the foregoing embodiment, and the present embodiment has the same beneficial effects as the foregoing embodiment.
The embodiment of the present disclosure provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the audio/video processing method provided by the above embodiment.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to:
acquiring data packet associated information corresponding to played audio and video data in a plurality of historical time slices in the process of playing the audio and video;
determining feature data to be used corresponding to each historical time slice based on the data packet association information;
determining a target prediction delay corresponding to the current time slice based on the feature data to be used and a predetermined delay prediction model;
And processing the audio and video data packet to be played, which is cached in the jitter buffer area, based on the target prediction delay.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the specific combinations of features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example technical solutions formed by substituting the features described above with (but not limited to) technical features having similar functions disclosed in the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (14)

1. An audio/video processing method, comprising:
acquiring data packet associated information corresponding to played audio and video data in a plurality of historical time slices in the process of playing the audio and video;
determining feature data to be used corresponding to each historical time slice based on the data packet association information;
determining a target prediction delay corresponding to the current time slice based on the feature data to be used and a predetermined delay prediction model;
and processing the audio and video data packet to be played, which is cached in the jitter buffer area, based on the target prediction delay.
2. The method of claim 1, wherein the obtaining packet association information corresponding to the played audio/video data in the plurality of historical time slices comprises:
determining a plurality of historical time slices of a preset number before the current time slice;
For each historical time slice, acquiring data packet associated information corresponding to a plurality of played data packets in the current historical time slice;
the played audio and video data is composed of a plurality of played data packets, and the data packet associated information comprises a data packet sending time stamp and a data packet receiving time stamp.
3. The method of claim 1, wherein determining feature data to be used corresponding to each historical time slice based on the packet association information comprises:
for each historical time slice, determining at least one first delay according to a data packet sending time stamp and a data packet receiving time stamp of each played data packet in the current historical time slice, so as to determine a first delay set based on the at least one first delay;
determining a set of delays to be used corresponding to each historical time slice based on at least one first delay in each first set of delays;
and determining feature data to be used corresponding to each historical time slice based on each delay set to be used and a preset percentile.
4. A method according to claim 3, wherein said determining a set of delays to be used corresponding to each historical time slice based on at least one first delay within each first set of delays comprises:
And for each first time delay set, acquiring the minimum value of at least one first time delay in the current first time delay set, and respectively determining the difference value between each first time delay and the minimum value to obtain a time delay set to be used corresponding to the current first time delay set.
5. The method according to claim 1, wherein the processing the audio and video data packets to be played buffered in the jitter buffer based on the target prediction delay comprises:
if the playback duration corresponding to the audio and video data packets to be played buffered in the jitter buffer is greater than the target prediction delay, increasing the delivery rate of the audio and video data packets to be played;
and if the playback duration corresponding to the audio and video data packets to be played buffered in the jitter buffer is less than a preset multiple of the target prediction delay, reducing the delivery rate of the audio and video data packets to be played.
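The buffer-control rule in claim 5 can be sketched as a three-way decision. All names and the 0.5 lower multiple below are assumptions; the claim only states "a preset multiple of the target prediction delay":

```python
def adjust_playout(buffered_ms, target_delay_ms, low_multiple=0.5):
    """Hypothetical sketch of claim 5's jitter-buffer control.

    buffered_ms: playback duration currently queued in the jitter buffer.
    target_delay_ms: the model's target prediction delay.
    Returns 'speed_up', 'slow_down', or 'hold'.
    """
    if buffered_ms > target_delay_ms:
        # More queued than the predicted jitter requires: drain
        # faster to cut end-to-end latency.
        return "speed_up"
    if buffered_ms < low_multiple * target_delay_ms:
        # Buffer close to underrun: deliver slower to rebuild
        # enough headroom to absorb jitter without stalls.
        return "slow_down"
    return "hold"
```

In the intermediate band the buffer is left alone, so playback speed only changes when the predicted delay and the buffered duration genuinely diverge.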
6. The method as recited in claim 1, further comprising:
training to obtain the delay prediction model;
the training to obtain the delay prediction model includes:
acquiring a plurality of training sample data, wherein the training sample data comprises a sample delay set formed from a plurality of historical data packets and an actual delay corresponding to a prediction moment;
and training the delay prediction model based on the plurality of training sample data to obtain the trained delay prediction model.
7. The method of claim 6, wherein training the delay prediction model based on the plurality of training sample data results in the delay prediction model, comprising:
for each training sample data, inputting the feature data in the delay set of the current training sample data into the delay prediction model, and predicting to obtain a model prediction delay corresponding to the current training sample data;
determining a target loss value based on the model predictive delay and an actual delay in the current training sample data;
and determining the delay prediction model based on the target loss value.
8. The method of claim 7, wherein determining a target loss value based on the model predicted delay and an actual delay in current training sample data comprises:
and if the model prediction delay is greater than the actual delay and the model prediction delay is greater than a preset delay threshold, determining a target loss value based on a first objective function.
9. The method of claim 7, wherein determining a target loss value based on the model prediction delay and the actual delay in the current training sample data comprises:
if the model prediction delay is less than the actual delay, determining the difference between the model prediction delay and the actual delay to obtain a first value;
and determining the target loss value based on the first value and a preset stutter delay threshold.
10. The method of claim 9, wherein the determining the target loss value based on the first value and the preset stutter delay threshold comprises:
if the first value is less than or equal to the preset stutter delay threshold, determining the target loss value based on a second objective function;
and if the first value is greater than the preset stutter delay threshold, determining the target loss value based on a third objective function.
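Claims 8-10 define an asymmetric, piecewise loss: over-prediction beyond a delay threshold and under-prediction are penalized by different objective functions, with under-prediction split by how far short the model fell. The claims do not specify the three objective functions, so the quadratic/linear forms and both threshold values below are assumptions, chosen only to illustrate penalizing under-prediction (stutter risk) more steeply than over-prediction (added latency):

```python
def target_loss(pred, actual, delay_threshold=400.0, stutter_threshold=80.0):
    """Illustrative piecewise loss matching the case split in claims 8-10."""
    if pred > actual and pred > delay_threshold:
        # Claim 8: over-prediction beyond the preset delay threshold
        # adds needless latency -> first objective function (assumed
        # quadratic here).
        return (pred - actual) ** 2
    if pred < actual:
        gap = actual - pred  # claim 9's "first value"
        if gap <= stutter_threshold:
            # Claim 10: small shortfall, mild penalty (second
            # objective function, assumed linear).
            return gap
        # Large shortfall likely causes a visible stall -> third
        # objective function with a steeper (assumed 2x) slope.
        return stutter_threshold + 2.0 * (gap - stutter_threshold)
    return 0.0
```

Any gradient-based trainer can minimize such a loss; the point of the case split is that the optimum is biased toward slight over-prediction, trading a little latency for fewer stalls.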
11. The method of claim 1, further comprising, after determining the target prediction delay:
determining the actual delay corresponding to the current time slice;
and taking the feature data to be used corresponding to the historical time slices and the actual delay corresponding to the current time slice as a training sample to update the model parameters of the delay prediction model.
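Claim 11 closes the loop online: predict the current slice's delay from recent history, then fold the observed actual delay back in as a fresh training sample. A minimal sketch of that loop, where the window size and the running-mean "model" are placeholders for the trained delay prediction model, which the claims leave unspecified:

```python
from collections import deque

class OnlineDelayPredictor:
    """Hypothetical sketch of claim 11's predict-then-update cycle."""

    def __init__(self, window=8):
        # Sliding window of per-slice delay observations (feature history).
        self.history = deque(maxlen=window)

    def predict(self):
        # Stand-in for the delay prediction model's forward pass:
        # a running mean over the retained window.
        if not self.history:
            return 0.0
        return sum(self.history) / len(self.history)

    def update(self, actual_delay):
        # Claim 11: the observed actual delay of the current time slice
        # becomes a training sample that refreshes the model state.
        self.history.append(actual_delay)
```

In a real deployment the `update` step would run a gradient step on the trained model rather than append to a window, but the control flow — predict, play out, observe, update — is the same.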
12. An audio/video processing apparatus, comprising:
the association information acquisition module is used for acquiring data packet association information corresponding to the played audio and video data in a plurality of historical time slices during audio and video playback;
the feature data determination module is used for determining feature data to be used corresponding to each historical time slice based on the data packet association information;
the delay prediction module is used for determining a target prediction delay corresponding to the current time slice based on the feature data to be used and a predetermined delay prediction model;
and the data packet processing module is used for processing the audio and video data packets to be played, which are buffered in the jitter buffer, based on the target prediction delay.
13. An electronic device, the electronic device comprising:
one or more processors;
storage means for storing one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the audio and video processing method of any one of claims 1-11.
14. A storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform the audio and video processing method of any one of claims 1-11.
CN202311198558.0A 2023-09-15 2023-09-15 Audio and video processing method and device, electronic equipment and storage medium Pending CN117201871A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311198558.0A CN117201871A (en) 2023-09-15 2023-09-15 Audio and video processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311198558.0A CN117201871A (en) 2023-09-15 2023-09-15 Audio and video processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117201871A true CN117201871A (en) 2023-12-08

Family

ID=88990291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311198558.0A Pending CN117201871A (en) 2023-09-15 2023-09-15 Audio and video processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117201871A (en)

Similar Documents

Publication Publication Date Title
US7783773B2 (en) Glitch-free media streaming
CN112752117B (en) Video caching method, device, equipment and storage medium
CN112135169A (en) Media content loading method, device, equipment and medium
CN111225209A (en) Video data plug flow method, device, terminal and storage medium
CN111263225A (en) Video stuck prediction method and device, computer equipment and storage medium
KR20230136128A (en) Methods, devices and systems for adapting user input in cloud gaming
CN117201871A (en) Audio and video processing method and device, electronic equipment and storage medium
EP4074058A1 (en) Methods, systems, and media for selecting formats for streaming media content items
CN115842937A (en) Video playing method, device, equipment and storage medium
CN115348460B (en) Video preloading method, device, equipment and storage medium
CN115361585B (en) Video playing and clamping prediction method, device, equipment and storage medium
CN117319752A (en) Audio and video processing method and device, electronic equipment and storage medium
CN117914750B (en) Data processing method, apparatus, computer, storage medium, and program product
US12015834B2 (en) Methods, systems, and media for streaming video content using adaptive buffers
US11457287B2 (en) Method and system for processing video
CN114979799A (en) Panoramic video processing method, device, equipment and storage medium
CN112887219B (en) Message packet interval adjusting method and device
CN118233400A (en) Self-adaptive adjustment method, device, equipment and medium for congestion window
CN117473107A (en) Training method, device, equipment and storage medium for media content click model
CN116506374A (en) Scheduling method, device, medium and equipment of content distribution network
CN117082273A (en) Video playing duration prediction method and device, electronic equipment, medium and product
CN115665417A (en) Video transcoding method, device, equipment and storage medium
CN117459470A (en) Code rate adjusting method, device, electronic equipment, medium and product
CN115665503A (en) Live broadcast panoramic video stream pushing method, device, medium and equipment
CN116347115A (en) Panoramic video caching method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination