CN115996292A - Method, device, computer equipment and storage medium for training a flow control model
- Publication number: CN115996292A
- Application number: CN202210265930.4A
- Authority: CN (China)
- Prior art keywords: offline, period, flow control, data, training
- Legal status: Pending
Classifications
- H04N 19/146: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding, characterised by the data rate or code amount at the encoder output
- G06N 3/045: Computing arrangements based on biological models; neural networks; combinations of networks
- G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods
- H04L 1/0001: Systems modifying transmission characteristics according to link quality, e.g. power backoff
- H04L 47/20: Traffic control in data switching networks; flow control; congestion control; traffic policing
- H04N 19/192: Adaptive coding of digital video signals in which the adaptation method, adaptation tool or adaptation type is iterative or recursive
Abstract
The application relates to a method for training a flow control model, comprising: for each offline period, generating offline sample data of the next offline period based on the offline sample data of the current offline period and the offline flow control prediction data output by a basic flow control model; determining an offline cumulative reward reference value for each offline period according to the offline sample data of that period; performing offline period rolling training on the basic flow control model based on the offline sample data and offline cumulative reward reference values of a plurality of offline periods to obtain an intermediate flow control model; for each online period, determining an online cumulative reward reference value for the current online period from the online sample data of the current online period; and performing online period rolling training on the intermediate flow control model based on the online sample data and online cumulative reward reference values of a plurality of online periods to obtain a target flow control model suitable for predicting flow control data during multimedia communication. The target flow control model obtained by this method can improve flow control accuracy.
Description
This application is a divisional application of the Chinese patent application No. 202111211909.8, filed with the China National Intellectual Property Office on October 18, 2021 and entitled "Method and apparatus for flow control of multimedia data and for training a flow control model", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a method and apparatus for training a flow control model, a computer device, and a storage medium.
Background
With the development of network technology, more and more multimedia data transmission scenarios need to acquire the network state in real time, so that flow control of the multimedia data transmission can be performed according to the acquired network state. Taking a voice or video real-time call as an example of a multimedia data transmission scenario, such calls are typically implemented using RTC (Real-Time Communication) over the network. In RTC calls, it is usually necessary to monitor the network state in real time and then adjust the configuration of the overall call based on it; for example, if the network state is good, the coding rate may be increased. How to respond in a timely manner to the complex and variable network states in multimedia data transmission is therefore a pressing problem.
In conventional schemes, the most widely used adaptive bitrate control algorithm is the GCC algorithm in WebRTC, i.e., a network congestion control algorithm for real-time media communication. However, the GCC algorithm incurs a certain delay in actual use and depends excessively on empirical configuration, so flow control is inaccurate in practice.
Disclosure of Invention
Based on the foregoing, it is necessary to provide a method, an apparatus, a computer device and a storage medium for training a flow control model, which can improve the accuracy of flow control.
A method of training a flow control model, the method comprising:
acquiring a basic flow control model obtained by pre-training on a plurality of pre-training sample sets;
for each offline period in offline training, generating offline sample data of the next offline period based on the offline sample data of the current offline period and the offline flow control prediction data output by the basic flow control model; the offline sample data comprises offline coding data and offline communication state data;
determining an offline cumulative reward reference value for each offline period according to the offline sample data of that offline period;
performing offline period rolling training on the basic flow control model based on the offline sample data and offline cumulative reward reference values corresponding to the respective offline periods, until an offline training stop condition is reached, to obtain an intermediate flow control model;
for each online period in online training, determining an online cumulative reward reference value for the current online period from the online sample data of the current online period;
and performing online period rolling training on the intermediate flow control model based on the online sample data and online cumulative reward reference values corresponding to the respective online periods, until an online training stop condition is reached, to obtain a target flow control model suitable for predicting flow control data in the multimedia communication process.
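For illustration only, the rolling-training pipeline described by these steps can be sketched as follows; every name in the sketch (the model, simulator and reward stubs) is a hypothetical placeholder rather than the patented implementation:

```python
# A minimal, runnable sketch of the claimed pipeline. Every name here
# (model, simulator, reward) is a hypothetical stub, not the patent's code.

def cumulative_reward(sample):
    # Stand-in for the cumulative reward reference value of a period.
    return sum(sample)

class FlowControlModel:
    def predict(self, sample):
        return sum(sample) / len(sample)      # stand-in flow control prediction
    def update(self, samples, reward_refs):
        pass                                  # stand-in training step

def rolling_train(model, seed_sample, num_periods, next_sample_fn):
    """Per-period rolling training, shared by the offline and online stages."""
    sample, samples, reward_refs = seed_sample, [], []
    for _ in range(num_periods):
        prediction = model.predict(sample)            # flow control prediction
        samples.append(sample)
        reward_refs.append(cumulative_reward(sample)) # reward reference value
        model.update(samples, reward_refs)            # roll forward one period
        sample = next_sample_fn(sample, prediction)   # offline: simulator output;
    return model                                      # online: real reported data

model = FlowControlModel()                            # pre-trained base model (stub)
model = rolling_train(model, [0.5, 0.5], 10,          # offline stage: simulated
                      lambda s, p: [0.9 * x + 0.1 * p for x in s])
model = rolling_train(model, [0.4, 0.6], 10,          # online stage: here a stub
                      lambda s, p: s)                 # standing in for live reports
```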
A flow control model training device, the device comprising:
an acquisition module, configured to acquire a basic flow control model obtained by pre-training on a plurality of pre-training sample sets;
a generation module, configured to, for each offline period in offline training, generate offline sample data of the next offline period based on the offline sample data of the current offline period and the offline flow control prediction data output by the basic flow control model; the offline sample data comprises offline coding data and offline communication state data;
a first determination module, configured to determine an offline cumulative reward reference value for each offline period according to the offline sample data of that offline period;
an offline training module, configured to perform offline period rolling training on the basic flow control model based on the offline sample data and offline cumulative reward reference values corresponding to the respective offline periods, until an offline training stop condition is reached, to obtain an intermediate flow control model;
a second determination module, configured to, for each online period in online training, determine an online cumulative reward reference value for the current online period from the online sample data of the current online period;
and an online training module, configured to perform online period rolling training on the intermediate flow control model based on the online sample data and online cumulative reward reference values corresponding to the respective online periods, until an online training stop condition is reached, to obtain a target flow control model suitable for predicting flow control data in the multimedia communication process.
A computer device, comprising a memory storing a computer program and a processor which, when executing the computer program, performs the following steps:
acquiring a basic flow control model obtained by pre-training on a plurality of pre-training sample sets;
for each offline period in offline training, generating offline sample data of the next offline period based on the offline sample data of the current offline period and the offline flow control prediction data output by the basic flow control model; the offline sample data comprises offline coding data and offline communication state data;
determining an offline cumulative reward reference value for each offline period according to the offline sample data of that offline period;
performing offline period rolling training on the basic flow control model based on the offline sample data and offline cumulative reward reference values corresponding to the respective offline periods, until an offline training stop condition is reached, to obtain an intermediate flow control model;
for each online period in online training, determining an online cumulative reward reference value for the current online period from the online sample data of the current online period;
and performing online period rolling training on the intermediate flow control model based on the online sample data and online cumulative reward reference values corresponding to the respective online periods, until an online training stop condition is reached, to obtain a target flow control model suitable for predicting flow control data in the multimedia communication process.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the following steps:
acquiring a basic flow control model obtained by pre-training on a plurality of pre-training sample sets;
for each offline period in offline training, generating offline sample data of the next offline period based on the offline sample data of the current offline period and the offline flow control prediction data output by the basic flow control model; the offline sample data comprises offline coding data and offline communication state data;
determining an offline cumulative reward reference value for each offline period according to the offline sample data of that offline period;
performing offline period rolling training on the basic flow control model based on the offline sample data and offline cumulative reward reference values corresponding to the respective offline periods, until an offline training stop condition is reached, to obtain an intermediate flow control model;
for each online period in online training, determining an online cumulative reward reference value for the current online period from the online sample data of the current online period;
and performing online period rolling training on the intermediate flow control model based on the online sample data and online cumulative reward reference values corresponding to the respective online periods, until an online training stop condition is reached, to obtain a target flow control model suitable for predicting flow control data in the multimedia communication process.
A computer program product or computer program, comprising computer instructions stored in a computer readable storage medium; a processor of a computer device reads the computer instructions from the computer readable storage medium and executes them, causing the computer device to perform the following steps: acquiring a basic flow control model obtained by pre-training on a plurality of pre-training sample sets; for each offline period in offline training, generating offline sample data of the next offline period based on the offline sample data of the current offline period and the offline flow control prediction data output by the basic flow control model, the offline sample data comprising offline coding data and offline communication state data; determining an offline cumulative reward reference value for each offline period according to the offline sample data of that offline period; performing offline period rolling training on the basic flow control model based on the offline sample data and offline cumulative reward reference values corresponding to the respective offline periods, until an offline training stop condition is reached, to obtain an intermediate flow control model; for each online period in online training, determining an online cumulative reward reference value for the current online period from the online sample data of the current online period; and performing online period rolling training on the intermediate flow control model based on the online sample data and online cumulative reward reference values corresponding to the respective online periods, until an online training stop condition is reached, to obtain a target flow control model suitable for predicting flow control data in the multimedia communication process.
With the above method, device, computer equipment, storage medium and computer program for training a flow control model, pre-training brings the basic flow control model close to the decision mode of the previous-version flow control model, which avoids putting an un-pretrained flow control model into use, where it could disturb users and degrade the product experience. After pre-training, offline training is performed based on the offline sample data and offline cumulative reward reference values by simulating changes in the coding data and communication state data offline; offline training allows the flow control model to be trained as far as possible before it is formally deployed online, improving the accuracy of flow control decisions. After offline training is completed, online training is finally performed, and the corresponding flow control decisions are improved by sensing the real-time state, predicting in real time and adapting accordingly. Online training allows the flow control model to be continuously updated, so its prediction performance keeps improving; this greatly improves flow control accuracy and, in turn, multimedia communication quality and user experience.
Drawings
Fig. 1 is an application environment diagram of a method for performing flow control on multimedia data transmission in one embodiment;
Fig. 2 is a flow chart of a method for performing flow control on multimedia data in one embodiment;
FIG. 3 is a schematic diagram of the architecture of a multi-headed attention process in one embodiment;
FIG. 4 is a schematic diagram of the structure of an agent model according to an embodiment;
FIG. 5 is a schematic diagram of interaction based on an agent model in one embodiment;
FIG. 6 is a flow chart of a method of training a flow control model in one embodiment;
FIG. 7 is a flow chart of a method of training a flow control model according to another embodiment;
FIG. 8 is a schematic diagram of a process of offline training in one embodiment;
FIG. 9 is a schematic diagram of an environment simulator in one embodiment;
fig. 10 is a block diagram of an apparatus for performing flow control on multimedia data in one embodiment;
FIG. 11 is an internal block diagram of a computer device in one embodiment;
FIG. 12 is a block diagram of a flow control model training apparatus in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Fig. 1 is a diagram of an application environment for a method of performing flow control on multimedia data in one embodiment. Referring to fig. 1, the method is applied to a multimedia flow control system comprising a first terminal 102, a second terminal 104 and a server 106. The first terminal 102, the second terminal 104 and the server 106 may each execute the method for performing flow control on multimedia data provided in the embodiments of the present application separately, or may execute it cooperatively. Taking cooperative execution as an example: the first terminal 102 and the second terminal 104 are both provided with multimedia communication clients, and multimedia communication, such as audio/video communication, can be performed between the first terminal 102 and the second terminal 104 through the server 106. In this process, the server may obtain the current encoded data and communication state data reported by the multimedia communication clients of the first terminal 102 and the second terminal 104, respectively. It then executes the method for performing flow control on multimedia data described in the embodiments of the present application based on the data reported by each terminal, so as to predict the target flow control data (which may also be understood as a target flow control policy) for the corresponding terminal; the target flow control data is then applied to the client running on each terminal to perform flow control processing on the multimedia data generated in the next period.
It should be noted that the first terminal and the second terminal are only illustrative, and there may be only one terminal or more than two terminals in the actual use process, which is not limited in the embodiment of the present application.
The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing service. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a vehicle-mounted terminal, a smart television, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
The application also relates to the field of artificial intelligence (AI). Artificial intelligence is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the capabilities of perception, reasoning and decision-making. As will be readily appreciated, the present application specifically relates to supervised learning, reinforcement learning, attention mechanisms, generative adversarial networks and multi-task learning in the field of artificial intelligence.
Supervised learning is a machine learning method in which a pattern (function/model) is learned or built from training data, and new instances are inferred from that pattern. Training data consist of input data and expected outputs. The output of the function may be a continuous value (regression analysis) or a predicted class label (classification).
Reinforcement learning (RL): reinforcement learning is an area of machine learning that emphasizes how to act based on the environment so as to maximize the expected benefit. It is a third basic machine learning paradigm besides supervised and unsupervised learning. Unlike supervised learning, reinforcement learning requires neither labeled input/output pairs nor explicit correction of sub-optimal actions. Its focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge); this exploration-exploitation trade-off is central to reinforcement learning.
Attention mechanism (Attention Mechanism): the attention mechanism derives from a process that mimics biological observations, i.e., a mechanism that aligns internal experiences with external stimuli to increase attention to a partial region.
Generative adversarial networks (GAN): a GAN is an unsupervised learning method that learns by having two neural networks contest with each other. A generative adversarial network consists of a generator and a discriminator. The generator samples randomly from a latent space as input, and its output needs to imitate the real samples in the training set as closely as possible. The input of the discriminator is either a real sample or the output of the generator, and its purpose is to distinguish the generator's output from the real samples as well as possible, while the generator tries to fool the discriminator as much as possible. The two networks oppose each other and continuously adjust their parameters; the final goal is that the discriminator cannot judge whether the generator's output is real.
Multi-task learning: multiple related tasks are learned together. During learning, information from the related domains is shared and complemented through shallow shared representations, so that the tasks promote one another.
Before explaining the method for performing flow control on multimedia data provided by this application, the concept of flow control is introduced:
In the embodiments of the present application, flow control mainly targets multimedia data transmission. Flow control processing can be understood as adjusting the transmission configuration of the multimedia data; specifically, the adjustment mainly concerns the encoded data used in transmitting the multimedia data. For example, taking video as the multimedia data, flow control processing may adjust encoded data such as the resolution during video transmission in a real-time communication process.
It will be appreciated that the flow control process is typically performed once at intervals, i.e. the flow control data at the time of subsequent multimedia data transmission is predicted based on the acquired current communication state data at intervals. It follows that the flow control process is periodic in nature. Based on the above description, the flow control process mentioned in the present application is actually to predict flow control data used in the next period of multimedia data transmission based on some data of the current period, such as communication status data. In the embodiment of the present application, the flow control data of the next period is regarded as target flow control data, and the target flow control data can be used for indicating the coding data used in the next period.
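To make the periodicity concrete, a minimal sketch of this predict-then-apply loop follows (all names and the toy adjustment rule are assumptions, not the patent's logic):

```python
# Sketch of the periodic loop implied above: each period, read the state the
# current encoding produced, predict the next period's flow control data, and
# apply it. All names and the simple threshold rule are assumptions.

def run_flow_control(read_state, apply_encoding, periods=3):
    encoded = {"bitrate_kbps": 800}                   # assumed initial encoding
    for _ in range(periods):
        state = read_state(encoded)                   # state after this period
        # Target flow control data for the next period (toy rule for illustration):
        target = {"bitrate_kbps": encoded["bitrate_kbps"] // 2
                  if state["packet_loss"] > 0.1 else encoded["bitrate_kbps"]}
        encoded = apply_encoding(target)              # encoding for next period

run_flow_control(lambda enc: {"packet_loss": 0.05},   # stub state reader
                 lambda tgt: tgt)                     # stub encoder config
```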
In some embodiments, as shown in fig. 2, a method for performing flow control on multimedia data is provided. The method is applied to a computer device, which may be a terminal or the server in fig. 1, and comprises the following steps:
step S202, in the process of multimedia communication, current coding data in the current period and communication state data generated by the flow control processing of the current coding data are obtained; the current encoded data is determined based on historical flow control data of a previous period.
Multimedia communication is communication based on the transmission of multimedia data, specifically, for example, the transmission of audio/video data, such as audio/video instant messaging or the transmission and playback of audio/video data. The current encoded data is the encoded data in the current period; encoded data are encoding parameters that guide the multimedia communication client in presenting multimedia data, and may specifically include at least one of an encoding bitrate, a resolution, or a frame rate. The communication state data represents the state of the multimedia communication in the current period and is generated by the joint action of the current network environment and the encoded data. The communication state data includes at least one of a network packet loss rate, network delay information, network jitter information, or a stall rate.
It should be noted that, based on the above definition of periodic flow control, the "current period" refers to a span of time, which can be defined by two moments. For example, if the current time is 1:08 pm and the length of the "current period" is 8 minutes, the other moment defining the current period is 1:00 pm, so the span from 1:00 pm to 1:08 pm is the "current period". On this basis, taking the multimedia communication process as a real-time video call, the current encoded data in the current period refers to the video encoded data actually used in the current period; continuing the example above, it is the video encoded data used from 1:00 pm to 1:08 pm.
It will be appreciated that the communication state data generated by flow control processing with the current encoded data represents the network communication state presented in the current period after the current video encoded data has been applied. For example, if the current encoded data is determined at 1:00 pm, flow control processing is performed with it during the period from 1:00 pm to 1:08 pm (the current period); the network state presented at 1:08 pm, when that processing ends, is the communication state data generated by the flow control processing performed with the current encoded data.
The communication state data generated by flow control processing with the current encoded data may be an aggregate of the network communication state presented over the whole current period, or may be the state presented at the end of the current period as described above. The embodiments of the present application do not limit this.
As for the origin of the current encoded data: since, as stated above, "the target flow control data may be used to indicate the encoded data used in the next period", the historical flow control data of the previous period likewise indicates the current encoded data used in the current period. That is, the current encoded data is determined based on the historical flow control data of the previous period.
It should be noted that, since multimedia data transmission generally occurs between a terminal and a server, the server can generally monitor the network state, so the communication state data may be acquired from the server. The encoded data, in turn, can be acquired directly from the client used for multimedia transmission on the terminal side.
In some embodiments, taking an audio/video real-time communication scenario as an example, the overall audio/video real-time communication process is complex, the network environment in the user communication process fluctuates, and terminal devices are also various. Therefore, in the process of audio and video real-time communication of the user, the client and the server report the communication state data and the coding data, and the background flow control logic outputs the current flow control decision (namely the target flow control data) after processing, so that the stability of real-time communication is ensured as much as possible.
In the embodiments of the present application, given that a large amount of state data is reported and primary and secondary indicators are mixed together, the indicators critical to flow control decisions are screened out of the various reported state data. Referring to Table 1, Table 1 lists the indicator information used in the embodiments of the present application:
TABLE 1
Some of the quality indicators mentioned in the above table may be selected for use in the embodiments of the present application. For example, since the no-reference audio quality score and the no-reference video quality score can each serve as a measure of audio or video quality, and the quantization parameter can also serve as such a measure, the quantization parameter need not be selected if the no-reference audio and video quality scores are selected. As another example, since data such as the sampling rate and the coding type are usually fixed values in practice, they need not be selected.
Step S204, the communication state data in the current period and the current coding data are combined to obtain the combined data to be processed.
Specifically, the computer device may perform a combination process on the communication status data and the current encoded data in the current period to obtain combined data to be processed.
It will be appreciated that the content of the communication state data is generally numerical, and even if there are multiple dimensions of communication state data, the different dimensions can be represented by vectors of numerical values. The content of the encoded data is also typically numerical, so encoded data of different dimensions can likewise be represented as vectors. Thus, in step S204, "combination processing" may refer to splicing the two pieces of data so that they can subsequently be processed as a whole. It should be noted that the combination processing is not limited to splicing; other processing may precede the combination, such as dimension reduction when there are too many data dimensions, or convolution processing to extract key information. The embodiments of the present application do not specifically limit this.
In some embodiments, the combining processing is performed on the communication state data and the current coding data in the current period to obtain to-be-processed combined data, including: respectively carrying out convolution processing on the current coding data and the communication state data to obtain respective corresponding convolution processing characteristics; combining the convolution processing features results in combined data to be processed.
Specifically, the computer device may perform convolution processing on the current encoded data and the communication state data, respectively, to obtain respective corresponding convolution processing features. The convolution process may specifically be a one-dimensional convolution process. Furthermore, the computer device may combine the convolution processing features to obtain combined data to be processed.
In some embodiments, a trained flow control model is deployed on the computer device, and the communication state data and the current encoded data in the current period are combined through the flow control model to obtain the combined data to be processed. Each indicator in the indicator list of Table 1 has its own numerical characteristics, and since the flow control model is built on a deep learning network, the indicator data require unified normalization. The computer device may normalize each piece of indicator data to the value interval from 0 to 1 according to its numerical characteristics; that is, the indicator data corresponding to each indicator in the current encoded data and the communication state data are normalized and then combined to obtain the combined data to be processed.
Since each piece of indicator data in the reported current encoded data and communication state data is time-series data, the computer device can, in order to predict more accurate flow control decisions, form a historical state distribution from the indicator data of a preset number of recent reports and combine them into the combined data to be processed. For example, instead of relying only on the most recent report, the computer device may form the historical state distribution from the indicator data of the first 8 reports for each of the 14 indicators. After data preprocessing and logical combination of some indicator data, 14 x 8-dimensional combined data to be processed can be obtained, as sketched below.
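A minimal sketch of this preprocessing follows, assuming illustrative indicator names and value ranges (the patent does not specify them):

```python
# Sketch of the preprocessing described above: normalize each indicator to
# [0, 1] and stack the 8 most recent reports into an indicators-by-8 matrix.
# Indicator names and value ranges here are assumptions for illustration.
import numpy as np

ASSUMED_RANGES = {"packet_loss": (0.0, 1.0), "delay_ms": (0.0, 1000.0)}

def normalize(name, value):
    lo, hi = ASSUMED_RANGES[name]
    return min(max((value - lo) / (hi - lo), 0.0), 1.0)

def build_state(history):                  # history: 8 report dicts, oldest first
    names = sorted(history[0])
    return np.array([[normalize(n, report[n]) for report in history]
                     for n in names])      # shape: (num_indicators, 8)

reports = [{"packet_loss": 0.02 * i, "delay_ms": 50.0 + i} for i in range(8)]
print(build_state(reports).shape)          # (2, 8); (14, 8) with all 14 indicators
```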
For example, if the current period is from 1:00 pm to 1:08 pm, the communication state data in the current period is actually the state presented at 1:08 pm after the flow control processing from 1:00 pm to 1:08 pm, and the current encoded data is the encoding used from 1:00 pm to 1:08 pm. On this premise, the communication state data and the current encoded data each take only a single value in the current period, and it is difficult to ensure the accuracy of the prediction of the next-period flow control decision from a single value alone.
To address this, in an actual implementation the current period may be further divided into a plurality of sub-periods. For example, 1:00 pm to 1:08 pm may be divided into 8 sub-periods: 1:00 to 1:01 pm, 1:01 to 1:02 pm, 1:02 to 1:03 pm, and so on. Communication state data and encoded data can be obtained for each sub-period. If there are 14 types of data in total across the communication state data and the encoded data, 8 values can be obtained for each type of data within the current period, so each type of data forms an 8-dimensional vector. Accordingly, in step S204, one-dimensional convolution processing may be performed on the 8-dimensional vector corresponding to each type of data.
It can be understood that, for any type of data, the one-dimensional convolution processing on its 8-dimensional vector may be performed once or several times; the embodiments of the present application do not specifically limit this. In addition, by setting the convolution kernel, one-dimensional convolution can map data of a given dimension to data of any other dimension. Thus, the 8-dimensional vector corresponding to each type of data can be transformed into a vector of arbitrary dimension, for example a 128-dimensional vector.
Based on the above description, if the communication state data and the encoded data comprise 14 types of data in total, the convolution processing feature corresponding to each type of data is a vector produced by the convolution processing, and combining the convolution processing features of all types yields a matrix of such vectors. For example, if the convolution processing feature of each type of data is a 128-dimensional vector, the 14 types of data form a 14 x 128 matrix. It should be noted that the periods and sub-periods above are described as having equal lengths merely for convenience of description; in an actual implementation the lengths may be the same or different, which the embodiments of the present application do not specifically limit.
In the above embodiment, the communication state data and the current encoded data are respectively convolved and then combined, so that the important features in the current encoded data and the communication state data can be filtered, and the accuracy of the prediction result can be improved.
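A minimal sketch of this per-indicator one-dimensional convolution follows, with the 14-indicator, 8-report, 128-dimensional shapes from the example above; using one Conv1d per indicator is an assumption, since the text does not say whether the indicators share convolution kernels:

```python
# Sketch (assumed shapes): map each indicator's 8-dim history to a 128-dim
# feature with a 1-D convolution, then stack the 14 results into 14 x 128.
import torch
import torch.nn as nn

class PerIndicatorConv(nn.Module):
    def __init__(self, num_indicators=14, hist_len=8, out_dim=128):
        super().__init__()
        # One Conv1d per indicator; the kernel spans the whole 8-step history,
        # so each indicator's (1, 8) sequence becomes (out_dim, 1).
        self.convs = nn.ModuleList(
            nn.Conv1d(1, out_dim, kernel_size=hist_len) for _ in range(num_indicators))

    def forward(self, x):                               # x: (batch, 14, 8)
        feats = [conv(x[:, i : i + 1, :]).squeeze(-1)   # each: (batch, 128)
                 for i, conv in enumerate(self.convs)]
        return torch.stack(feats, dim=1)                # (batch, 14, 128)

print(PerIndicatorConv()(torch.rand(2, 14, 8)).shape)   # torch.Size([2, 14, 128])
```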
Step S206, self-attention mechanism processing is carried out on the combined data to be processed to obtain intermediate processing characteristics, multi-task classification is carried out on the basis of the intermediate processing characteristics, and target flow control data of at least one dimension is output.
Here, the self-attention mechanism is a variant of the attention mechanism that reduces reliance on external information and is better at capturing the internal dependencies of data or features.
In particular, the computer device may perform self-attention mechanism processing on the combined data to be processed, resulting in intermediate processing features. And then performing multi-task classification based on the intermediate processing characteristics, and outputting target flow control data of at least one dimension.
In this step, performing self-attention mechanism processing on the combined data to be processed mainly means increasing the weights of certain dimensions of the combined data, so that those dimensions bear more closely on the output of the subsequent processing.
In addition, as noted above, the target flow control data can be used to indicate the encoded data used in the next period, so outputting target flow control data of at least one dimension can be regarded as outputting encoded data of at least one dimension. The dimensions of encoded data generally include the coding type, resolution, frame rate, etc., and the possible values of each dimension can be explicitly enumerated. With all values known, the task of outputting encoded data can therefore be treated as classification over the known values; this is the origin of the multi-task classification mentioned in this step. Meanwhile, the "multi" of "multi-task" corresponds to "at least one dimension": for example, if target flow control data of three dimensions needs to be output, the multi-task classification comprises three tasks.
In some embodiments, the target flow control data may be used to indicate the encoded data used for the next cycle. Therefore, the target flow control data can be the number corresponding to the coding data type and can also be the identifier corresponding to the coding data type, and the embodiment of the application does not specifically limit the representation form of the target flow control data. Based on the above description, the encoded data can be directly determined from the stream control data.
In some embodiments, a trained flow control model may be deployed on the computer device, and the communication state data and the current coding data in the current period may be combined by using the trained flow control model to obtain combined data to be processed; and performing self-attention mechanism processing on the combined data to be processed to obtain intermediate processing characteristics, performing multi-task classification based on the intermediate processing characteristics, and outputting target flow control data of at least one dimension. For the training process of the flow control model, please refer to the details of the following embodiments.
Step S208, based on the target coding data determined by the target flow control data, the flow control processing of the multimedia data generated in the next period in the multimedia communication process is triggered.
Specifically, the computer device may trigger the flow control process to be performed on the multimedia data generated in the next period in the multimedia communication process based on the target encoded data determined by the target flow control data. When the method is executed by the server, the server can send the target stream control data to the terminal, and the terminal determines the target coding data matched with the target stream control data, so that stream control processing is carried out on the multimedia data generated in the next period in the multimedia communication process according to the target coding data. When the method is executed by the terminal, the terminal can directly determine the target coding data matched with the target flow control data, so that the flow control processing is carried out on the multimedia data generated in the next period in the multimedia communication process according to the target coding data.
In some embodiments, the target flow control data may specifically include at least one of a coding rate class, a resolution class, or a frame rate class, and the target encoded data may correspondingly include at least one of a coding rate, a resolution, or a frame rate. That is, when the computer device predicts a coding rate class, the terminal may select one coding rate belonging to that class; when it predicts a resolution class, the terminal may select one resolution belonging to that class; and when it predicts a frame rate class, the terminal may select one frame rate belonging to that class. The terminal may then apply at least one of the selected coding rate, resolution or frame rate to the multimedia communication client to perform flow control processing on the multimedia data generated in the next period of the multimedia communication process.
The terminal may select the specific coding rate from the coding rate class, the specific resolution from the resolution class, and the specific frame rate from the frame rate class at random, or by taking an intermediate value, an extreme value, a corresponding preset value, or the like, as sketched below; this is not limited in the embodiments of the present application.
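By way of illustration, a sketch of mapping a predicted coding-rate class to a concrete bitrate follows; the class boundaries and the median-pick policy are assumed, not specified by the text:

```python
# Sketch: map a predicted coding-rate class to a concrete bitrate. The class
# contents and the selection policies here are assumptions for illustration.
BITRATE_CLASSES_KBPS = {0: [200, 300, 400], 1: [600, 800, 1000], 2: [1500, 2000, 2500]}

def pick_bitrate(predicted_class, policy="median"):
    candidates = sorted(BITRATE_CLASSES_KBPS[predicted_class])
    if policy == "median":
        return candidates[len(candidates) // 2]   # intermediate value
    return candidates[0] if policy == "min" else candidates[-1]  # extreme value

print(pick_bitrate(1))   # 800
```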
With the above method for performing flow control on multimedia data, combining the current encoded data of the current period with the communication state data generated by flow control processing based on that encoded data yields combined data to be processed that fuses data of multiple indicator dimensions. Because the current encoded data and the communication state data together reflect the network environment and the user's subjective experience as a whole, combining them guides the prediction of flow control data more comprehensively. Self-attention mechanism processing is then performed on the combined data, followed by multi-task classification to output target flow control data of at least one dimension. The self-attention mechanism can thus weight the different types of encoded data and communication state data by importance, fully exploiting the ability of high-level representation features to separate primary from secondary factors, so that the output target flow control data better suits complex and changeable real network environments when used for flow control in the next period, giving a good flow control effect. Moreover, because the self-attention mechanism captures correlations inside the data and reduces dependence on external information, the output flow control data adapts better to complex and changeable real network environments.
Furthermore, when the execution body is a terminal, the generator of flow control decisions moves from the background server to the terminal, which, compared with the GCC algorithm, ensures the timeliness of flow control decisions. In addition, this method can be executed based only on the communication state and the encoded data, without depending on other algorithms, so it can comprehensively and accurately respond to network changes in multimedia communication scenarios.
In combination with the foregoing embodiments, in some embodiments, the current encoded data includes at least one of an encoding bitrate, a resolution, or a frame rate, and the target flow control data includes at least one of a coding rate class, a resolution class, or a frame rate class; the communication state data includes at least one of a network packet loss rate, network delay information, network jitter information, or a stall rate.
The coding rate refers to the proportion of useful information in the data stream after the analog signal has been sampled, quantized and encoded. Resolution refers to the precision of the screen image, i.e., the total number of pixels displayed. The frame rate is the frequency at which consecutive images (frames) appear on a display. As described in the above embodiments, the target flow control data may be used to indicate the encoded data used in the next period; thus, in the embodiments of the present application, the target flow control data may include at least one of a coding rate class, a resolution class, or a frame rate class, where the class indicates the encoded data to be used in the next period. The target flow control data may be represented by a one-hot vector denoting the class of encoded data; the embodiments of the present application do not specifically limit the representation.
The network packet loss rate refers to the ratio of the number of lost packets to the number of packets transmitted, and is generally related to the packet length and the packet transmission frequency. The network delay information refers to the time required for a packet to travel through the network path from sender to receiver. The network jitter information refers to the time difference between the maximum delay and the minimum delay, and characterizes the stability of the network. The stall rate refers to the ratio between the duration of stalling in the multimedia communication and the total duration of the multimedia communication.
In the scheme provided by the embodiments of the present application, all these kinds of data can cause changes in multimedia communication quality and network state. Using them as the basis for analyzing the network environment and making flow control decisions therefore allows the output flow control data to better suit complex and changeable real network environments, yields a better flow control effect, and suits more multimedia communication scenarios.
In some embodiments, the self-attention mechanism processing is performed on the combined data to be processed to obtain intermediate processing features, including: encoding the combined data to be processed through at least one self-attention module to obtain intermediate processing characteristics; when a plurality of self-attention modules exist, the self-attention modules are sequentially connected, input data of the first self-attention module is to-be-processed combined data, input data of the non-first self-attention module is output characteristics after the coding processing of the connected previous self-attention module, and the output characteristics of the last self-attention module are intermediate processing characteristics.
The self-attention module is a composite block comprising a plurality of network structures; it encodes the different types of encoded data and communication state data into intermediate processing features that can distinguish the importance of the different indicator data. The number of self-attention modules may be set as required, and the embodiments of the present application do not specifically limit it. A larger number of self-attention modules can capture more of the internal correlation of the data and thereby reduce the dependence on external information, but too many modules increase the amount of computation. In practice, therefore, the number of self-attention modules may be 3, balancing the effect of the attention mechanism against the computational cost.
In the scheme provided by the embodiment of the application, as different types of coded data and communication state data can be distinguished in importance degree through an attention mechanism in the flow control processing process, the output flow control data can be more suitable for complex and changeable real network environments, and the flow control effect is better.
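For concreteness, a minimal sketch of chaining three such modules follows; it assumes the module composition described in the next paragraphs (a multi-head attention layer followed by a forward processing layer of two fully connected layers) and uses PyTorch's generic nn.MultiheadAttention as a stand-in for the patent's specific attention variant:

```python
# Sketch of chaining three self-attention modules (assumed composition:
# multi-head attention followed by a two-layer feed-forward block).
import torch
import torch.nn as nn

class SelfAttentionModule(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):                      # x: (batch, 14, 128)
        attended, _ = self.attn(x, x, x)       # self-attention: q = k = v = x
        return self.ff(attended)               # forward processing layer

# First module consumes the combined data; each later module consumes the
# output features of the previous one; the last output is the intermediate
# processing feature.
encoder = nn.Sequential(*[SelfAttentionModule() for _ in range(3)])
print(encoder(torch.rand(2, 14, 128)).shape)   # torch.Size([2, 14, 128])
```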
In some embodiments, the self-attention module includes a multi-head attention layer and a forward processing layer. For any self-attention module, the encoding processing step of any self-attention module comprises the following steps: the multi-head attention processing is carried out on corresponding input data through the multi-head attention layer in the self-attention module, so that a multi-head attention processing result is obtained; and performing forward processing on the multi-head attention processing result through a forward processing layer in the self-attention module to obtain the output characteristics of the corresponding self-attention module.
The multi-head attention processing is mainly to split input data into multi-head data, determine respective weight coefficient matrixes of each head of data and perform weighting processing, and integrate weighting processing results of each head of data to obtain multi-head attention processing results of the self-attention module. The forward processing layer may be formed by two fully connected layers, and the number of fully connected layers in the forward processing layer is not specifically limited in the embodiments of the present application.
In the scheme provided by the embodiment of the application, the multi-head attention layer can split the high-dimensional input data into a plurality of low-dimensional data, and respectively perform weight processing on the plurality of low-dimensional data, so that the characteristics of the plurality of dimensions of the data can be reserved as much as possible, and the data loss is reduced.
In some embodiments, the multi-head attention processing is performed on the corresponding input data through the multi-head attention layer in the self-attention module to obtain multi-head attention processing results, which specifically includes: full connection processing is carried out on corresponding input data through a multi-head attention layer in the self-attention module, so that full connection characteristics are obtained; splitting the full connection feature to obtain a plurality of full connection sub-features; performing scale point multiplication attention processing on each full-connection sub-feature to obtain a plurality of multi-head attention sub-features; splicing a plurality of multi-head focusing sub-features to obtain multi-head focusing features; and performing full connection processing on the multi-head attention feature to obtain a multi-head attention processing result.
Referring to fig. 3, fig. 3 is a schematic diagram of the multi-head attention processing in one embodiment. As shown in fig. 3, the computer device may split the full-connection feature into a number of full-connection sub-features that is a power of 2, perform scale dot-product attention processing on each full-connection sub-feature to obtain a plurality of multi-head attention sub-features, splice the attention sub-features to obtain the multi-head attention feature, and perform full-connection processing on the multi-head attention feature to obtain the multi-head attention processing result.
In some embodiments, taking a full-connection feature of dimension 14 x 128 as an example, the computer device splits it into a power-of-2 number of full-connection sub-features, for example 4 sub-features of 14 x 32 dimensions each. After scale dot-product attention processing is applied to each full-connection sub-feature to obtain a plurality of multi-head attention sub-features, the computer device splices the attention sub-features to obtain the multi-head attention feature; for example, splicing the 4 sub-features of 14 x 32 yields 14 x 128 data again. The computer device then performs full-connection processing on the multi-head attention feature to obtain the multi-head attention processing result.
In this embodiment, the scaled dot-product attention processing can adaptively distinguish the importance of features in different dimensions and fully exploit the ability of high-level representation features to separate primary from secondary factors, so that the subsequently output flow control data better suits complex and changeable real network environments and the flow control effect is better.
In some embodiments, performing scaled dot-product attention processing on each full-connection sub-feature to obtain a plurality of multi-head attention sub-features includes: for any full-connection sub-feature, performing matrix multiplication of the sub-feature with itself to obtain a matrix multiplication result; performing scale transformation on the matrix multiplication result to obtain a scale transformation result; mapping the scale transformation result through a first activation function into the weight matrix corresponding to the sub-feature; and multiplying the weight matrix with the sub-feature to obtain the multi-head attention sub-feature corresponding to that full-connection sub-feature.
Specifically, the computer device may perform the scaled dot-product attention processing in the same manner for each of the plurality of full-connection sub-features, resulting in a plurality of multi-head attention sub-features.
For the scaled dot-product attention processing of a single full-connection sub-feature, referring to fig. 3, the computer device may multiply the sub-feature with itself to obtain a matrix multiplication result, perform scale transformation on that result, and map the scale transformation result through a first activation function (which may specifically be a softmax normalization function) into the weight matrix corresponding to the sub-feature. Multiplying the weight matrix with the sub-feature yields the multi-head attention sub-feature corresponding to that full-connection sub-feature.
Taking a full-connection sub-feature that is a 14×32 feature matrix as an example, multiplying it with itself means multiplying the 14×32 feature matrix by its 32×14 transpose, yielding a 14×14 matrix. The scale transformation mainly ensures that the score values subsequently fed into the first activation function stay within a suitable range.
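The shape bookkeeping in this example can be checked with a short NumPy sketch; the random values and the $\sqrt{32}$ scale factor shown here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sub = rng.standard_normal((14, 32))              # one full-connection sub-feature

scores = sub @ sub.T                             # (14, 32) x (32, 14) -> (14, 14)
scaled = scores / np.sqrt(32)                    # keep softmax inputs in a suitable range
e = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)      # softmax -> weight matrix
head = weights @ sub                             # (14, 14) x (14, 32) -> (14, 32)

print(scores.shape, weights.shape, head.shape)   # (14, 14) (14, 14) (14, 32)
```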
In the above embodiment, the scaled dot-product attention processing can distinguish the importance of features in different dimensions, so that the subsequently output flow control data better suits complex and changeable real network environments and the flow control effect is better.
In some embodiments, the multi-task classification based on the intermediate processing features, outputting target flow control data of at least one dimension, includes: performing residual processing on the intermediate processing features for each of the at least one dimension to obtain a residual processing result for each dimension; performing full-connection processing on each residual processing result to obtain a corresponding full-connection processing result; and mapping the full-connection processing result through a second activation function into the target flow control data of the corresponding dimension.
Specifically, when multiple dimensions exist, the computer device can perform multitasking classification processing based on the intermediate processing features in parallel, that is, for each dimension task, the computer device can perform residual processing on the intermediate processing features to obtain residual processing results corresponding to each dimension, and then perform full connection processing on each residual processing result to obtain corresponding full connection processing results. The full connection processing results are then mapped into target flow control data corresponding to the respective dimensions by a second activation function (which may be a softmax function in particular).
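One possible shape of such a per-dimension classification head, sketched in PyTorch: a residual block of two fully connected layers with a skip connection, a fully connected layer, and a softmax. The class counts, and the assumption that the 14×128 intermediate feature is first pooled to a 128-dimensional vector, are illustrative placeholders.

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """One branch: residual block (two FC layers + skip) -> FC -> softmax."""
    def __init__(self, dim=128, num_classes=8):
        super().__init__()
        self.res = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, feat):                      # feat: (dim,)
        h = feat + self.res(feat)                 # residual processing
        return torch.softmax(self.fc(h), dim=-1)  # second activation function

# one head per output dimension; class counts are illustrative
heads = {"coding_rate": ClassifierHead(num_classes=8),
         "resolution": ClassifierHead(num_classes=5),
         "frame_rate": ClassifierHead(num_classes=6)}
feat = torch.randn(14, 128).mean(dim=0)           # intermediate feature pooled to a vector
outputs = {name: head(feat) for name, head in heads.items()}
print({name: probs.shape for name, probs in outputs.items()})
```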
It should be noted that, as understood from the foregoing embodiments, the target flow control data may include at least one of a coding rate category, a resolution category, or a frame rate category. Hence "at least one dimension" here corresponds to "at least one of" the above: flow control data is output for at least one of the coding rate, resolution or frame rate categories, and other types of encoded data are also possible.
It will be appreciated that the above mainly involves three processes: residual processing, full-connection processing and activation-function processing. If these three processes are implemented by a multi-head classifier, then in practice one multi-head classifier can be provided for each dimension of target flow control data to be output. Thus, in actual implementations of the embodiments of the present application, the number of multi-head classifiers may match the total dimension of the output data.
In the above embodiment, since the multi-dimension flow control data can be output, the output flow control data can be more suitable for complex and changeable real network environments, the flow control effect is better, and the method and the device are suitable for more multimedia communication scenes.
The processes mentioned in the above embodiments mainly apply a model with a self-attention mechanism to implement the internal processing flow of flow control. In practice, the model may be implemented as a reinforcement learning model or another deep learning model; the embodiments of the present application do not specifically limit the model type. For ease of description and understanding, the embodiments of the present application use an agent model in reinforcement learning (i.e., the flow control model referred to herein) to illustrate the above process.
Taking the multimedia communication process as an audio/video real-time call as an example, the above flow control method is illustrated with reference to fig. 4 and 5. Fig. 4 is a schematic architecture diagram of the agent model in one embodiment, and fig. 5 is an interaction diagram based on the agent model in one embodiment.
Referring to fig. 4, the leftmost part of fig. 4 is a data preprocessing module, mainly used for acquiring the current encoded data of the current period and the communication state data generated by performing flow control processing with the current encoded data. The communication state data may be obtained by a server and the current encoded data by a multimedia communication client. The communication state data may include at least one of a network packet loss rate, network delay information, network jitter information, or a stall rate, and the current encoded data may include at least one of a coding rate, a resolution, or a frame rate.
Of course, in practice, the communication state data may further include at least one of a video packet loss rate, an audio packet loss rate, video jitter information, audio jitter information, a video stall rate, an audio stall rate, or audio error concealment information. The current encoded data may further include at least one of a video coding rate, hard-coding/soft-coding information, or an audio coding rate; the embodiments of the present application do not specifically limit the contents of the communication state data and the current encoded data. It should be noted that, since data such as the sampling rate and coding type are usually fixed values in practice, they may be omitted from the communication state data and the current encoded data.
If the current encoded data and the communication state data together yield 14 types of index data after the data preprocessing module, the 14 types of index data are input to the flow control decision AI agent backbone module, where each type of index data is fed sequentially through 2 one-dimensional convolution layers conv_1d. The backbone module may therefore contain 28 one-dimensional convolution layers. Of course, in practice each type of data may be given a different number of one-dimensional convolution layers, which the embodiments of the present application do not specifically limit.
As described in the foregoing embodiments, the current period may be divided into consecutive sub-periods. Each type of data may thus be an 8-dimensional vector, and each 8-dimensional vector yields one 128-dimensional vector after passing through the 2 conv_1d layers, so that a 14×128 matrix is in fact obtained after the one-dimensional convolution processing.
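One plausible reading of this front end, sketched in PyTorch: each of the 14 indicator types has its own two conv_1d layers (28 in total) mapping an 8-dimensional vector to a 128-dimensional one. The kernel sizes, channel widths, activations and the mean-pooling over the time axis are assumptions; the patent only fixes the 8-in / 128-out shapes.

```python
import torch
import torch.nn as nn

class IndicatorConv(nn.Module):
    """Two one-dimensional convolutions mapping one indicator's 8 sub-period values to 128 dims."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv1d(1, 64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(64, 128, kernel_size=3, padding=1)

    def forward(self, x):                 # x: (8,), one indicator over 8 sub-periods
        h = x.view(1, 1, -1)              # (batch=1, channels=1, length=8)
        h = torch.relu(self.conv1(h))
        h = torch.relu(self.conv2(h))     # (1, 128, 8)
        return h.mean(dim=-1).squeeze(0)  # pool over time -> (128,)

front_end = nn.ModuleList(IndicatorConv() for _ in range(14))  # 28 conv layers in total
indicators = torch.randn(14, 8)                                # 14 index data types
feature = torch.stack([m(x) for m, x in zip(front_end, indicators)])
print(feature.shape)                                           # torch.Size([14, 128])
```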
After the 14×128 matrix is obtained, it passes through a self-attention module consisting of a multi-head attention layer and a forward processing layer. In fig. 4, "Multi-head attention" is the multi-head attention layer, "Feed forward" is the forward processing layer, and the enclosing dashed box represents the self-attention module. As fig. 4 shows, there may be more than one self-attention module; in practice the number may be 3, which the embodiments of the present application do not specifically limit. It should be noted that fig. 4 does not show the connections between the self-attention modules; in practice they may be connected in series. For example, if there are 3 self-attention modules, the 14×128 matrix passes through the three serial modules in turn. The serial connection captures the correlations inside the data as much as possible, reducing dependence on external information.
After the self-attention modules, a 14×128 feature matrix is obtained; that is, the flow control decision AI agent backbone module outputs a 14×128 feature matrix. This feature matrix is then input into multi-head classifiers to realize the multi-task classification. In the example of fig. 4, target flow control data of three dimensions is obtained: coding rate, resolution and frame rate. It should be noted that fig. 4 shows only one multi-head classifier by way of example; in practice the number of multi-head classifiers may match the total dimension of the output data, i.e., one multi-head classifier may be designed for each dimension of target flow control data to be output.
A multi-head classifier may include a residual block (Residual Block), a fully connected layer (FC), and a softmax layer. The residual block may comprise two fully connected layers, with a residual connection between their input and output. As shown on the right of fig. 4, "Multi-head Classifier" denotes a multi-head classifier, and several multi-head classifiers may constitute the flow control decision prediction module shown in fig. 4.
For the "Multi-head attention layer mentioned in the above procedure, the structure thereof can be referred to fig. 3. The "input data" in fig. 3 is a matrix obtained by one-dimensional convolution processing to obtain 14×128, and the matrix is processed by a full-connection layer, and then 4 paths of data can be obtained by multi-head splitting of channels, which are respectively 14×32. Wherein each 14 x 32 matrix can be processed with reference to the refinement indicated by the right dashed line. Specifically, for a certain 14×32 matrix, due to self-attention mechanism, the 14×32 matrix can be copied into 3 copies, and then taken as V, K and Q, respectively. Multiplying V and Q, performing scale transformation on the multiplied result, and performing normalization index processing on the scale transformation result through an activation function, so that a weight matrix corresponding to the 14 x 32 matrix can be obtained. Multiplying the weight matrix by Q to obtain the feature matrix of the 14 x 32 matrix after the attention processing by the scale point multiplication.
Through the process, each path of data can obtain a 14 x 32 feature matrix, and then the multi-head focusing feature can be obtained through multi-head channel splicing. Finally, the multi-head attention processing result processed by the multi-head attention layer can be obtained through the full connection layer. After processing through a multi-head attention layer, the processing can be performed by a forward processing layer, and then the processing of a self-attention module is completed. After processing by multiple self-attention modules, e.g., 3, intermediate processing features can be obtained.
Fig. 3, fig. 4 and the related description above mainly explain the structure of the behavior prediction network (also called the Actor network) in the flow control decision AI agent. As fig. 4 shows, the Actor network mainly consists of the flow control decision AI agent backbone module and the flow control decision prediction module; its input is the communication state given by the background server and the encoded data reported by the client, and its output is flow control data in three dimensions as the flow control decision. Applying the flow control decision to the client environment changes the state index values and the reward value. Because the flow control decisions are all discrete quantized values, the Actor network can be understood as a multi-task "classification" model.
It will be appreciated that, besides the Actor network, the flow control decision AI agent includes a behavior evaluation network (also called the Critic network). The Critic network mainly consists of the flow control decision AI agent backbone module and a cumulative reward prediction module; its input is the same as the Actor network's, and its output is a cumulative reward predicted value. The cumulative reward prediction module may include a residual module and a fully connected layer, as shown at the lower right of fig. 4. It should be noted that, as shown in fig. 4, the Actor network and the Critic network share one backbone network; in practice they may alternatively not share it, which the embodiments of the present application do not specifically limit.
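The overall Actor/Critic assembly might be composed as in the compact sketch below. The simple two-layer stand-in for the backbone (which in the patent is the conv_1d front end plus self-attention modules sketched earlier), the flattening step, and the class counts are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class FlowControlAgent(nn.Module):
    """Actor and Critic sharing one backbone: Actor -> multi-head classifiers,
    Critic -> a head producing one cumulative reward prediction."""
    def __init__(self, dim=128, seq=14, classes=(8, 5, 6)):  # class counts illustrative
        super().__init__()
        # stand-in for the conv + self-attention backbone described above
        self.backbone = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.actor_heads = nn.ModuleList(nn.Linear(seq * dim, c) for c in classes)
        self.critic_head = nn.Linear(seq * dim, 1)

    def forward(self, state):                    # state: (14, 128) preprocessed indicators
        feat = self.backbone(state).flatten()    # intermediate processing feature
        actions = [torch.softmax(h(feat), dim=-1) for h in self.actor_heads]
        value = self.critic_head(feat)           # cumulative reward predicted value
        return actions, value

agent = FlowControlAgent()
probs, value = agent(torch.randn(14, 128))
print([p.shape for p in probs], value.shape)
```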
In combination with the above flow, the flow control interaction process of the flow control decision AI agent can refer to fig. 5. The client state is mainly represented by the encoded data used by the client in the current period, i.e., the current encoded data of the current period. The flow control decision AI agent obtains the communication state from the background server and the client state from the client, makes a flow control decision prediction based on both, and applies it to the client to be perceived by the user.
In the scheme provided by the embodiments of the present application, different network environments and client states matter differently to the flow control decision: the main drivers of rate switching and of resolution and frame rate adjustment are real-time communication states such as the audio/video packet loss rate and uplink/downlink delay, while logically inferable data such as the stall rate and frame rate are better suited to assist in guiding the flow control decision. Therefore, for different environment statistical states, the embodiments of the present application use the attention mechanism to adaptively adjust the importance of different data and fully exploit the primary-versus-secondary inference role of high-level representation features, making the output flow control decision more accurate.
In addition, the flow control decisions affecting subjective user experience are not singular: the audio/video coding rate, the video resolution and the frame rate all directly shape the user experience, so having all three reach their optimal gear values at every moment gives the user a better product experience. The AI agent can predict suitable flow control gear values from multiple angles; the embodiments of the present application adopt a multi-task learning strategy to predict the coding rate, resolution and frame rate decisions separately, so that the product experience brought by the flow control decisions can be improved as much as possible in different multimedia communication scenarios.
The foregoing embodiments mainly illustrate a model application process when performing multimedia data transmission and flow control, and it is understood that the model also needs an adaptive training process. Referring to fig. 6, an embodiment of the present application provides a method for training a flow control model. The method can be applied to a terminal or a server, and the embodiment of the application does not specifically limit the type of the execution body. The terminal may be, but is not limited to, a smart phone, tablet computer, notebook computer, desktop computer, smart speaker, smart watch, vehicle terminal, smart television. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing service.
The method is applied to a computer device; taking a real-time video call as an example of the multimedia communication process involved, the method comprises the following steps:
step S602, obtaining a basic flow control model obtained by pretraining a plurality of pretraining sample groups.
The basic flow control model is obtained by pre-training with a plurality of pre-training sample groups; the main purpose of pre-training is to give the basic flow control model a certain flow control effect. It can be understood that reinforcement learning training is relatively unstable: if offline and online training are performed directly while the predictions of the initial flow control model's Actor and Critic networks are still inaccurate, the later convergence of the reinforcement learning slows down and the final model's performance suffers. It is therefore necessary to pre-train the initial flow control model based on historical report data to obtain a basic flow control model with a certain effect, i.e., a baseline model.
The pre-training may use a supervised learning method; the embodiments of the present application do not specifically limit the pre-training process. Details of the pre-training are described in the embodiments below.
Step S604, for each offline period in the offline training, generating offline sample data of the next offline period based on the offline sample data of the current offline period and the offline flow control prediction data output by the basic flow control model; the offline sample data includes offline sample encoding data and offline sample communication status data.
It should be noted that training the flow control model relies on interaction between the model and its environment; in the offline state the model cannot interact with the online environment in real time, and an environment simulator is used to overcome this. That is, in the offline training stage, the environment simulator may generate the offline sample data of the next offline period based on the offline sample data of the current offline period and the offline flow control prediction data output by the basic flow control model. The next period's offline sample data can then be fed into the basic flow control model, which outputs new offline flow control prediction data; this interaction loops continuously to realize offline training of the flow control model.
It should also be noted that, to make the flow control model applicable to abnormal network emergencies, the offline-generated sample data can be randomly perturbed to simulate such emergencies for use in the training process of the flow control model.
In some embodiments, for any offline period, the offline flow control prediction data is what the basic flow control model outputs after the current offline coding data of the period and the offline communication state data generated by flow control processing with that coding data are input into it. The offline sample data may include the current offline coding data and offline communication state data of the period, representing the network state and client state after the period ends; the offline flow control prediction data of the current offline period is in fact the flow control decision expected to be used in the next offline period. It can be understood that step S604 simulates, from the network and client state before the next offline period starts and the flow control decision to be used in that period, the network and client state at the end of the next period, i.e., the next period's offline sample data.
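The interaction loop can be pictured as the Python sketch below. The `env_simulator` stub, the state fields, and the 5% random perturbation that mimics an abnormal network event are all placeholders; only the loop structure follows the description above.

```python
import random

def env_simulator(sample, action):
    """Placeholder: returns the next period's offline sample data given the
    current period's sample data and the model's flow control prediction."""
    next_sample = {"coding": action, "state": sample["state"]}
    if random.random() < 0.05:                  # randomly inject an abnormal network event
        next_sample["state"] = {"loss_rate": 0.5, "delay_ms": 800}
    return next_sample

def rollout(model, sample, periods):
    """Roll the basic flow control model against the simulator for several offline periods."""
    trajectory = []
    for _ in range(periods):
        action = model(sample)                  # offline flow control prediction data
        trajectory.append((sample, action))
        sample = env_simulator(sample, action)  # offline sample data of the next period
    return trajectory

# illustrative usage with a trivial stand-in model
traj = rollout(lambda s: {"bitrate": 800},
               {"coding": None, "state": {"loss_rate": 0.01}}, periods=8)
print(len(traj))  # 8
```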
Step S606, determining an offline cumulative reward reference value for each offline period according to the offline sample data of each offline period.
Specifically, the computer device may sequentially obtain the offline sample data of each interaction period during the continuous interaction between the behavior prediction network of the flow control model and the environment simulator, until the offline sample data of every offline period is obtained. The computer device may then determine the offline cumulative reward reference value of each offline period based on that period's offline sample data. Details on deriving the offline cumulative reward reference values from the offline sample data are given in the embodiments below.
In this step, as can be seen from the above, for any offline period, the offline sample data of that period may represent the network state and client state after the period ends, which in turn reflect the audio/video call quality of the period. For example, if the fluency of the audio and video in the offline period can be determined from the offline sample data, that fluency reflects the period's call quality. Therefore, the network state and client state of each offline period are quantized into an index evaluating that period's audio/video call quality, yielding the offline cumulative reward reference value of each offline period, so that the computer device can subsequently evaluate each period's offline flow control prediction data against this index when performing offline rolling training on the basic flow control model.
Step S608, performing offline period rolling training on the basic flow control model based on the offline sample data and offline cumulative reward reference values corresponding to each offline period, until the offline training stop condition is reached, obtaining the intermediate flow control model.
Specifically, the computer device may perform offline period rolling training on the basic flow control model based on the offline sample data and offline cumulative reward reference values corresponding to each of the plurality of offline periods, until an offline training stop condition is reached, obtaining the intermediate flow control model. The offline training stop condition may be reaching a preset number of training iterations or convergence of the training process, which the embodiments of the present application do not specifically limit.
Step S610, for each online period in the online training, determining an online cumulative reward reference value for the current online period from the online sample data of the current online period.
Specifically, after the offline training stage converges, the flow control model can be deployed in a multimedia communication server. The intermediate flow control model predicts flow control data from online sample data (online coded data and online communication state data) during a fixed time period and applies it to the multimedia communication client, while collecting the newly changed states. The online sample data collected during the fixed period is used for online training and updating of the intermediate flow control model.
It should be noted that, unlike offline training, the interaction object of online training is no longer the environment simulator but the multimedia communication client and server running online; the flow control model is further optimized and updated based on online real-time feedback data. The online training flow is similar to the offline one. During each online period, multiple deployment servers collect online data used to train the intermediate flow control model on the training server, fine-tuning the model parameters and continuously improving the intermediate model's prediction effect to obtain the target flow control model.
Specifically, for each online period in the online training, the computer device determines the online cumulative reward reference value of the current online period from the current online period's online sample data. The online sample data includes online coded data and online communication state data.
Step S612, performing online period rolling training on the intermediate flow control model based on the online sample data and online cumulative reward reference values corresponding to each of a plurality of online periods, until an online training stop condition is reached, obtaining a target flow control model suitable for flow control data prediction in the multimedia communication process.
It should be noted that the concept of online training is consistent with that of offline training; the calculation of the reward value and cumulative reward value is not repeated here. The difference is that the online training phase does not need to simulate the next period's sample data, which can be obtained directly online. The online training stop condition may follow the setting of the offline training stop condition, which the embodiments of the present application do not specifically limit. During online training, the flow control model continuously adapts to the online network environment and keeps updating itself, continuously improving the accuracy of the flow control decision.
In the scheme provided by the embodiments of the present application, pre-training brings the basic flow control model close to the decision mode of the previous version of the flow control model, avoiding putting a non-pre-trained model into use, which could cause user discomfort and degrade the product experience. Offline training after pre-training allows the flow control model to be trained as much as possible before it is formally applied online, improving the accuracy of flow control decisions. In addition, the offline sample data can be randomly processed to simulate abnormal network emergencies, and training the basic flow control model with such randomly simulated offline sample data forces it to adapt to abnormal network emergencies and to learn self-adjustment. After offline training, online training is finally performed; during online training the flow control model is continuously updated, so its prediction effect keeps improving, further improving multimedia communication quality and user experience.
In some embodiments, the base flow control model includes a behavior prediction network and a behavior evaluation network, the behavior prediction network and the behavior evaluation network sharing an encoding structure, the behavior prediction network further including a multi-headed classification structure, the behavior evaluation network further including a single-tasking processing structure; the coding structure comprises at least one self-attention module which is connected in sequence, and each self-attention module comprises a multi-head attention layer and a forward processing layer.
The behavior prediction network corresponds to the Actor network mentioned in the above embodiments and the behavior evaluation network to the Critic network; the coding structure they share corresponds to the backbone network, and the multi-head classification structure corresponds to the multi-head classifiers. The various concepts mentioned in the embodiments of the present application may refer to the foregoing embodiments and are not repeated here.
In the scheme provided by the embodiment of the application, as different types of coded data and communication state data can be distinguished in importance degree through an attention mechanism in the flow control processing process, the output flow control data can be more suitable for complex and changeable real network environments, and the flow control effect is better. In addition, the self-attention mechanism is adopted, so that the self-attention mechanism can capture the correlation inside the data and reduce the dependence on external information, and the output flow control data can be more suitable for complex and changeable real network environments.
In some embodiments, the method for training a flow control model further includes a step of pre-training the flow control model, and referring to fig. 7, the step specifically includes the following:
step S702, historical communication state data and historical coding data corresponding to the same historical period in the historical report data are formed into historical sample data.
The historical report data is communication related data reported by the multimedia communication client and the server together in a historical stage. For a certain history period, the history code data of the history period refers to code data used by the history period, and the history communication state data of the history period refers to communication state data generated by the history period after the flow control processing is performed through the history code data. The computer device may combine the two data corresponding to the same history period into history sample data for that history period.
Step S704, for the current historical period, determining the historical flow control reference data corresponding to the current historical period based on the historical encoded data of the next historical period, and determining the historical cumulative reward reference value corresponding to the current historical period according to the historical sample data of the current historical period.
As described in the above embodiments, the target flow control data indicates the encoded data to be used in the next period. It can thus be appreciated that, for a given historical period, the historical flow control reference data, i.e., what should have been predicted from that period's historical sample data for use in the next historical period, corresponds to the historical encoded data of the next historical period.
Thus, the computer device may determine the historical flow control reference data corresponding to the current historical period based on the historical encoded data of the next historical period. In connection with the description of cumulative reward reference values in the above embodiments, the computer device may also determine the historical cumulative reward reference value corresponding to the current historical period based on the historical sample data of the current historical period; its calculation may refer to the calculation of the offline cumulative reward reference value in the embodiments below.
In step S706, the historical sample data, the historical flow control reference data and the historical cumulative reward reference value corresponding to the same historical period are used as a set of pre-training samples.
Step S708, pre-training the initial flow control model to be trained according to a plurality of pre-training sample groups until reaching a pre-training stopping condition, and obtaining a basic flow control model.
Specifically, the computer device may use the obtained historical sample data, historical flow control reference data, and historical cumulative reward reference value of each historical period as the pre-training sample group corresponding to that period. The pre-training process is then completed through step S708.
The pre-training mode may be a supervised learning training mode, and the pre-training stopping condition may be a condition of reaching a preset training frequency or convergence of a training process, which is not specifically limited in the embodiment of the present application.
In the scheme provided by the embodiments of the present application, pre-training brings the basic flow control model close to the decision mode of the previous version of the flow control model, avoiding putting a non-pre-trained flow control model into use, which could cause user discomfort and degrade the product experience.
In some embodiments, the step of pre-training the initial flow control model to be trained according to a plurality of pre-training sample sets until reaching a pre-training stop condition to obtain a basic flow control model specifically comprises the following steps:
Based on the behavior prediction network in the initial flow control model to be trained, processing the historical sample data in a pre-training sample group and outputting historical flow control prediction data; determining a first cross entropy loss according to the historical flow control prediction data and the historical flow control reference data corresponding to each pre-training sample group; based on the behavior evaluation network in the initial flow control model, processing the historical sample data in the pre-training sample group and outputting a historical cumulative reward predicted value; determining a first reward loss based on the difference between the historical cumulative reward predicted value and the corresponding historical cumulative reward reference value; constructing a pre-training loss function based on the first cross entropy loss and the first reward loss; and pre-training the initial flow control model through the pre-training loss function until the pre-training stop condition is reached, obtaining the basic flow control model.
For ease of understanding, the training process is now described by way of example. Before the embodiments of the present application are executed, a number of known [state, action, cumulative reward reference value] triples may be constructed, where each triple corresponds to one pre-training sample group as mentioned in the above embodiments.
Taking the t-th historical period as an example, its pre-training sample group can be recorded as $[state_t, action_t, G_t]$, where $state_t$ corresponds to the historical sample data of the t-th historical period, $action_t$ to the historical flow control reference data of the t-th historical period, and $G_t$ to the historical cumulative reward reference value of the t-th historical period.
As described in the above embodiments, the behavior prediction network in the initial flow control model can predict the historical flow control prediction data of the t-th historical period from $state_t$. Since the historical flow control reference data of the t-th historical period is known, a first cross entropy loss function can be constructed based on the principle of supervised learning to determine the first cross entropy loss. Taking historical flow control prediction data with three dimensions (coding rate, resolution and frame rate) as an example, the first cross entropy loss $\mathrm{loss}_{cls}$ may be determined by combining the coding rate, resolution and frame rate cross entropy losses, specifically by weighted summation.
In some embodiments, the first cross entropy loss may be constructed with reference to the following formula (1):

$\mathrm{loss}_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\left(\alpha\sum_{j=1}^{B} b_{ij}\log\rho_{b_{ij}} + \beta\sum_{k=1}^{R} r_{ik}\log\rho_{r_{ik}} + \gamma\sum_{l=1}^{F} f_{il}\log\rho_{f_{il}}\right) \quad (1)$
Formula (1) is the loss function of the Actor network during pre-training. N denotes the total number of pre-training sample groups and i indexes the i-th pre-training sample group. α is the weight of the multi-head prediction cross entropy loss for the coding rate, B is the total number of coding rate categories, and j indexes the j-th coding rate category; $b_{ij}$ takes the value 1 if the coding rate in the historical flow control reference data of the i-th pre-training sample group corresponds exactly to the j-th category, and 0 otherwise, while $\rho_{b_{ij}}$ is the predicted probability that the coding rate of the i-th pre-training sample group belongs to the j-th category.

Similarly, β is the weight of the multi-head prediction cross entropy loss for the resolution, R is the total number of resolution categories, and k indexes the k-th resolution category; $r_{ik}$ is 1 if the resolution in the historical flow control reference data of the i-th pre-training sample group corresponds exactly to the k-th category and 0 otherwise, and $\rho_{r_{ik}}$ is the predicted probability that the resolution of the i-th group belongs to the k-th category.

γ is the weight of the multi-head prediction cross entropy loss for the frame rate, F is the total number of frame rate categories, and l indexes the l-th frame rate category; $f_{il}$ is 1 if the frame rate in the historical flow control reference data of the i-th group corresponds exactly to the l-th category and 0 otherwise, and $\rho_{f_{il}}$ is the predicted probability that the frame rate of the i-th group belongs to the l-th category.
Through formula (1), the first cross entropy loss corresponding to the pre-training sample groups can be calculated. As described in the above embodiments, the behavior evaluation network in the initial flow control model, i.e., the Critic network, can predict from $state_t$ the historical cumulative reward predicted value of the t-th historical period, while the historical cumulative reward reference value of that period is known. The Euclidean distance between the historical cumulative reward predicted value and the historical cumulative reward reference value can therefore be used to construct the first reward loss, i.e., the loss function of the Critic network, with reference to the following formula (2):

$\mathrm{loss}_{reg} = \frac{1}{N}\sum_{t=1}^{N}\left(\widehat{G}_t - G_t\right)^2 \quad (2)$

In formula (2), N denotes the total number of pre-training sample groups, t indexes the t-th pre-training sample group, $\widehat{G}_t$ denotes the historical cumulative reward predicted value, and $G_t$ the historical cumulative reward reference value.
After determining the first cross entropy loss and the first reward loss, the computer device may perform a (possibly weighted) summation of the two to construct the pre-training loss function, with reference to the following formula (3):

$\mathrm{loss}_{exp} = \mathrm{loss}_{reg} + \mathrm{loss}_{cls} \quad (3)$

In formula (3), $\mathrm{loss}_{exp}$ denotes the pre-training loss function.
Based on formulas (1) to (3), the initial flow control model may be pre-trained; the pre-training stop condition may be convergence of the pre-training loss function. It should be noted that the foregoing embodiment trains the Actor network together with the Critic network, mainly to maintain the connection between the two networks. In practice, the two networks may also be trained separately, each with its own loss function, until the pre-training stop condition is met; the embodiments of the present application do not specifically limit the pre-training manner.
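A NumPy sketch of formulas (1) to (3) follows. The sample count, the class counts B, R, F, the unit weights α, β, γ, and the random stand-in labels and predictions are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
N, B, R, F = 16, 8, 5, 6            # sample groups; bitrate/resolution/framerate classes

# one-hot reference labels b, r, f and predicted probabilities rho (rows sum to 1)
b = np.eye(B)[rng.integers(0, B, N)]
r = np.eye(R)[rng.integers(0, R, N)]
f = np.eye(F)[rng.integers(0, F, N)]
softmax = lambda z: np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
rho_b = softmax(rng.standard_normal((N, B)))
rho_r = softmax(rng.standard_normal((N, R)))
rho_f = softmax(rng.standard_normal((N, F)))

alpha, beta, gamma = 1.0, 1.0, 1.0  # per-head cross-entropy weights

# formula (1): weighted sum of the three heads' cross entropies
loss_cls = -(alpha * (b * np.log(rho_b)).sum()
             + beta * (r * np.log(rho_r)).sum()
             + gamma * (f * np.log(rho_f)).sum()) / N

# formula (2): squared distance between predicted and reference cumulative rewards
g_pred = rng.standard_normal(N)     # Critic's cumulative reward predicted values
g_ref = rng.standard_normal(N)      # cumulative reward reference values
loss_reg = ((g_pred - g_ref) ** 2).mean()

loss_exp = loss_reg + loss_cls      # formula (3): pre-training loss
print(loss_cls, loss_reg, loss_exp)
```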
In the above embodiment, pre-training brings the basic flow control model close to the decision mode of the previous version of the flow control model, avoiding putting a non-pre-trained flow control model into use, which could cause user discomfort and degrade the product experience.
In some embodiments, the offline sample data includes a coding rate and a stall rate, and determining the offline cumulative reward reference value of each offline period according to the offline sample data of each offline period includes:
for the current offline period, determining the definition of the current offline period according to its coding rate; determining the fluency of the current offline period according to its stall rate; determining the smoothness of the current offline period according to its coding rate and the coding rate of the next offline period; calculating an offline reward reference value for the current offline period from its definition, fluency, smoothness, video no-reference quality score, and audio no-reference quality score; and determining the offline cumulative reward reference value of the current offline period based on the offline cumulative reward reference value of the next offline period and the offline reward reference value of the current offline period.
As can be seen from the foregoing embodiments, for any offline period, the offline sample data of the period may include the offline encoded data used in the period and the offline communication state data generated after flow control processing is performed with that encoded data; these data may include a coding rate and a stall rate. Therefore, in the embodiments of the present application, the fluency, definition and smoothness of each offline period can be determined from that period's coding rate and stall rate.
Specifically, the fluency of the video represents the stall condition experienced by the user while watching, where the stall condition may include the number of stalls and the stall duration: the fewer the stalls and the shorter the stall duration, the higher the fluency. Thus, $F_t$ may denote the stall rate of the t-th offline period and $R(F_t)$ the fluency term of the t-th offline period.
The definition of a video can be measured by its coding rate; in general, the higher the coding rate, the higher the definition. Thus, $b_t$ may denote the coding rate of the t-th offline period and $R(b_t)$ the definition of the t-th offline period.
The smoothness of the video refers to the coding rate switching perceived by the user while watching: the fewer the switches, or the smaller the switching fluctuation, the higher the smoothness. The difference $(b_t - b_{t-1})$ between the coding rates of two adjacent offline periods can therefore measure their switching fluctuation; it can be appreciated that the smaller the difference, the smoother the switch from the (t-1)-th to the t-th offline period. $R(b_t - b_{t-1})$ thus represents the smoothness term of the t-th offline period.
Based on the above parameters, the computer device may calculate the offline reward reference value through the following formula (4):

$\mathrm{reward}_t = \lambda_b R(b_t) - \lambda_f R(F_t) - \lambda_s R(b_t - b_{t-1}) + \lambda_a R(Qa_t) + \lambda_v R(Qv_t) \quad (4)$

In formula (4), $\mathrm{reward}_t$ denotes the offline reward reference value of the t-th offline period, $Qa_t$ the audio no-reference quality score of the t-th offline period, and $Qv_t$ the video no-reference quality score of the t-th offline period; $\lambda_b$, $\lambda_f$, $\lambda_s$, $\lambda_a$ and $\lambda_v$ are the weights of the definition, fluency, smoothness, audio no-reference quality score and video no-reference quality score terms, respectively.
$R(\cdot)$ is simply a normalization operation that maps definition, fluency, smoothness, the audio no-reference quality score and the video no-reference quality score onto the same value scale for convenient calculation. The audio no-reference quality score measures audio definition and the video no-reference quality score measures video definition; the video score may be calculated with a natural image quality evaluation algorithm, the G.1070 standard, or similar algorithms, and the audio score with a program loudness measurement algorithm or a gap algorithm, the measurement not being specifically limited in the embodiments of the present application.
In formula (4), the terms $\lambda_f R(F_t)$ and $\lambda_s R(b_t - b_{t-1})$ are subtracted because $R(F_t)$ is defined as non-fluency (the opposite of fluency) and $R(b_t - b_{t-1})$ as non-smoothness (the opposite of smoothness). In practice, calculation processes different from formula (4) may be derived based on the definitions of its parameters, which the embodiments of the present application do not specifically limit.
It should be further noted that the audio no-reference quality score and the video no-reference quality score participate in the calculation of $\mathrm{reward}_t$ as measures of audio and video quality, respectively. Besides these scores, quantization parameters may also serve as measures of audio or video quality, and in practice either may be used when calculating $\mathrm{reward}_t$.
It will be appreciated that the computer device can calculate $\mathrm{reward}_t$ for each offline period through formula (4). The offline cumulative reward reference value of the t-th offline period can then be calculated with reference to the following formula (5):

$G_t = \mathrm{reward}_t + G_{t+1} \quad (5)$

For offline periods, $G_t$ in formula (5) denotes the offline cumulative reward reference value of the t-th offline period, $G_{t+1}$ that of the (t+1)-th offline period, and $\mathrm{reward}_t$ the offline reward reference value of the t-th offline period. As formula (5) shows, $G_t$ can be calculated once $G_{t+1}$ and $\mathrm{reward}_t$ are known.
It should be noted that calculating $G_t$ requires $G_{t+1}$; that is, the cumulative reward values are essentially pre-computed in reverse order. In practice, the computer device may perform offline training in rounds, and the offline cumulative reward reference value of each offline period in the current round may be determined before that round's offline training is executed.
Specifically, the computer device may retain the behavior prediction network and the behavior evaluation network from the previous round of training, i.e., the Exploitation Actor network and the Exploitation Critic network. The computer device can process the offline sample data of a given period with the behavior prediction network obtained in the previous round to obtain the corresponding offline flow control prediction data, and the environment simulator then generates the offline sample data of the next offline period from the current period's sample data and prediction data. The next period's sample data is processed in the same way, iterating continuously until the offline sample data of every period is obtained.
It should be noted that, because the offline cumulative reward reference values are calculated in reverse order, a cutoff offline period may be set for each round of offline training; the offline sample data of the cutoff period is input into the Critic network obtained in the previous round, i.e., the Exploitation Critic network, which outputs the offline cumulative reward predicted value of the cutoff period to serve as that period's offline cumulative reward reference value in the current round of offline training.
The computer device may then sequentially calculate, from the offline sample data of each period, the offline reward reference value of each offline period through formula (4). With the offline cumulative reward reference value of the cutoff offline period and the offline reward reference values of all offline periods in the current round available, the offline cumulative reward reference value of every offline period in the current round can be obtained by evaluating formula (5) in reverse order.
It will be appreciated that, in practice, a preset value may also be used directly as the offline cumulative reward reference value of the cutoff offline period, which the embodiments of the present application do not limit. Once the offline cumulative reward reference values needed for the current round have been collected, that round's offline training can proceed.
It will be appreciated that the historical cumulative reward reference values mentioned in the above embodiments may also be calculated with reference to the above process. The difference is that the above process requires setting a cutoff offline period, whereas the time span of the historical sample data is known, so no preset cutoff historical period is needed: the end of the multimedia communication process is taken as the last historical period, and its historical cumulative reward reference value is set to 0 or another preset value. The historical reward reference value of each historical period is then obtained through formula (4), and the historical cumulative reward reference values of all historical periods are calculated in reverse order based on formula (5).
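A NumPy sketch of formulas (4) and (5) follows, computing per-period rewards and then cumulative rewards in reverse order from a bootstrap value (the cutoff period's Critic prediction or a preset value). The tanh stand-in for the normalization $R(\cdot)$, all weights, and the sample values are placeholders.

```python
import numpy as np

def normalize(x):                    # stand-in for the normalization R(.) in formula (4)
    return np.tanh(np.asarray(x, dtype=float))

def period_reward(b_t, b_prev, stall, qa, qv, w=(1.0, 1.0, 0.5, 0.5, 0.5)):
    lb, lf, ls, la, lv = w           # lambda_b, lambda_f, lambda_s, lambda_a, lambda_v
    return (lb * normalize(b_t) - lf * normalize(stall)
            - ls * normalize(b_t - b_prev) + la * normalize(qa) + lv * normalize(qv))

def cumulative_rewards(rewards, bootstrap=0.0):
    """Formula (5), evaluated in reverse order: G_t = reward_t + G_{t+1}."""
    out, g = [], bootstrap           # bootstrap = cutoff period's reference value
    for r in reversed(rewards):
        g = r + g
        out.append(g)
    return list(reversed(out))

bitrates = [0.8, 1.2, 1.0, 1.5]      # illustrative per-period coding rates
rewards = [period_reward(b, bp, stall=0.02, qa=0.7, qv=0.6)
           for b, bp in zip(bitrates, [bitrates[0]] + bitrates[:-1])]
print(cumulative_rewards(rewards, bootstrap=1.0))
```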
In the scheme provided by the embodiments of the present application, the definition, fluency and smoothness of the video are all embodied in the reward function, so that the output flow control data better suits complex and changeable real network environments and the flow control effect is better.
In some embodiments, the basic flow control model includes a behavior prediction network and a behavior evaluation network, and performing offline period rolling training on the basic flow control model based on the offline sample data and offline cumulative reward reference values corresponding to each of a plurality of offline periods, until an offline training stop condition is reached, to obtain the intermediate flow control model includes:
Processing the offline sample data of the current offline period through the behavior prediction network of the current offline period to obtain the offline flow control prediction data of the current offline period; determining a second cross entropy loss based on the offline flow control prediction data of the current offline period, and determining a first offline objective function according to the second cross entropy loss; processing the offline sample data of the current offline period through the behavior evaluation network of the current offline period to obtain the offline cumulative reward predicted value of the current offline period; determining a second reward loss based on the offline cumulative reward predicted value and the offline cumulative reward reference value of the current offline period, and constructing a second offline objective function based on the second reward loss; and training the behavior prediction network through the first offline objective function and the behavior evaluation network through the second offline objective function until the offline training stop condition is reached, obtaining the intermediate flow control model.
As shown in FIG. 8, taking the current offline period as the t-th offline period as an example, the computer device inputs $state_t$ into the Actor network, which outputs the offline flow control prediction data $action_t$ of the t-th offline period. If there are T offline periods in total, the second cross entropy loss over all offline periods can be written with reference to the following formula (6):

$\mathrm{loss}_{ce} = \frac{1}{T}\sum_{t=1}^{T}\sum_{a}\rho_t(a)\log\rho_t(a) \quad (6)$
In formula (6), $\rho_t$ represents the probability distribution of $action_t$ in the t-th offline period. Since $action_t$, i.e., the offline flow control prediction data, may include several kinds of data such as the resolution, frame rate and coding rate, each kind of data included in the prediction can be given its own second cross entropy loss; in practice there may accordingly be 3 $\mathrm{loss}_{ce}$ terms. It should be noted that the second cross entropy loss mainly keeps the predicted probabilities exploring widely (exploration) in the action space, without excessively favoring any single type of action.
After deriving the second cross entropy loss, the computer device may determine the first offline objective function. In actual implementation, the loss function of the offline training algorithm may be combined with the second cross entropy loss to obtain the first offline objective function. The offline training algorithm may be a proximal policy optimization (PPO) algorithm, a soft actor-critic algorithm, a twin delayed deep deterministic policy gradient algorithm, or the like, which is not specifically limited in the embodiments of the present application.
As shown in FIG. 8, taking the current offline period as the t-th offline period as an example, the computer device inputs state_t into the Critic network, which outputs the offline accumulated rewards predicted value of the t-th offline period. The second reward loss of each offline period may be determined from the offline accumulated rewards predicted value and the offline accumulated rewards reference value of each offline period, thereby constructing the second offline objective function. The second offline objective function may refer to the following equation (7):
In the above formula (7), accumulated_reward_t_predicted and accumulated_reward_t_reference respectively represent the offline accumulated rewards predicted value and the offline accumulated rewards reference value of the t-th offline period.
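Formula (7) is likewise characterized only through its variables; a minimal sketch, assuming a mean-squared-error form over the T offline periods:

    import numpy as np

    def second_reward_loss(predicted, reference):
        # loss_critic: assumed MSE between predicted and reference
        # accumulated rewards across all offline periods.
        predicted = np.asarray(predicted, dtype=float)
        reference = np.asarray(reference, dtype=float)
        return float(np.mean((predicted - reference) ** 2))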
In the scheme provided by the embodiments of the application, offline training allows the flow control model to be trained as fully as possible before it is formally applied online, improving the accuracy of flow control decisions. In addition, the offline sample data can be randomly perturbed to simulate sudden network anomalies, and training the basic flow control model with such randomly simulated offline sample data forces it to adapt to abnormal network conditions and to learn self-adjustment.
In some embodiments, a construction step of the first offline objective function provided in another exemplary embodiment of the present application is shown. The step includes the following:
processing the offline sample data of the current offline period through the behavior prediction network of the previous round to obtain offline flow control experience data of the current offline period; determining a contrast loss based on the offline flow control experience data corresponding to the current offline period, the probability distribution of the offline flow control experience data, and the probability distribution of the offline flow control prediction data; determining the strengthening loss according to the contrast loss and the advantage function value; correcting the strengthening loss based on a positive and negative rewarding mechanism to obtain a corrected strengthening loss; and constructing the first offline objective function according to the second cross entropy loss and the corrected strengthening loss.
Specifically, as shown in FIG. 8, the computer device inputs state_t into the Exploitation Actor network (i.e., the behavior prediction network obtained in the previous training round), which outputs action_et, the offline flow control experience data of the current offline period. The contrast loss may refer to the following equation (8):
in the above formula (8), ρ_t represents the probability distribution of the offline flow control prediction data, ρ_et represents the probability distribution of the offline flow control experience data, and action_et represents the offline flow control experience data of the t-th offline period, which may be represented by a one-hot vector.
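Given that the clipping interval in formula (9) below is centered on 1, the contrast loss plausibly takes a probability-ratio form; a sketch under that assumption:

    import numpy as np

    def contrast_loss(rho_t, rho_et, action_et_onehot):
        # Probability of the experience action under the current behavior
        # prediction network versus under the previous-round (Exploitation
        # Actor) network; the one-hot vector selects that action's probability.
        p_new = float(np.dot(rho_t, action_et_onehot))
        p_old = float(np.dot(rho_et, action_et_onehot))
        return p_new / max(p_old, 1e-8)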
After the contrast loss is obtained, the strengthening loss can be determined from the contrast loss and the advantage function value. Referring to FIG. 8, the advantage function value adv of each offline period is the difference between the offline accumulated rewards reference value and the offline accumulated rewards predicted value of that offline period. The embodiments of the present application do not specifically limit the manner of determining the strengthening loss from the contrast loss and the advantage function value, which includes but is not limited to the following: the computer device may calculate a first product between the contrast loss and the advantage function value, calculate a second product between a truncated value of the contrast loss and the advantage function value, and select the smaller of the first product and the second product as the strengthening loss. Specifically, the above procedure may refer to the following formula (9):
loss_ppo = min(loss_cmp * adv, clip(loss_cmp, 1 - eps, 1 + eps) * adv); (9)
In the above formula (9), loss_cmp represents the contrast loss, and clip(loss_cmp, 1 - eps, 1 + eps) represents the truncated value of the contrast loss, where clip denotes the truncation function and eps denotes the interval half-width. When the value of loss_cmp falls outside the interval formed by 1 - eps and 1 + eps, loss_cmp is truncated so that after truncation its value lies within this interval.
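Formula (9) can be computed directly; a minimal sketch, where the value of eps is illustrative:

    import numpy as np

    def strengthening_loss(loss_cmp, adv, eps=0.2):
        # Take the minimum of the unclipped and clipped products, per (9).
        clipped = float(np.clip(loss_cmp, 1.0 - eps, 1.0 + eps))
        return min(loss_cmp * adv, clipped * adv)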
It should be noted that the above-mentioned advantage function mainly corresponds to the positive and negative rewarding mechanism: a negative advantage function value can be understood as a negative reward, and otherwise as a positive reward. To further highlight the positive and negative rewarding mechanism of the advantage function, the computer device may correct the strengthening loss. Denoting the corrected strengthening loss as loss_dual, the first offline objective function may be constructed in combination with the second cross entropy loss; specifically, refer to the following formula (10):
loss_actor = loss_dual + λ_ce * loss_ce; (10)
in the above formula (10), λ_ce represents the weight coefficient of the second cross entropy loss.
In the above embodiment, the first offline objective function is constructed by combining the second cross entropy loss and the corrected strengthening loss. In the current offline period, the convergence target of the Actor network therefore rests on both the cross entropy loss of the predicted actions (the three flow control strategies) and the strengthening loss that combines the advantage function value (advantage), the predicted action, and the experience action, which can significantly improve the training effect of the Actor network.
In some embodiments, correcting the strengthening loss based on the positive and negative rewarding mechanism to obtain the corrected strengthening loss includes: if the current strengthening loss is a negative value, taking the larger value between the current strengthening loss and a preset multiple of the second reward loss as the corrected strengthening loss; if the current strengthening loss is a non-negative value, keeping the current strengthening loss unchanged.
The above process can specifically refer to the following formula (11):

loss_dual = max(x, 3 * loss_critic) if x < 0; loss_dual = x if x ≥ 0; (11)

In the above formula (11), x represents loss_ppo, i.e., the strengthening loss, and 3 * loss_critic represents the second reward loss of the preset multiple.
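Putting formulas (10) and (11) together; the value of lambda_ce below is illustrative, as the text does not specify it:

    def corrected_strengthening_loss(loss_ppo, loss_critic):
        # Formula (11): a negative strengthening loss is floored by three
        # times the second reward loss; a non-negative one passes through.
        if loss_ppo < 0:
            return max(loss_ppo, 3.0 * loss_critic)
        return loss_ppo

    def first_offline_objective(loss_dual, loss_ce, lambda_ce=0.01):
        # Formula (10): loss_actor = loss_dual + lambda_ce * loss_ce.
        return loss_dual + lambda_ce * loss_ce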
In the scheme provided by the embodiments of the application, correcting the strengthening loss highlights the positive and negative rewarding mechanism of the advantage function, so the training effect is better.
It should be noted that the above describes the offline training process and its various loss functions; for online period rolling training, all other aspects may follow the offline training process, except that the sample data is collected online rather than produced by offline simulation. Specifically, after offline training is completed, the intermediate flow control model obtained by offline training can be deployed in a background server for real-time video calls. The intermediate flow control model predicts flow control decisions based on the real-time network state and client state and applies them to the clients, while collecting the newly changed states over a fixed period of time. Based on the paired historical data collected in that fixed period, the flow control decision AI agent is continuously trained and updated, so that the prediction effect of the deployed model keeps improving, thereby improving the quality of real-time video calls and the user experience.
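A schematic sketch of this deploy-collect-retrain loop; the client and model methods shown are hypothetical names for illustration, not an actual server API:

    def online_rolling_training(model, client, period_seconds=60):
        # Predict flow control from the incoming states, apply it to the
        # client, collect paired data for a fixed period, then update.
        while client.session_active():
            batch = []
            for state in client.collect_states(period_seconds):
                action = model.predict(state)
                client.apply_flow_control(action)
                batch.append((state, action))
            model.update(batch)  # one online period rolling training step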
In some embodiments, the step of generating offline sample data of the next offline period based on the offline sample data of the current offline period and the offline flow control prediction data output by the basic flow control model may be implemented by a generator in the environment simulator. There is thus shown an environment simulator training step provided in accordance with another exemplary embodiment of the present application, the step comprising:
communication state data and coding data corresponding to the same training period in the online acquisition data are formed into environment sample data; for the current training period, determining environmental flow control data corresponding to the current training period based on the coded data of the next training period of the current training period; processing the environmental sample data and the environmental flow control data of the current training period through a generator in the environmental simulator to be trained to generate environmental prediction data of the next training period; determining environmental loss according to the difference between the environmental prediction data of each training period and the environmental sample data of each training period;
determining generator loss according to the degree of realism achieved when the generator generates the environment prediction data of each training period, and determining discriminator loss according to the discriminator in the environment simulator to be trained when it discriminates the environment prediction data and the environment sample data of each training period; constructing a target loss function of the environment simulator to be trained based on the environmental loss, the generator loss, and the discriminator loss, and performing iterative adversarial training on the environment simulator to be trained based on the target loss function until the training stop condition is reached, so as to obtain the trained environment simulator.
The online collected data mentioned in the above process and the historical report data mentioned in the previous embodiments are both historical data, and their contents may be the same or different, which is not specifically limited in the embodiments of the present application. For the current training period t, the communication state data of the t-th training period may be recorded as state_t, the environmental flow control data of the t-th training period as action_t, and the environment prediction data of the (t+1)-th training period, i.e., the next training period, as state'_{t+1}.
Thus, the computer device may determine the difference between the environment prediction data and the environment sample data of each historical period with reference to the following equation (12):

loss_mse = (1/N) * Σ_{i=1..N} (y_i − y'_i)²; (12)

In the above formula (12), loss_mse represents the environmental loss, N represents the total number of historical periods, i represents the i-th historical period, y_i represents the environment sample data of the i-th historical period, and y'_i represents the environment prediction data of the i-th historical period.
In addition, the generator loss may refer to the following equation (13), and the discriminator loss may refer to the following equation (14):
it will be appreciated that for equation (14), the training objective is primarily to want the discriminator to discriminate birth as much as possibleThe environment prediction data generated by the constructor is false, and the environment sample data that is truly present is identified as true. For equation (13), the training objective is primarily to want the generator to generate as much environmental prediction data as possible that makes the discriminator discriminate as true. It should be noted that the network structures of the generator and the arbiter in the environment simulator may be based on a transcoder codec (Encoder-Decoder), except that the outputs of the two are not identical. The output of the decoding side in the generator is AND state' t+1 The data in the same dimension space, and the output of the decoding end of the discriminator is connected with the softmax layer after passing through the two full-connection layers, so that the true and false classification discrimination can be performed, and the structure of the environment simulator can be specifically referred to fig. 9.
In the scheme provided by the embodiments of the application, because the environment simulator can be constructed on the structure of a generative adversarial network to produce offline sample data in which network fluctuations may occur at any time, the flow control model can be forced to adapt to abnormal network conditions and to learn self-adjustment, thereby improving the flow control effect.
In a specific embodiment, a method for training a flow control model is provided, which includes the following steps:
historical communication state data and historical coding data corresponding to the same historical period in the historical report data are formed into historical sample data; for a current historical period, historical flow control reference data corresponding to the current historical period is determined based on historical encoding data of a next historical period of the current historical period.
For the current offline period, determining the definition of the current offline period according to the coding code rate of the current offline period; determining the fluency of the current offline period according to the blocking rate of the current offline period; determining the smoothness of the current offline period according to the coding code rate of the current offline period and the coding code rate of the next offline period; calculating an offline reward reference value of the current offline period according to the definition, fluency, smoothness, video quality-free reference score, and audio quality-free reference score of the current offline period; and determining the offline accumulated rewards reference value of the current offline period based on the offline accumulated rewards reference value of the next offline period and the offline reward reference value of the current offline period.
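A minimal sketch of the per-period offline reward computation above, assuming the five quality indicators have already been normalized and using illustrative equal weights (the actual weights are not specified here):

    import numpy as np

    def offline_reward(definition, fluency, smoothness,
                       video_nr_score, audio_nr_score,
                       weights=(0.2, 0.2, 0.2, 0.2, 0.2)):
        # Weighted sum of the five normalized quality indicators.
        factors = np.array([definition, fluency, smoothness,
                            video_nr_score, audio_nr_score])
        return float(np.dot(np.asarray(weights), factors))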
Taking the historical sample data, the historical flow control reference data, and the historical accumulated rewards reference value corresponding to the same historical period as a group of pre-training sample groups; processing the historical sample data in the pre-training sample group based on the behavior prediction network in the initial flow control model to be trained and outputting historical flow control prediction data; determining a first cross entropy loss according to the historical flow control prediction data and the historical flow control reference data corresponding to each pre-training sample group; processing the historical sample data in the pre-training sample group based on the behavior evaluation network in the initial flow control model to be trained and outputting a historical accumulated rewards predicted value; determining a first reward loss based on the difference between the historical accumulated rewards predicted value and the corresponding historical accumulated rewards reference value; and constructing a pre-training loss function based on the first cross entropy loss and the first reward loss.
Pre-training the initial flow control model through a pre-training loss function until reaching a pre-training stopping condition, and stopping to obtain a basic flow control model; communication state data and coding data corresponding to the same training period in the online acquisition data are formed into environment sample data; for a current training period, determining environmental flow control data corresponding to the current training period based on encoded data of a next training period of the current training period.
Processing the environmental sample data and the environmental flow control data of the current training period through the generator in the environment simulator to be trained to generate environment prediction data of the next training period; determining environmental loss according to the difference between the environment prediction data of each training period and the environment sample data of each training period; determining generator loss according to the degree of realism achieved when the generator generates the environment prediction data of each training period, and determining discriminator loss according to the discriminator in the environment simulator to be trained when it discriminates the environment prediction data and the environment sample data of each training period; constructing a target loss function of the environment simulator to be trained based on the environmental loss, the generator loss, and the discriminator loss, and performing iterative adversarial training on the environment simulator to be trained based on the target loss function until the training stop condition is reached, so as to obtain the trained environment simulator;
for each offline period in the offline training, generating, through the environment simulator, offline sample data of the next offline period based on the offline sample data of the current offline period and the offline flow control prediction data output by the basic flow control model; the offline sample data includes offline coding data and offline communication state data; determining an offline accumulated rewards reference value of each offline period according to the offline sample data of each offline period; processing the offline sample data of the current offline period through the behavior prediction network of the current offline period to obtain offline flow control prediction data of the current offline period; and determining a second cross entropy loss based on the offline flow control prediction data of the current offline period.
Processing the offline sample data of the current offline period through the behavior prediction network of the previous offline period to obtain offline flow control experience data of the current offline period; determining a contrast loss based on the offline flow control experience data corresponding to the current offline period, the probability distribution of the offline flow control experience data, and the probability distribution of the offline flow control prediction data; determining the strengthening loss according to the contrast loss and the advantage function value, the advantage function value being the difference between the offline accumulated rewards reference value and the offline accumulated rewards predicted value of each offline period; if the current strengthening loss is a negative value, taking the larger value between the current strengthening loss and a preset multiple of the second reward loss as the corrected strengthening loss; if the current strengthening loss is a non-negative value, keeping the current strengthening loss unchanged; and constructing the first offline objective function according to the second cross entropy loss and the corrected strengthening loss.
Processing the offline sample data of the current offline period through the behavior evaluation network of the current offline period to obtain an offline accumulated rewards predicted value of the current offline period; determining a second reward loss based on the offline accumulated rewards predicted value of the current offline period and the offline accumulated rewards reference value of the current offline period, and constructing a second offline objective function based on the second reward loss; training the behavior prediction network through the first offline objective function, and training the behavior evaluation network through the second offline objective function, until the offline training stop condition is reached, to obtain the intermediate flow control model.
For each online period in online training, determining an online accumulated rewards reference value of the current online period through online sample data of the current online period; and performing online period rolling training on the intermediate flow control model based on online sample data and online accumulated rewards reference values corresponding to the online periods respectively until the online training stopping condition is reached, and obtaining a target flow control model suitable for flow control data prediction in the multimedia communication process.
According to the flow control model training method, pre-training is adopted first so that the basic flow control model approaches the decision mode of the previous version of the flow control model; this avoids putting a flow control model that has not been pre-trained into use, which could cause user discomfort and degrade the product experience. Offline training after pre-training allows the flow control model to be trained as fully as possible before it is formally applied online, improving the accuracy of flow control decisions. In addition, the offline sample data can be randomly perturbed to simulate sudden network anomalies, and training the basic flow control model with such randomly simulated offline sample data forces it to adapt to abnormal network conditions and to learn self-adjustment. After offline training ends, online training is finally performed; since the flow control model can be continuously updated during online training, its prediction effect can keep improving, further improving multimedia communication quality and user experience.
In addition, in the flow control processing process, different types of coded data and communication state data can be distinguished in importance degree through an attention mechanism, so that the output flow control data can be more suitable for complex and changeable real network environments, and the flow control effect is better. In addition, the self-attention mechanism is adopted, so that the self-attention mechanism can capture the correlation inside the data and reduce the dependence on external information, and the output flow control data can be more suitable for complex and changeable real network environments.
Furthermore, correcting the strengthening loss highlights the positive and negative rewarding mechanism of the advantage function, so the training effect is better.
Finally, the definition, fluency and smoothness of the video can be comprehensively embodied in the reward function, so that the output flow control data can be more suitable for complex and changeable real network environments, and the flow control effect is better.
The flow control model can be trained in a triple training mode of pre-training, off-line training and on-line training, so that the accuracy of the prediction result of the flow control model can be improved. The different types of coded data and communication state data can be distinguished in importance degree through the attention mechanism in the flow control processing process, so that the output flow control data can be more suitable for complex and changeable real network environments, and the flow control effect is better. In addition, the self-attention mechanism is adopted, so that the self-attention mechanism can capture the correlation inside the data and reduce the dependence on external information, and the output flow control data can be more suitable for complex and changeable real network environments.
For an audio or video real-time call scene, when a user makes an audio or video call with an everyday terminal device, the user may be in any of various geographic environments, network types, and terminal types, such as plain or deep mountain, indoor or outdoor, city or country; wired network, Wi-Fi, 5G, or 4G; and different mobile phone models. While engaged in the audio/video call, the user naturally wants a high QoE (Quality of Experience), such as an audio/video stream with high definition and no stuttering. Therefore, in combination with an audio/video real-time call scene, the embodiments of the application further provide a method for performing flow control on video call data. The method is described as applied to a computer device, which may specifically be the terminal or the server in FIG. 1, and includes the following steps:
in the video call process, the computer equipment acquires the frame rate, the coding code rate, the resolution, the coding type, the video quality-free reference score and the audio quality-free reference score in the current period and takes the frame rate, the coding code rate, the resolution, the coding type, the video quality-free reference score and the audio quality-free reference score as current coding data; the method comprises the steps that a computer device obtains packet loss rate, network delay information, network jitter information and card frame rate generated by performing flow control processing on current coded data and takes the packet loss rate, the network delay information, the network jitter information and the card frame rate as communication state data; the current encoded data is determined based on the encoding rate, resolution, and frame rate of the previous period.
And the computer equipment performs combination processing on the communication state data and the current coding data in the current period to obtain to-be-processed combined data.
The computer equipment performs self-attention mechanism processing on the combined data to be processed to obtain intermediate processing characteristics, performs multi-task classification based on the intermediate processing characteristics, outputs the coding rate, resolution and frame rate of the next period, and triggers the flow control processing on the video call data generated in the next period in the video real-time call process.
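A schematic sketch of one control step of this method; the feature names and model methods below are illustrative, not the actual implementation:

    def video_call_control_step(model, encoder_stats, network_stats):
        # Assemble current coded data and communication state data, then let
        # the flow control model pick next-period coding rate, resolution,
        # and frame rate via self-attention and multi-task classification.
        coded = [encoder_stats[k] for k in
                 ("frame_rate", "coding_rate", "resolution", "coding_type",
                  "video_nr_score", "audio_nr_score")]
        state = [network_stats[k] for k in
                 ("packet_loss_rate", "delay", "jitter", "stuck_frame_rate")]
        combined = coded + state                    # combination processing
        features = model.self_attention(combined)   # intermediate features
        return model.multi_task_heads(features)     # next-period flow control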
It can be appreciated that there may be other multimedia communication scenes besides audio/video real-time call scenes, such as live scenes and audio/video real-time on-demand scenes. Taking an audio real-time on-demand scene as an example, and combining with the audio real-time on-demand scene, the embodiment of the application also provides a method for performing stream control on audio on-demand data, and taking application of the method to computer equipment as an example for explanation, wherein the computer equipment can be specifically a terminal or a server in fig. 1, and the method comprises the following steps:
in the audio on-demand process, the computer equipment acquires the audio frame rate, the audio coding type and the audio quality-free reference score in the current period and takes the audio frame rate, the audio coding type and the audio quality-free reference score as current coding data; the method comprises the steps that a computer device obtains packet loss rate, network delay information, network jitter information and card frame rate generated by performing flow control processing on current coded data and takes the packet loss rate, the network delay information, the network jitter information and the card frame rate as communication state data; the current coding data is determined based on the audio frame rate and the audio coding rate of the previous period;
The computer equipment combines the communication state data in the current period with the current coding data to obtain combined data to be processed;
the computer equipment carries out self-attention mechanism processing on the combined data to be processed to obtain intermediate processing characteristics, carries out multi-task classification based on the intermediate processing characteristics, outputs the audio frame rate and the audio coding rate of the next period, and triggers the stream control processing on the audio on-demand data generated in the next period in the audio real-time on-demand communication process.
It should be understood that, although the steps in the flowcharts of FIGS. 2 and 6 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and the steps may be performed in other orders. Moreover, at least some of the steps in FIGS. 2 and 6 may include a plurality of sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; nor is the order of their execution necessarily sequential, as they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In some embodiments, as shown in fig. 10, an apparatus 1000 for streaming multimedia data is provided, where the apparatus may use a software module or a hardware module, or a combination of both, as a part of a computer device, and specifically includes: an acquisition module 1002, a combination processing module 1004, a self-attention mechanism processing module 1006, a multi-task classification module 1008, and a flow control processing module 1010, wherein:
an obtaining module 1002, configured to obtain current encoded data in a current period and communication status data generated by performing a flow control process on the current encoded data in a multimedia communication process; the current coding data is determined based on the historical flow control data of the previous period;
the combination processing module 1004 is configured to perform combination processing on the communication status data and the current encoded data in the current period to obtain combined data to be processed;
a self-attention mechanism processing module 1006, configured to perform self-attention mechanism processing on the combined data to be processed, so as to obtain an intermediate processing feature;
a multi-task classification module 1008 for performing multi-task classification based on the intermediate processing features, and outputting target flow control data of at least one dimension;
The flow control processing module 1010 is configured to trigger flow control processing on the multimedia data generated in the next period in the multimedia communication process based on the target encoded data determined by the target flow control data.
In some embodiments, the combination processing module 1004 is configured to perform convolution processing on the current encoded data and the communication state data, so as to obtain respective corresponding convolution processing features; combining the convolution processing features results in combined data to be processed.
In some embodiments, the self-attention mechanism processing module 1006 is configured to perform encoding processing on the combined data to be processed through at least one self-attention module to obtain an intermediate processing feature; when a plurality of self-attention modules exist, the self-attention modules are sequentially connected, input data of the first self-attention module is to-be-processed combined data, input data of the non-first self-attention module is output characteristics after the coding processing of the connected previous self-attention module, and the output characteristics of the last self-attention module are intermediate processing characteristics.
In some embodiments, the self-attention module includes a multi-head attention layer and a forward processing layer; the self-attention mechanism processing module 1006 includes:
The multi-head attention processing unit is used for performing multi-head attention processing on corresponding input data through a multi-head attention layer in the self-attention module to obtain a multi-head attention processing result;
and the forward processing unit is used for performing forward processing on the multi-head attention processing result through the forward processing layer in the self-attention module to obtain the output characteristics of the corresponding self-attention module.
In some embodiments, the multi-head attention processing unit is specifically configured to perform full-connection processing on corresponding input data through the multi-head attention layer in the self-attention module to obtain a full-connection feature; split the full-connection feature to obtain a plurality of full-connection sub-features; perform scaled dot-product attention processing on each full-connection sub-feature to obtain a plurality of multi-head attention sub-features; splice the plurality of multi-head attention sub-features to obtain a multi-head attention feature; and perform full-connection processing on the multi-head attention feature to obtain the multi-head attention processing result.
In some embodiments, the multi-head attention processing unit is further configured to, for any full-connection sub-feature, perform matrix multiplication on the corresponding full-connection sub-feature and itself to obtain a matrix multiplication result; performing scale transformation on the matrix multiplication result to obtain a scale transformation result; mapping the scale transformation result into a weight matrix corresponding to the corresponding full-connection sub-feature through a first activation function; multiplying the weight matrix with the corresponding full-connection sub-feature to obtain the multi-head attention sub-feature corresponding to the corresponding full-connection sub-feature.
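A minimal sketch of one head of the scaled dot-product attention just described, where the sub-feature attends to itself; NumPy arrays are assumed:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - np.max(x, axis=axis, keepdims=True))
        return e / np.sum(e, axis=axis, keepdims=True)

    def single_head_self_attention(sub_feature):
        # sub_feature: (sequence_length, d) full-connection sub-feature.
        d = sub_feature.shape[-1]
        scores = sub_feature @ sub_feature.T   # matrix multiply with itself
        scores = scores / np.sqrt(d)           # scale transformation
        weights = softmax(scores, axis=-1)     # first activation function
        return weights @ sub_feature           # multi-head attention sub-feature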
In some embodiments, the multi-task classification module 1008 is configured to perform residual processing on the intermediate processing feature for each dimension of the at least one dimension, to obtain a residual processing result corresponding to each dimension; carrying out full connection processing on each residual processing result to obtain a corresponding full connection processing result; and mapping the full-connection processing result into target flow control data corresponding to the corresponding dimension through a second activation function.
For specific limitations on the apparatus for performing the streaming of multimedia data, reference may be made to the above limitation on the method for performing the streaming of multimedia data, and detailed descriptions thereof are omitted herein. The above-mentioned means for streaming multimedia data may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In some embodiments, as shown in fig. 11, a flow control model training apparatus 1100 is provided, which may employ a software module or a hardware module, or a combination of both, as part of a computer device, and specifically includes: an acquisition module 1102, a generation module 1104, a first determination module 1106, an offline training module 1108, a second determination module 1110, and an online training module 1112, wherein:
an obtaining module 1102, configured to obtain a basic flow control model obtained by performing pre-training through a plurality of pre-training sample sets;
the generating module 1104 is configured to generate, for each offline period in the offline training, offline sample data of a next offline period based on the offline sample data of the current offline period and the offline flow control prediction data output by the basic flow control model; the offline sample data comprises offline coding data and offline communication state data;
a first determining module 1106 configured to determine an offline cumulative rewards reference value for each offline period according to the offline sample data for each offline period;
the offline training module 1108 is configured to perform offline periodic rolling training on the basic flow control model based on offline sample data and offline accumulated rewards reference values corresponding to a plurality of offline periods, until an offline training stop condition is reached, and obtain an intermediate flow control model;
A second determining module 1110, configured to determine, for each online period in the online training, an online cumulative bonus reference value of the current online period through online sample data of the current online period;
and the online training module 1112 is configured to perform online period rolling training on the intermediate flow control model based on online sample data and online accumulated rewards reference values corresponding to a plurality of online periods respectively, and stop until reaching an online training stop condition, so as to obtain a target flow control model suitable for flow control data prediction in the multimedia communication process.
In some embodiments, the base flow control model includes a behavior prediction network and a behavior evaluation network, the behavior prediction network and the behavior evaluation network sharing an encoding structure; the behavior prediction network further includes a multi-head classification structure, and the behavior evaluation network further includes a single-task processing structure; the encoding structure includes at least one self-attention module connected in sequence, and each self-attention module includes a multi-head attention layer and a forward processing layer.
In some embodiments, the basic flow control model is obtained through pre-training, and the device further comprises a pre-training module; the pre-training module comprises:
the combination unit is used for combining the historical communication state data and the historical coding data which correspond to the same historical period in the historical report data into historical sample data;
A first determining unit, configured to determine, for a current history period, historical flow control reference data corresponding to the current history period based on history coding data of a next history period of the current history period;
a second determining unit, configured to determine a historical cumulative bonus reference value corresponding to a current history period according to history sample data of the current history period;
the third determining unit is used for taking the historical sample data, the historical flow control reference data and the historical accumulated rewards reference value corresponding to the same historical period as a group of pre-training sample groups;
and the pre-training unit is used for pre-training the initial flow control model to be trained according to a plurality of groups of pre-training sample groups until reaching the pre-training stopping condition, and obtaining the basic flow control model.
In some embodiments, the pre-training unit is configured to process historical sample data in the pre-training sample group based on the behavior prediction network in the initial flow control model to be trained and output historical flow control prediction data; determine a first cross entropy loss according to the historical flow control prediction data and the historical flow control reference data corresponding to each pre-training sample group; process the historical sample data in the pre-training sample group based on the behavior evaluation network in the initial flow control model to be trained and output a historical accumulated rewards predicted value; determine a first reward loss based on the difference between the historical accumulated rewards predicted value and the corresponding historical accumulated rewards reference value; construct a pre-training loss function based on the first cross entropy loss and the first reward loss; and pre-train the initial flow control model through the pre-training loss function until the pre-training stop condition is reached, to obtain the basic flow control model.
In some embodiments, the first determining module 1106 is configured to determine, for the current offline period, the definition of the current offline period according to the coding code rate of the current offline period; determine the fluency of the current offline period according to the blocking rate of the current offline period; determine the smoothness of the current offline period according to the coding code rate of the current offline period and the coding code rate of the next offline period; calculate an offline reward reference value of the current offline period according to the definition, fluency, smoothness, video quality-free reference score, and audio quality-free reference score of the current offline period; and determine the offline accumulated rewards reference value of the current offline period based on the offline accumulated rewards reference value of the next offline period and the offline reward reference value of the current offline period.
In some embodiments, the base flow control model includes a behavior prediction network and a behavior evaluation network; offline training module 1108, comprising:
the first processing unit is used for processing the offline sample data of the current offline period through the behavior prediction network of the current offline period to obtain offline flow control prediction data of the current offline period;
A fourth determining unit, configured to determine a second cross entropy loss based on offline flow control prediction data of a current offline period;
a fifth determining unit for determining a first offline objective function according to the second cross entropy loss;
the second processing unit is used for processing the offline sample data of the current offline period through the behavior evaluation network of the current offline period to obtain an offline accumulated rewards predicted value of the current offline period;
a sixth determining unit for determining a second reward loss based on the offline accumulated rewards predicted value of the current offline period and the offline accumulated rewards reference value of the current offline period;
a construction unit for constructing a second offline objective function based on the second reward loss;
and the offline training unit is used for training the behavior prediction network through the first offline objective function, training the behavior evaluation network through the second offline objective function, and stopping until the offline training stopping condition is reached, so as to obtain the intermediate flow control model.
In some embodiments, offline training module 1108 further comprises:
the third processing unit is used for processing the offline sample data of the current offline period through the behavior prediction network of the previous round to obtain the offline flow control experience data of the current offline period;
A seventh determining unit, configured to determine a contrast loss based on the offline flow control experience data corresponding to the current offline period, the probability distribution of the offline flow control experience data, and the probability distribution of the offline flow control prediction data;
an eighth determining unit, configured to determine the strengthening loss according to the contrast loss and the advantage function value; the advantage function value is the difference between the offline accumulated rewards reference value and the offline accumulated rewards predicted value of each offline period;
the correction unit is used for correcting the strengthening loss based on the positive and negative rewarding mechanism to obtain corrected strengthening loss;
correspondingly, a fifth determining unit is used for constructing a first offline objective function according to the second cross entropy loss and the corrected strengthening loss.
In some embodiments, the correction unit is configured to, when the current strengthening loss is a negative value, use the larger value between the current strengthening loss and a preset multiple of the second reward loss as the corrected strengthening loss; and when the current strengthening loss is a non-negative value, keep the current strengthening loss unchanged.
In some embodiments, the step of generating offline sample data for a next offline period is implemented by a generator in the environmental simulator based on the offline sample data for the current offline period and the offline flow control prediction data output by the base flow control model; correspondingly, the device also comprises an environment simulator training module;
The environment simulator training module is used for forming environment sample data from the communication state data and coding data corresponding to the same training period in the online collected data; for the current training period, determining environmental flow control data corresponding to the current training period based on the coded data of the next training period; processing the environmental sample data and the environmental flow control data of the current training period through the generator in the environment simulator to be trained to generate environment prediction data of the next training period; determining environmental loss according to the difference between the environment prediction data of each training period and the environment sample data of each training period; determining generator loss according to the degree of realism achieved when the generator generates the environment prediction data of each training period, and determining discriminator loss according to the discriminator in the environment simulator to be trained when it discriminates the environment prediction data and the environment sample data of each training period; and constructing a target loss function of the environment simulator to be trained based on the environmental loss, the generator loss, and the discriminator loss, and performing iterative adversarial training on the environment simulator to be trained based on the target loss function until the training stop condition is reached, so as to obtain the trained environment simulator.
In one embodiment, a computer device is provided, which may be a terminal or a server, and the internal structure of which may be as shown in fig. 12. The computer device includes a processor, a memory, and a communication interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a method of streaming multimedia data.
It will be appreciated by those skilled in the art that the structure shown in fig. 12 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In some embodiments, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In some embodiments, a computer readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
In some embodiments, a computer program product or computer program is provided that includes computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps in the above-described method embodiments.
Those skilled in the art will appreciate that implementing all or part of the above methods may be accomplished by a computer program stored on a non-transitory computer-readable storage medium which, when executed, may include the flows of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. Volatile memory may include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples merely represent a few embodiments of the present application, and their descriptions are relatively specific and detailed, but they are not to be construed as limiting the scope of the invention. It should be noted that those of ordinary skill in the art may make various modifications and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.
Claims (15)
1. A method of training a flow control model, the method comprising:
acquiring a basic flow control model obtained by pre-training a plurality of pre-training sample sets;
for each offline period in the offline training, generating offline sample data of the next offline period based on the offline sample data of the current offline period and the offline flow control prediction data output by the basic flow control model; the offline sample data comprises offline coding data and offline communication state data;
Determining an offline accumulated rewards reference value of each offline period according to the offline sample data of each offline period;
performing offline period rolling training on the basic flow control model based on offline sample data and offline accumulated rewards reference values corresponding to each offline period until the offline training stopping condition is reached, and obtaining an intermediate flow control model;
for each online period in online training, determining an online accumulated rewards reference value of the current online period through online sample data of the current online period;
and performing online period rolling training on the intermediate flow control model based on online sample data and online accumulated rewards reference values corresponding to a plurality of online periods respectively until the online training stopping condition is reached, and obtaining a target flow control model suitable for flow control data prediction in the multimedia communication process.
2. The method of claim 1, wherein the basic flow control model comprises a behavior prediction network and a behavior evaluation network, the behavior prediction network and the behavior evaluation network sharing an encoding structure, the behavior prediction network further comprising a multi-head classification structure, and the behavior evaluation network further comprising a single-task processing structure; the encoding structure comprises at least one self-attention module connected in sequence, and each self-attention module comprises a multi-head attention layer and a forward processing layer.
3. The method of claim 1, wherein the basic flow control model is obtained by pre-training, the step of pre-training comprising:
historical communication state data and historical coding data corresponding to the same historical period in the historical report data are formed into historical sample data;
for the current history period, based on the history coding data of the next history period of the current history period, determining the history flow control reference data corresponding to the current history period, and determining a history accumulated rewards reference value corresponding to the current history period according to the history sample data of the current history period;
taking the historical sample data, the historical flow control reference data and the historical accumulated rewards reference value corresponding to the same historical period as a group of pre-training sample groups;
and pre-training the initial flow control model to be trained according to a plurality of pre-training sample groups until reaching a pre-training stopping condition, and obtaining a basic flow control model.
4. A method according to claim 3, wherein the pre-training the initial flow control model to be trained according to the plurality of pre-training sample sets until reaching a pre-training stop condition, obtaining the basic flow control model, comprises:
Based on a behavior prediction network in an initial flow control model to be trained, processing historical sample data in a pre-training sample group and outputting historical flow control prediction data;
determining a first cross entropy loss according to the historical flow control prediction data and the historical flow control reference data corresponding to each pre-training sample group;
based on a behavior evaluation network in an initial flow control model to be trained, processing historical sample data in the pre-training sample group, and outputting a historical accumulated rewards predicted value;
determining a first reward loss based on the difference between the historical accumulated rewards predicted value and a corresponding historical accumulated rewards reference value;
constructing a pre-training loss function based on the first cross entropy loss and the first reward loss;
and pre-training the initial flow control model through the pre-training loss function until reaching a pre-training stopping condition, and obtaining the basic flow control model.
5. The method of claim 1, wherein the offline sample data includes a coding code rate and a blocking rate, and wherein determining the offline accumulated rewards reference value of each offline period based on the offline sample data of each offline period comprises:
For a current offline period, determining the definition of the current offline period according to the coding code rate of the current offline period;
determining the fluency of the current offline period according to the blocking rate of the current offline period;
determining the smoothness of the current offline period according to the coding rate of the current offline period and the coding rate of the next offline period of the current offline period;
calculating an offline rewarding reference value of the current offline period according to the definition, fluency, smoothness, video quality-free reference score and audio quality-free reference score of the current offline period;
an offline jackpot reference value for the current offline period is determined based on the offline jackpot reference value for the next offline period and the offline jackpot reference value for the current offline period.
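The last step of claim 5 is a backward recursion: each period's cumulative reward reference is its own reward plus the next period's cumulative reward. A small sketch; the discount factor is an assumption, since the claim does not state how the next period's value is weighted.

```python
def offline_returns(rewards, gamma=0.95):
    """Claim 5's recursion: the cumulative reward of period t equals its own
    reward plus a (possibly discounted) cumulative reward of period t+1."""
    returns, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```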
6. The method of claim 5, wherein calculating the offline reward reference value of the current offline period according to the clarity, fluency, smoothness, no-reference video quality score, and no-reference audio quality score of the current offline period comprises:
normalizing the clarity, fluency, smoothness, no-reference video quality score, and no-reference audio quality score of the current offline period respectively;
and weighting the normalization results to obtain the offline reward reference value of the current offline period.
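Claim 6's per-period reward is thus a weighted sum of normalized quality terms. A sketch in which the normalization ranges and the weight vector are illustrative assumptions:

```python
def offline_reward(clarity, fluency, smoothness, video_nr, audio_nr,
                   weights=(0.3, 0.3, 0.2, 0.1, 0.1),
                   ranges=((0, 1), (0, 1), (0, 1), (0, 100), (0, 100))):
    """Claim 6: normalize clarity, fluency, smoothness, and the no-reference
    video/audio quality scores, then take a weighted sum of the results."""
    values = (clarity, fluency, smoothness, video_nr, audio_nr)
    normed = ((v - lo) / (hi - lo) for v, (lo, hi) in zip(values, ranges))
    return sum(w * n for w, n in zip(weights, normed))
```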
7. The method according to claim 1, wherein the basic flow control model comprises a behavior prediction network and a behavior evaluation network, and performing offline period rolling training on the basic flow control model based on the offline sample data and offline cumulative reward reference values corresponding to the plurality of offline periods, until the offline training stop condition is reached, to obtain the intermediate flow control model, comprises:
processing the offline sample data of a current offline period through the behavior prediction network of the current offline period to obtain offline flow control prediction data of the current offline period;
determining a second cross entropy loss based on the offline flow control prediction data of the current offline period, and determining a first offline objective function according to the second cross entropy loss;
processing the offline sample data of the current offline period through the behavior evaluation network of the current offline period to obtain an offline cumulative reward predicted value of the current offline period;
determining a second reward loss based on the offline cumulative reward predicted value and the offline cumulative reward reference value of the current offline period, and constructing a second offline objective function based on the second reward loss;
and training the behavior prediction network through the first offline objective function and the behavior evaluation network through the second offline objective function until the offline training stop condition is reached, to obtain the intermediate flow control model.
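One offline-period update under claim 7 can be sketched as follows; summing the two objectives into a single optimizer step, and the MSE form of the reward loss, are assumptions (the claim only says each network is trained through its own objective), and the sketch omits the corrected reinforcement term that claim 8 adds to the first objective.

```python
import torch.nn.functional as F

def offline_rolling_step(model, batch, optimizer):
    """Claim 7: first offline objective (cross entropy, actor) and second
    offline objective (reward loss, critic) for one offline period."""
    logits_list, value_pred = model(batch['samples'])
    actor_loss = sum(F.cross_entropy(logits, batch['flow_refs'][:, i])
                     for i, logits in enumerate(logits_list))
    critic_loss = F.mse_loss(value_pred.squeeze(-1), batch['return_refs'])
    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
```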
8. The method of claim 7, wherein the method further comprises:
processing the offline sample data of the current offline period through the behavior prediction network of the previous round to obtain offline flow control experience data of the current offline period;
determining a contrast loss based on the offline flow control experience data corresponding to the current offline period, the probability distribution of the offline flow control experience data, and the probability distribution of the offline flow control prediction data;
determining a reinforcement loss according to the contrast loss and an advantage function value, the advantage function value being the difference between the offline cumulative reward reference value and the offline cumulative reward predicted value of each offline period;
and correcting the reinforcement loss based on a positive-negative reward mechanism to obtain a corrected reinforcement loss;
wherein determining the first offline objective function according to the second cross entropy loss comprises:
constructing the first offline objective function according to the second cross entropy loss and the corrected reinforcement loss.
9. The method of claim 8, wherein determining the reinforcement loss according to the contrast loss and the advantage function value comprises:
calculating a first product between the contrast loss and the advantage function value;
and calculating a second product between a clipped value of the contrast loss and the advantage function value, and selecting the smaller of the first product and the second product as the reinforcement loss, the clipped value being the value obtained by limiting the contrast loss to a preset interval.
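Claims 8 and 9 read like a PPO-style clipped surrogate: the contrast loss plays the role of a probability ratio between the previous round's policy and the current one, the advantage is the gap between the cumulative reward reference and prediction, and the reinforcement loss is the smaller of the raw and clipped products. That PPO reading, and the clipping radius, are interpretive assumptions.

```python
import torch

def reinforcement_loss(logp_new, logp_old, return_ref, value_pred, eps=0.2):
    """Claims 8-9: min(first product, second product), where the second
    product uses the contrast loss clipped to the preset interval."""
    ratio = torch.exp(logp_new - logp_old)            # contrast loss as a ratio
    advantage = return_ref - value_pred.detach()      # advantage function value
    first = ratio * advantage                         # first product
    second = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return torch.min(first, second)
```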
10. The method of claim 8, wherein correcting the reinforcement loss based on the positive-negative reward mechanism to obtain the corrected reinforcement loss comprises:
if the current reinforcement loss is negative, taking the larger of the current reinforcement loss and a preset multiple of the second reward loss as the corrected reinforcement loss;
and if the current reinforcement loss is non-negative, keeping the current reinforcement loss unchanged.
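Claim 10's positive-negative correction, continuing the sketch above; the preset multiple of 3.0 is an assumed value, not one given by the patent.

```python
import torch

def correct_reinforcement_loss(loss, second_reward_loss, multiple=3.0):
    """Claim 10: where the reinforcement loss is negative, replace it with
    the larger of itself and a preset multiple of the second reward loss;
    non-negative entries are kept unchanged. `second_reward_loss` is assumed
    to be a scalar tensor so the elementwise maximum broadcasts."""
    floor = multiple * second_reward_loss
    return torch.where(loss < 0, torch.maximum(loss, floor), loss)
```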
11. The method according to any one of claims 1 to 10, wherein generating the offline sample data of the next offline period based on the offline sample data of the current offline period and the offline flow control prediction data output by the basic flow control model is implemented by a generator in an environment simulator, the training of the environment simulator comprising:
forming communication state data and coding data corresponding to the same training period in data collected online into environment sample data;
for a current training period, determining environment flow control data corresponding to the current training period based on the coding data of the next training period;
processing the environment sample data and environment flow control data of the current training period through the generator in the environment simulator to be trained to generate environment prediction data of the next training period;
determining an environment loss according to the difference between the environment prediction data and the environment sample data of each training period;
determining a generator loss according to how realistically the generator generates the environment prediction data of each training period, and determining a discriminator loss according to the results of a discriminator in the environment simulator to be trained discriminating the environment prediction data and the environment sample data of each training period respectively;
and constructing a target loss function of the environment simulator to be trained based on the environment loss, the generator loss, and the discriminator loss, and performing iterative adversarial training on the environment simulator based on the target loss function until a simulation training stop condition is reached, to obtain the trained environment simulator.
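Claim 11 trains the environment simulator as a conditional GAN with three loss terms. A hedged sketch follows; the network interfaces, the MSE form of the environment loss, and the binary cross entropy adversarial losses are assumptions about details the claim leaves open.

```python
import torch
import torch.nn.functional as F

def simulator_losses(generator, discriminator, env_samples, flow_data, next_samples):
    """Claim 11: environment loss + generator loss + discriminator loss."""
    # Generator rolls the environment forward one training period.
    pred_next = generator(env_samples, flow_data)
    # Environment loss: difference between predicted and real next-period data.
    env_loss = F.mse_loss(pred_next, next_samples)
    # Generator loss: how convincingly the prediction passes the discriminator.
    d_fake_for_g = discriminator(pred_next)
    gen_loss = F.binary_cross_entropy_with_logits(
        d_fake_for_g, torch.ones_like(d_fake_for_g))
    # Discriminator loss: real environment samples vs. generated predictions.
    d_real = discriminator(next_samples)
    d_fake = discriminator(pred_next.detach())
    disc_loss = (
        F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
        F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    return env_loss, gen_loss, disc_loss
```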
12. A flow control model training apparatus, the apparatus comprising:
an acquisition module, configured to acquire a basic flow control model obtained by pre-training with a plurality of pre-training sample groups;
a generating module, configured to generate, for each offline period in offline training, offline sample data of the next offline period based on the offline sample data of the current offline period and the offline flow control prediction data output by the basic flow control model, the offline sample data comprising offline coding data and offline communication state data;
a first determining module, configured to determine an offline cumulative reward reference value of each offline period according to the offline sample data of each offline period;
an offline training module, configured to perform offline period rolling training on the basic flow control model based on the offline sample data and offline cumulative reward reference values corresponding to a plurality of offline periods, until an offline training stop condition is reached, to obtain an intermediate flow control model;
a second determining module, configured to determine, for each online period in online training, an online cumulative reward reference value of the current online period from the online sample data of the current online period;
and an online training module, configured to perform online period rolling training on the intermediate flow control model based on the online sample data and online cumulative reward reference values corresponding to a plurality of online periods, until an online training stop condition is reached, to obtain a target flow control model suitable for predicting flow control data during multimedia communication.
13. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 11 when the computer program is executed.
14. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method of any one of claims 1 to 11.
15. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210265930.4A CN115996292A (en) | 2021-10-18 | 2021-10-18 | Method, device, computer equipment and storage medium for training flow control model |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111211909.8A CN113660488B (en) | 2021-10-18 | 2021-10-18 | Method and device for carrying out flow control on multimedia data and training flow control model |
CN202210265930.4A CN115996292A (en) | 2021-10-18 | 2021-10-18 | Method, device, computer equipment and storage medium for training flow control model |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111211909.8A Division CN113660488B (en) | 2021-10-18 | 2021-10-18 | Method and device for carrying out flow control on multimedia data and training flow control model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115996292A (en) | 2023-04-21
Family
ID=78484214
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111211909.8A Active CN113660488B (en) | 2021-10-18 | 2021-10-18 | Method and device for carrying out flow control on multimedia data and training flow control model |
CN202210265930.4A Pending CN115996292A (en) | 2021-10-18 | 2021-10-18 | Method, device, computer equipment and storage medium for training flow control model |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111211909.8A Active CN113660488B (en) | 2021-10-18 | 2021-10-18 | Method and device for carrying out flow control on multimedia data and training flow control model |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN113660488B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114372414B (en) * | 2022-01-06 | 2024-07-09 | 腾讯科技(深圳)有限公司 | Multi-mode model construction method and device and computer equipment |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103888846B * | 2014-03-04 | 2017-02-08 | Zhejiang University | Wireless video streaming service self-adaption rate control method based on QoE |
WO2018132964A1 (en) * | 2017-01-18 | 2018-07-26 | SZ DJI Technology Co., Ltd. | Method and apparatus for transmitting coded data, computer system, and mobile device |
US10721475B2 (en) * | 2017-09-01 | 2020-07-21 | Ittiam Systems (P) Ltd. | K-nearest neighbor model-based content adaptive encoding parameters determination |
CN108093257A (en) * | 2017-12-05 | 2018-05-29 | Beijing Xiaomi Mobile Software Co., Ltd. | Bit rate control method, electronic equipment and the storage medium of Video coding |
CN109981225B (en) * | 2019-04-12 | 2022-02-11 | Guangzhou Shiyuan Electronics Co., Ltd. | Code rate estimation method, device, equipment and storage medium |
CN110324621B (en) * | 2019-07-04 | 2021-05-18 | Beijing Dajia Internet Information Technology Co., Ltd. | Video encoding method, video encoding device, electronic equipment and storage medium |
JP7491676B2 (en) * | 2019-09-30 | 2024-05-28 | Sony Interactive Entertainment Inc. | Image data transfer device and image compression method |
BR112022007211A2 (en) * | 2019-11-14 | 2022-07-05 | Intel Corp | EQUIPMENT, METHOD AND SYSTEM FOR ADAPTIVE CODING OF VIDEO FRAMES USING CONTENT AND NETWORK ANALYSIS |
CN111277864B (en) * | 2020-02-18 | 2021-09-10 | Beijing Dajia Internet Information Technology Co., Ltd. | Encoding method and device of live data, streaming system and electronic equipment |
CN113422751B (en) * | 2020-08-27 | 2023-12-05 | Alibaba Group Holding Ltd. | Streaming media processing method and device based on online reinforcement learning and electronic equipment |
CN112953922B (en) * | 2021-02-03 | 2022-09-16 | Xidian University | Self-adaptive streaming media control method, system, computer equipment and application |
CN113014968B (en) * | 2021-02-24 | 2022-02-08 | Nanjing University | Multi-user dynamic code rate video transmission method and system based on reinforcement learning |
CN113473125B (en) * | 2021-06-25 | 2023-08-15 | MIGU Interactive Entertainment Co., Ltd. | Code rate control method, equipment, storage medium and product |
2021
- 2021-10-18: CN202111211909.8A filed in CN, granted as CN113660488B (Active)
- 2021-10-18: CN202210265930.4A filed in CN, published as CN115996292A (Pending)
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116562156A (en) * | 2023-05-15 | 2023-08-08 | 南栖仙策(南京)高新技术有限公司 | Training method, device, equipment and storage medium for control decision model |
CN116562156B (en) * | 2023-05-15 | 2024-02-06 | 南栖仙策(南京)高新技术有限公司 | Training method, device, equipment and storage medium for control decision model |
Also Published As
Publication number | Publication date |
---|---|
CN113660488A (en) | 2021-11-16 |
CN113660488B (en) | 2022-02-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 
 | REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40083969; Country of ref document: HK