CN114640653A - Streaming media distribution system and method in video conference - Google Patents

Streaming media distribution system and method in video conference

Info

Publication number
CN114640653A
CN114640653A (application CN202210206909.7A)
Authority
CN
China
Prior art keywords
time period
sfu
video
video quality
media stream
Prior art date
Legal status
Pending
Application number
CN202210206909.7A
Other languages
Chinese (zh)
Inventor
廖建新
高涵
张涛
石峰
Current Assignee
EB INFORMATION TECHNOLOGY Ltd
Original Assignee
EB INFORMATION TECHNOLOGY Ltd
Priority date
Filing date
Publication date
Application filed by EB INFORMATION TECHNOLOGY Ltd filed Critical EB INFORMATION TECHNOLOGY Ltd
Priority to CN202210206909.7A
Publication of CN114640653A

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 65/00: Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L 65/10: Architectures or entities
    • H04L 65/1059: End-user terminal functionalities specially adapted for real-time communication
    • H04L 65/1063: Application servers providing network services
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00: Television systems
    • H04N 7/14: Systems for two-way working
    • H04N 7/15: Conference systems

Landscapes

  • Engineering & Computer Science
  • Multimedia
  • Signal Processing
  • Computer Networks & Wireless Communication
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like

Abstract

A streaming media distribution system and method for video conferencing, comprising a streaming media server, a service server, and a plurality of user terminals, with an MCU device and an SFU device deployed on the streaming media server: the SFU device forwards the media stream sent by each conference user terminal to the other conference user terminals and to the MCU device, and forwards the mixed media stream produced by the MCU device to all onlooker user terminals; the MCU device synthesizes the media streams forwarded by the SFU device into one mixed media stream and returns it to the SFU device; when its user is a conference user, a user terminal captures its own media stream, sends it to the SFU device, and then locally mixes and plays the streams of the other conference user terminals forwarded by the SFU device; when its user is an onlooker user, it receives and plays the mixed media stream from the SFU device. The invention belongs to the field of information technology and combines the SFU and MCU structures to build a high-performance media topology for real audio and video conference scenarios.

Description

Streaming media distribution system and method in video conference
Technical Field
The invention relates to a streaming media distribution system and a streaming media distribution method in a video conference, and belongs to the technical field of information.
Background
With the spread of the mobile internet, demand for real-time video communication from enterprises and individuals keeps growing, and telecommuting and remote business activities have become popular. Video conferencing is highly interactive, accelerates information exchange, and overcomes geographic distance, making it an important real-time video communication scenario. Because network conditions are complex and unstable, delivering high-quality, low-delay video streams is a major challenge in video conferencing.
The QoS (Quality of Service) of a video conference system depends to a great extent on the design and implementation of the server. A well-designed streaming media distribution system can use network and computing resources sensibly, provide each user with a high-quality, low-delay media stream, and achieve a cost-effective distribution strategy. Three communication architectures are currently mainstream in multi-user video conference scenarios: Mesh, SFU, and MCU:
Mesh: each party establishes a bi-directional connection with every other party, sending data to and receiving data from all other users, which forms a mesh. The structure has no central node, is simple to implement, and saves server resources, but it has an obvious drawback: in a conference of N users, each user must upload N-1 copies of its data and download N-1 streams. The bandwidth cost for users is very high, so only small conferences can be supported.
MCU (Multipoint Control Unit): under this architecture a Multipoint Control Unit acts as the central node. Each user only sends its own data to the MCU server and retrieves a single fused stream from it. In a conference of N users, each user uploads one stream and downloads one stream, and the server only serves N downstream streams, which greatly reduces the bandwidth requirements of both users and server. Using codec technology, the MCU can also provide heterogeneous media streams and satisfy more users. However, this scheme requires the MCU server to decode, mix, and re-encode all media data, which is computationally intensive, places high demands on server performance, and makes a low-delay experience difficult to achieve.
SFU (Selective Forwarding Unit): under this architecture the SFU acts like a media stream router, receiving the audio and video streams of each terminal and forwarding them to the other terminals as required. Each user only sends its own data to the SFU server and pulls everyone else's data from it; in a conference of N users, each user uploads 1 stream and downloads N-1 streams, which keeps user bandwidth low, but the server's downstream bandwidth grows from N streams in the MCU mode to N(N-1) streams, a large increase. Because the SFU does not process the media streams and forwards them directly, the multiple videos a participant watches may be out of sync, and different participants may see inconsistent pictures of the same stream. On the other hand, the structure consumes few server resources, and direct forwarding reduces delay and improves real-time performance.
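To make the bandwidth trade-offs of the three architectures above concrete, the stream counts they imply can be tabulated with a few lines of arithmetic (an illustrative sketch, assuming each of the N users publishes one media stream):

    def mesh(n):
        # No central node: everyone uploads to and downloads from everyone else.
        return {"up/user": n - 1, "down/user": n - 1, "server_down": 0}

    def mcu(n):
        # Central mixer: one upload and one mixed download per user;
        # the server serves N mixed streams.
        return {"up/user": 1, "down/user": 1, "server_down": n}

    def sfu(n):
        # Central forwarder: one upload per user, N-1 downloads per user;
        # the server's downstream grows to N*(N-1) paths.
        return {"up/user": 1, "down/user": n - 1, "server_down": n * (n - 1)}

    for n in (4, 10, 50):
        for arch in (mesh, mcu, sfu):
            print(f"N={n:<3} {arch.__name__:<4} {arch(n)}")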
Comparing the three structures, the SFU and MCU structures, which have a central node, exert stronger control over the whole conference and are better suited to regulating video quality, but each has trade-offs: the SFU structure has low delay, yet with many participants its server bandwidth consumption is large and uneconomical; the MCU structure consumes little bandwidth and allows more quality control, but its computation load is heavy.
Therefore, how to combine the SFU and MCU structures for real audio and video conference scenarios into a high-performance media topology that delivers higher-quality media streams to users has become one of the technical problems to be solved.
Disclosure of Invention
In view of this, an object of the present invention is to provide a streaming media distribution system and method for video conferencing that combines the SFU and MCU structures, based on real audio and video conference scenarios, into a high-performance media topology that provides users with higher-quality media streams.
To achieve the above object, the present invention provides a streaming media distribution system in a video conference, including a streaming media server, a service server, and a plurality of user terminals, the streaming media server being deployed with an MCU device and an SFU device. When a video conference is created, its participants comprise conference users and onlooker users: a user who is sharing video is a conference user, and a user who is not sharing video is an onlooker user. The terminal used by a conference user is called a conference user terminal for short, and the terminal used by an onlooker user is called an onlooker user terminal for short, wherein:
the SFU device forwards the media stream sent by each conference user terminal in the video conference to all other conference user terminals and to the MCU device, and forwards the mixed media stream sent by the MCU device to all onlooker user terminals in the video conference;
the MCU device decodes, mixes, and re-encodes all the media streams sent by the SFU device to synthesize one mixed media stream, and then sends the synthesized mixed media stream to the SFU device;
when its user is a conference user, a user terminal captures its own media stream and sends it to the SFU device, then decodes and mixes the media streams of all other conference user terminals forwarded by the SFU device into one mixed media stream and plays it locally; when its user is an onlooker user, it receives and plays the mixed media stream sent by the SFU device.
To achieve the above object, the present invention further provides a streaming media distribution method in a video conference, involving a streaming media server, a service server, and a plurality of user terminals, the streaming media server being deployed with an MCU device and an SFU device. When a video conference is created, its participants comprise conference users (users who are sharing video) and onlooker users (users who are not), whose terminals are called conference user terminals and onlooker user terminals for short. The method comprises:
step one, each conference user terminal in the video conference captures its own media stream and sends it to the SFU device;
step two, the SFU device forwards the media stream sent by each conference user terminal to all other conference user terminals in the video conference and to the MCU device;
step three, each conference user terminal and the MCU device each decode and mix all the media streams sent by the SFU device into one mixed media stream; each conference user terminal then plays its own mix locally, while the MCU device sends its synthesized mixed media stream back to the SFU device;
step four, the SFU device forwards the mixed media stream sent by the MCU device to all onlooker user terminals in the video conference, and the onlooker user terminals receive and play it.
Compared with the prior art, the invention has the following beneficial effects. The invention combines the MCU and SFU streaming topologies into a new high-performance streaming topology, which both reduces the mixing computation of the MCU mode and reduces the heavy bandwidth consumption of the SFU mode. It simultaneously meets the low-delay requirement of conference users (those sharing media streams and communicating) and the heterogeneous video quality requirements of onlooker users (those with microphone and video sharing turned off). That is, the media topology is optimized across computation load, bandwidth consumption, low delay, and controllable video quality, saving bandwidth, reducing delay and computation, and satisfying different users, which makes it particularly suitable for large video conferences. Moreover, because networks and video content are complex and diverse, heuristic methods generally cannot handle every combination of scenarios and emergencies; the invention therefore adopts deep reinforcement learning to select the future video bitrate from the observed network state and the past video frame content. Unlike most bitrate-adaptive methods, which consult only the network situation and simply aim for a high bitrate, the decision here takes both the network state and the video content into account.
Drawings
Fig. 1 is a schematic structural diagram of a streaming media distribution system in a video conference according to the present invention.
Fig. 2 is a schematic diagram of the structure of a video conference embodiment to which the present invention is applied.
Fig. 3 is a flowchart of a streaming media distribution method in a video conference according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the accompanying drawings.
As shown in fig. 1, the streaming media distribution system in a video conference according to the present invention includes a streaming media server, a service server, and a plurality of user terminals, with an MCU device and an SFU device deployed on the streaming media server. When a video conference is created, its participants comprise conference users (users who are sharing video) and onlooker users (users who are not), whose terminals are called conference user terminals and onlooker user terminals for short, wherein:
the service server is used for processing the service information of the video conference;
the SFU device forwards the media stream sent by each conference user terminal in the video conference to all other conference user terminals and to the MCU device, and forwards the mixed media stream sent by the MCU device to all onlooker user terminals in the video conference;
the MCU device decodes, mixes, and re-encodes all the media streams sent by the SFU device to synthesize one mixed media stream, and then sends the synthesized mixed media stream to the SFU device;
when its user is a conference user, a user terminal captures its own media stream and sends it to the SFU device, then decodes and mixes the media streams of all other conference user terminals forwarded by the SFU device into one mixed media stream and plays it locally; when its user is an onlooker user, it receives and plays the mixed media stream sent by the SFU device.
In a real audio and video conference, most participants keep their microphones and cameras off; only the few who are speaking or preparing to speak turn them on. In other words, the needs of conference users and onlooker users differ slightly: conference users talk to each other and care more about real-time performance, while onlooker users care more about the quality of the media they receive. To address these two sets of needs, the SFU device forwards media streams without processing them, and the MCU device processes media streams and fuses several streams into one. Each user terminal establishes a WebSocket connection with the service server for signaling, including user login and registration and conference room management, all carried as WebSocket messages. RTP/RTCP connections between the user terminals and the SFU device carry the audio and video streams, and the SFU device and the MCU device likewise communicate over an RTP/RTCP connection. The service server, the SFU device, and the MCU device exchange service messages over HTTP.
In the RTP/RTCP connections established between the MCU device and the SFU device, and between each user terminal and the SFU device, two adjacent ports may be used to listen for the RTP media stream and the RTCP control information stream respectively. For example, if the SFU device listens for a user terminal on ports 52304 and 52305, port 52304 receives the RTP media stream from the user terminal and port 52305 receives its RTCP control information stream; the control information can carry the network conditions between the SFU device and the user terminal, such as real-time bandwidth, packet loss rate, and delay.
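A minimal sketch of this adjacent-port convention follows, using the example ports 52304/52305; a real SFU would sit behind a full WebRTC/RTP stack rather than raw sockets:

    import socket

    RTP_PORT, RTCP_PORT = 52304, 52305   # adjacent ports from the example above

    # Port 52304: RTP media stream from the user terminal.
    rtp_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rtp_sock.bind(("0.0.0.0", RTP_PORT))

    # Port 52305: RTCP control stream (reports carrying real-time bandwidth,
    # packet loss rate, and delay information).
    rtcp_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rtcp_sock.bind(("0.0.0.0", RTCP_PORT))

    media_packet, peer = rtp_sock.recvfrom(2048)    # raw RTP packet
    report_packet, _ = rtcp_sock.recvfrom(2048)     # raw RTCP packet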
Fig. 2 shows an embodiment of a video conference to which the invention is applied. The conference user terminals are C1-C3 and the onlooker user terminals are C4-C6. C1, C2, and C3 send their captured video streams S1, S2, and S3 to the SFU device, which forwards them both to the MCU device for mixing and to the other conference user terminals (for example, S2 and S3 to C1); the stream S123 mixed by the MCU device is then forwarded by the SFU device to all onlooker user terminals C4-C6. In this way each onlooker terminal receives the single MCU-mixed video S123 of all conference users, while each conference user receives all the other conference users' streams through the SFU device and mixes them locally, obtaining a view of the other conference users without its own video. In the six-person scenario of fig. 2 (three conference users and three onlookers), compared with a pure SFU structure the downstream flow of the SFU device falls from 15 paths to 9, cutting server bandwidth by 40%; compared with a pure MCU structure, the number of mixes performed by the MCU device falls from 4 to 1, cutting computation by 75%. Conference users obtain their video streams directly, without processing, which keeps delay low, while the MCU device makes the quality of the mixed stream delivered to each onlooker easier to control. The media topology of the invention is thus optimized in computation load, bandwidth consumption, delay, and video quality control.
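The 40% and 75% figures can be re-derived from the topology; a quick check (illustrative arithmetic, not patent text):

    # Six-user example: 3 conference users C1-C3 sharing video, 3 onlookers C4-C6.
    n_conf, n_onlook = 3, 3

    # Pure SFU: every terminal downloads every conference stream except its own.
    pure_sfu_paths = n_conf * (n_conf - 1) + n_onlook * n_conf      # 6 + 9 = 15
    # Hybrid: conference users download peer streams; onlookers get one mix S123.
    hybrid_paths = n_conf * (n_conf - 1) + n_onlook * 1             # 6 + 3 = 9
    print(f"server downstream: {pure_sfu_paths} -> {hybrid_paths} "
          f"({1 - hybrid_paths / pure_sfu_paths:.0%} saved)")       # 40% saved

    # Pure MCU: one mix per conference user (excluding their own stream) plus
    # one full mix for the onlookers = 4 mixes; the hybrid needs only 1.
    pure_mcu_mixes, hybrid_mixes = n_conf + 1, 1
    print(f"MCU mixes: {pure_mcu_mixes} -> {hybrid_mixes} "
          f"({1 - hybrid_mixes / pure_mcu_mixes:.0%} saved)")       # 75% saved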
The invention can also deploy several streaming media servers simultaneously for distributed, load-balanced operation. Each streaming media server carries one SFU device and one matching MCU device, i.e. the SFU and MCU devices are deployed with the same number of nodes; when a video conference is created, an SFU device node is assigned by load balancing, and the MCU device node matched with it is then determined.
When all users in a video conference are conference users there are no onlookers, so a pure SFU mode or a pure MCU mode may be adopted: in pure SFU mode the MCU device stops its mixing computation and every user terminal receives the media streams of all other conference users; in pure MCU mode each user terminal receives a mix that excludes its own media stream, which means the MCU device produces a separate mix for every user. The invention can include an automatic mode adjustment mechanism that selects the pure SFU or pure MCU mode according to the bandwidth occupancy of the SFU device and the computation occupancy of the MCU device; the selection is invisible to users but effectively optimizes the server's computation and bandwidth costs, improves picture quality, and reduces delay. To this end, the service server further comprises:
a mode control device, which sends notification messages to the SFU device and the MCU device when it detects that all users in the video conference are conference users, and then computes a mode adjustment parameter from the average bandwidth occupancy b and the average CPU occupancy c returned by the SFU device and the MCU device:
[Mode adjustment formula given as an equation image in the original, combining the occupancy rates b and c and the conference size n]
where n is the number of user terminals in the current video conference. The SFU device is then notified accordingly: when the mode adjustment parameter is greater than 1, bandwidth consumption is high and bandwidth is the scarce resource, so the SFU device is told to adopt the MCU mode; when the parameter is less than 1, CPU consumption is high and computation is the scarce resource, so the SFU device is told to adopt the SFU mode,
the SFU device further comprises:
a mode adjustment unit, which on receiving the notification message from the service server computes the average bandwidth occupancy b of its own process over a period of time, returns it to the service server, and then executes the mode indicated by the service server's notification: in MCU mode it forwards each received user terminal stream to the MCU device and then sends the mixed media stream produced by the MCU device to all user terminals in the video conference; in SFU mode it forwards each received user terminal stream to the other user terminals in the video conference,
and the MCU device likewise computes the average CPU occupancy c of its own process over a period of time according to the notification message sent by the service server.
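The switching logic can be sketched as follows. The patent's actual mode adjustment formula survives only as an equation image, so the ratio below is an assumed stand-in; only the greater-than-1 / less-than-1 switch follows the text:

    def mode_adjustment(b: float, c: float, n: int) -> float:
        """ASSUMED formula (the original equation is an image in the source).
        b: average bandwidth occupancy of the SFU process, in [0, 1]
        c: average CPU occupancy of the MCU process, in [0, 1]
        n: number of user terminals in the conference"""
        return (b * n) / (c + 1e-6)      # placeholder form, not the patent's

    def select_mode(b: float, c: float, n: int) -> str:
        # > 1: bandwidth is scarce -> mix once centrally (MCU mode).
        # < 1: CPU is scarce -> forward without mixing (SFU mode).
        return "MCU" if mode_adjustment(b, c, n) > 1 else "SFU"

    # e.g. select_mode(b=0.8, c=0.3, n=12) -> "MCU"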
Adaptive bitrate automatically adapts to the capacity and performance of the network by dynamically adjusting the bitrate of a video stream, giving end users the best possible video quality, and has become a popular way to deliver high-quality video. Current adaptive bitrate methods fall into four categories: client-buffer based, network-throughput based, model based, and reinforcement-learning based. Their common strategy is to pick as high a bitrate as network conditions allow. But bitrate has a marginal diminishing effect: as the bitrate increases, the rate at which video quality improves falls, and better coding technology cannot remove this effect. Always choosing the highest possible bitrate therefore wastes network and encoding resources and ignores the influence of the video content itself on perceived quality: if a video consists of dark scenes or few objects, a lower bitrate can also deliver satisfactory video while saving bandwidth. For these reasons the invention can also use deep reinforcement learning to select the future bitrate, choosing the next video bitrate from the observed network state and the past video frame content, so that the bitrate adapts dynamically to the video content and the unnecessary bandwidth cost caused by the marginal diminishing effect is reduced while quality improves. Feeding raw pictures directly into the state would blow up the state space ("state explosion"); to overcome this, the invention splits the reinforcement learning model into two practical models: a deep-learning video quality prediction model and a reinforcement-learning bitrate decision model, the first of which helps the second learn the relationship between bitrate and video quality for different video content. Thus, the user terminal or the SFU device may further include:
a video quality prediction model construction unit, which builds and trains a video quality prediction model that predicts the average video quality of the next time period from the video frames of each time period. Its input is the set of video frames within one time period, and its output is the predicted average video quality score for each candidate bitrate in the next time period. The model consists of a CNN (Convolutional Neural Network) that extracts spatial picture features and an RNN (Recurrent Neural Network) that extracts temporal features: the CNN outputs time-series data for the video frames, and the RNN's memory of past inputs lets the model predict future video quality effectively. The input video frames are processed as follows: the video frames are first fed to the CNN, which outputs video frame sequence features; these are then fed to the RNN, whose output is the model's output value. The CNN has 5 layers, in order: a convolution layer of 64 5x5 kernels and a 3x3 average pooling layer, another convolution layer of 64 3x3 kernels and a 3x3 average pooling layer, and finally a 2x2 max pooling layer. The RNN consists of 2 GRU (Gated Recurrent Unit) layers with 64 hidden units. The model's output is a vector; several candidate bitrates can be preset, e.g. 300, 500, 800, 1100, 1400, and each value in the output vector is the average video quality score predicted for one bitrate (a model sketch in code follows this list of units);
a bitrate decision model construction unit, which uses the A3C algorithm to build and train a bitrate decision model that selects the bitrate for the next time period from the state of each time period (network conditions, video content, video quality, and so on), providing the best video quality without wasting resources. Its input is the state space of one time period, S_t = (p_{t-1}, v_t, s_{t-1}, d_{t-1}, l_{t-1}), and its output is the predicted selection probability of each candidate bitrate for the next time period, where S_t is the state space of the t-th time period, p_{t-1} is the average video quality score of the video transmitted in the (t-1)-th time period (the period before the t-th), v_t is the predicted average video quality score for the t-th time period, s_{t-1} is the transmission rate of the video stream in the (t-1)-th time period, d_{t-1} is the delay gradient between sender and receiver in the (t-1)-th time period, and l_{t-1} is the packet loss rate of the video frame sequence in the (t-1)-th time period; p_{t-1} and v_t can be taken from the video quality prediction model's results for the (t-1)-th and t-th time periods, s_{t-1} from the bitrate decision model's prediction for the (t-1)-th time period, and d_{t-1} and l_{t-1} from the RTCP control information exchanged between sender and receiver;
and an adaptive bitrate unit, which, whenever its host device sends a media stream to another device (i.e. a conference user terminal sends to the SFU device, or the SFU device sends to each onlooker user terminal), selects the bitrate of the next time period for that sender-receiver pair using the trained video quality prediction and bitrate decision models: in each time period it feeds the video frames received by its host device (the conference user terminal or SFU device) into the video quality prediction model and stores the model's output for that period; it then takes the RTCP control information of sender and receiver from the previous period, the quality model's outputs for the previous and current periods, and the decision model's output for the previous period, assembles the state space of the period, and feeds it into the bitrate decision model; from the decision model's output it selects the bitrate with the highest probability and sends the media stream to the receiver at that bitrate in the next time period.
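A sketch of the quality prediction model in PyTorch follows. The patent text fixes the kernel and pooling sizes, channel count, GRU depth and width, and the example bitrate set used here; the input resolution, the projection into the GRU, and the sigmoid output normalization are assumptions added so the sketch runs end to end:

    import torch
    import torch.nn as nn

    BITRATES = [300, 500, 800, 1100, 1400]   # example candidate bitrates from the text

    class VideoQualityPredictor(nn.Module):
        """CNN + GRU quality predictor: per-frame spatial features from the CNN,
        temporal modeling by the GRU, one predicted quality score per bitrate."""

        def __init__(self, frame_hw=(72, 128)):           # assumed frame size
            super().__init__()
            self.cnn = nn.Sequential(                     # the 5 layers named above
                nn.Conv2d(3, 64, kernel_size=5),          # 64 conv kernels of 5x5
                nn.AvgPool2d(3),                          # 3x3 average pooling
                nn.Conv2d(64, 64, kernel_size=3),         # 64 conv kernels of 3x3
                nn.AvgPool2d(3),                          # 3x3 average pooling
                nn.MaxPool2d(2),                          # final 2x2 max pooling
            )
            with torch.no_grad():                         # infer flattened feature size
                feat = self.cnn(torch.zeros(1, 3, *frame_hw)).flatten(1).shape[1]
            self.proj = nn.Linear(feat, 64)               # assumed projection to GRU width
            self.rnn = nn.GRU(64, 64, num_layers=2, batch_first=True)  # 2 GRU layers
            self.head = nn.Linear(64, len(BITRATES))      # one score per bitrate

        def forward(self, frames):                        # frames: (batch, time, 3, H, W)
            b, t = frames.shape[:2]
            x = self.cnn(frames.flatten(0, 1)).flatten(1)  # per-frame features
            x = self.proj(x).view(b, t, 64)
            out, _ = self.rnn(x)
            return torch.sigmoid(self.head(out[:, -1]))    # scores normalized to [0, 1]

    # quality = VideoQualityPredictor()(torch.rand(1, 16, 3, 72, 128))  # shape (1, 5)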
As shown in fig. 3, the streaming media distribution method in a video conference according to the present invention involves a streaming media server, a service server, and a plurality of user terminals, with an MCU device and an SFU device deployed on the streaming media server. When a video conference is created, its participants comprise conference users (users who are sharing video) and onlooker users (users who are not), whose terminals are called conference user terminals and onlooker user terminals for short. The method comprises:
step one, each conference user terminal in the video conference captures its own media stream and sends it to the SFU device;
step two, the SFU device forwards the media stream sent by each conference user terminal to all other conference user terminals in the video conference and to the MCU device;
step three, each conference user terminal and the MCU device each decode and mix all the media streams sent by the SFU device into one mixed media stream; each conference user terminal then plays its own mix locally, while the MCU device sends its synthesized mixed media stream back to the SFU device;
step four, the SFU device forwards the mixed media stream sent by the MCU device to all onlooker user terminals in the video conference, and the onlooker user terminals receive and play it.
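Steps one through four amount to a simple routing rule inside the SFU; a minimal sketch (all names hypothetical, with a print standing in for the real RTP transport):

    from dataclasses import dataclass, field

    def send_to(dest: str, stream: str) -> None:
        print(f"-> {dest}: {stream}")      # placeholder for the real RTP transport

    @dataclass
    class SfuRouter:
        conference_users: set = field(default_factory=set)  # currently sharing video
        onlookers: set = field(default_factory=set)         # microphone/camera off

        def on_upstream(self, sender: str, stream: str) -> None:
            # Step two: fan a conference user's stream out to the other
            # conference users and hand a copy to the MCU for mixing.
            for peer in self.conference_users - {sender}:
                send_to(peer, stream)
            send_to("MCU", stream)

        def on_mixed(self, mixed_stream: str) -> None:
            # Step four: fan the MCU's single mixed stream out to the onlookers.
            for viewer in self.onlookers:
                send_to(viewer, mixed_stream)

    router = SfuRouter({"C1", "C2", "C3"}, {"C4", "C5", "C6"})
    router.on_upstream("C1", "S1")    # S1 -> C2, C3, MCU
    router.on_mixed("S123")           # S123 -> C4, C5, C6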
As stated above, the invention can also deploy several streaming media servers simultaneously for distributed, load-balanced operation, each carrying one SFU device and one matching MCU device (the same number of nodes of each); when a video conference is created, an SFU device node is assigned by load balancing, and the MCU device node matched with it is then determined.
When all users in a video conference are conference users there are no onlookers, so a pure SFU mode or a pure MCU mode may be adopted: in pure SFU mode the MCU device stops its mixing computation and every user terminal receives the media streams of all other conference users; in pure MCU mode each user terminal receives a mix that excludes its own media stream, which means the MCU device produces a separate mix for every user. The invention can include an automatic mode adjustment mechanism that selects the pure SFU or pure MCU mode according to the bandwidth occupancy of the SFU device and the computation occupancy of the MCU device; the selection is invisible to users but effectively optimizes the server's computation and bandwidth costs, improves picture quality, and reduces delay. The invention may therefore also comprise:
step A1, when the service server detects that all users in the video conference are conference users, it sends notification messages to the SFU device and the MCU device;
step A2, the SFU device and the MCU device respectively compute the average bandwidth occupancy b and the average CPU occupancy c of their own processes over a period of time and return them to the service server;
step A3, the service server computes the mode adjustment parameter:
[Mode adjustment formula given as an equation image in the original, combining the occupancy rates b and c and the conference size n]
where n is the number of user terminals in the current video conference. The SFU device is then notified accordingly: when the mode adjustment parameter is greater than 1, bandwidth consumption is high and bandwidth is the scarce resource, so the SFU device is told to adopt the MCU mode; when the parameter is less than 1, CPU consumption is high and computation is the scarce resource, so the SFU device is told to adopt the SFU mode;
step A4, the SFU device executes the mode indicated by the service server's notification: in MCU mode it forwards each received user terminal stream to the MCU device and then sends the mixed media stream produced by the MCU device to all user terminals in the video conference; in SFU mode it forwards each received user terminal stream to the other user terminals in the video conference.
The invention can also use deep reinforcement learning to select the future bitrate, choosing the next video bitrate from the observed network state and the past video frame content, so that the bitrate adapts dynamically to the video content and the unnecessary bandwidth cost caused by the marginal diminishing effect of bitrate is reduced while quality improves. The user terminal or the SFU device can further include a video quality prediction model and a bitrate decision model, and the invention may also comprise:
step A, build and train a video quality prediction model that predicts the average video quality of the next time period from the video frames of the current time period. Its input is the set of video frames within one time period, and its output is the predicted average video quality score for each candidate bitrate in the next time period. The model consists of a CNN that extracts spatial picture features and an RNN that extracts temporal features: the CNN outputs time-series data for the video frames, and the RNN's memory of past inputs lets the model predict future video quality effectively. The input video frames are processed as follows: the video frames are first fed to the CNN, which outputs video frame sequence features; these are then fed to the RNN, whose output is the model's output value. The CNN has 5 layers, in order: a convolution layer of 64 5x5 kernels and a 3x3 average pooling layer, another convolution layer of 64 3x3 kernels and a 3x3 average pooling layer, and finally a 2x2 max pooling layer. The RNN consists of 2 GRU layers with 64 hidden units. The model's output is a vector; several candidate bitrates can be preset, e.g. 300, 500, 800, 1100, 1400, and each value in the output vector is the average video quality score predicted for one bitrate;
step B, use the A3C algorithm to build and train a bitrate decision model that selects the bitrate for the next time period from the state of each time period, providing the best video quality without wasting resources. Its input is the state space of one time period, S_t = (p_{t-1}, v_t, s_{t-1}, d_{t-1}, l_{t-1}), and its output is the predicted selection probability of each candidate bitrate for the next time period, where S_t is the state space of the t-th time period, p_{t-1} is the average video quality score of the video transmitted in the (t-1)-th time period (the period before the t-th), v_t is the predicted average video quality score for the t-th time period, s_{t-1} is the transmission rate of the video stream in the (t-1)-th time period, d_{t-1} is the delay gradient between sender and receiver in the (t-1)-th time period, and l_{t-1} is the packet loss rate of the video frame sequence in the (t-1)-th time period; p_{t-1} and v_t can be taken from the video quality prediction model's results for the (t-1)-th and t-th time periods, s_{t-1} from the bitrate decision model's prediction for the (t-1)-th time period, and d_{t-1} and l_{t-1} from the RTCP control information exchanged between sender and receiver,
so that, in step one before a user terminal sends its media stream to the SFU device, or in step four before the SFU device sends the mixed media stream to each onlooker user terminal, the present invention further includes:
selecting the bitrate of the next time period for the sender and receiver of the media stream (i.e. the user terminal and the SFU device, or the SFU device and each onlooker user terminal) with the trained video quality prediction and bitrate decision models: in each time period, feed the video frames received by the host device (the conference user terminal or SFU device) into the video quality prediction model and store the model's output for that period; then take the RTCP control information of sender and receiver from the previous period, the quality model's outputs for the previous and current periods, and the decision model's output for the previous period, assemble the state space of the period, and feed it into the bitrate decision model; from the decision model's output, select the bitrate with the highest probability and send the media stream to the receiver at that bitrate in the next time period. A sketch of one such adaptation step follows.
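This sketch assumes the two models above are already trained; the helper names (rtcp_stats, history, the model objects) and the exact ordering of the state vector are hypothetical:

    import torch

    BITRATES = [300, 500, 800, 1100, 1400]   # example candidate set from the text

    def choose_next_bitrate(quality_model, policy_model, period_frames,
                            rtcp_stats, history):
        with torch.no_grad():
            # v_t: predicted per-bitrate quality for the current period's frames.
            v_t = quality_model(period_frames).flatten()          # shape (5,)

            # Assemble S_t = (p_{t-1}, v_t, s_{t-1}, d_{t-1}, l_{t-1}).
            state = torch.cat([
                torch.tensor([history["p_prev"],                  # quality sent last period
                              history["s_prev"]]),                # last chosen rate
                v_t,
                torch.tensor([rtcp_stats["delay_gradient"],       # d_{t-1} from RTCP
                              rtcp_stats["loss_rate"]]),          # l_{t-1} from RTCP
            ]).unsqueeze(0)

            probs = policy_model(state)                           # softmax over bitrates
        return BITRATES[int(probs.argmax())]                      # most probable rate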
To explain the implementation of the video quality prediction model and the bitrate decision model more clearly, the two models are described in further detail below:
(1) video quality prediction model
To help the bitrate decision model select a suitable coding bitrate for the next frames, the model must first "know" the relationship between bitrate and the corresponding video quality. This kind of prediction is challenging because perceived video quality is closely tied to the video itself: video type, brightness, and degree of motion all strongly affect the correlation between bitrate and video quality. Building on the effectiveness of neural networks at predicting time-series data, the video quality prediction model is designed to help the bitrate decision model predict the perceptual video quality of future frames.
The video quality prediction model adopts VMAF (Video Multi-method Assessment Fusion) as its video quality measure; VMAF combines several elementary quality metrics to predict subjective quality. The quality of the video over a period of time is measured by its average video quality: within a time period t, let V_{f_i,bitrate} denote the video quality of the i-th frame f_i when encoded at a given bitrate; the mean of V_{f_i,bitrate} over the period, written V_{t,bitrate}, represents the average video quality in time period t. The quality measure follows the VMAF standard and is normalized so that the scores lie in [0, 1].
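In code, the per-period measure reduces to a normalized mean (a trivial sketch; the per-frame VMAF scores would come from a VMAF tool such as libvmaf):

    def average_video_quality(frame_vmaf_scores):
        """V_{t,bitrate}: mean of the per-frame VMAF scores V_{f_i,bitrate} in a
        time period t, normalized from VMAF's 0-100 scale into [0, 1]."""
        return sum(s / 100.0 for s in frame_vmaf_scores) / len(frame_vmaf_scores)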
The mean square error can be used as the loss function:
[Loss function given as an equation image in the original]
where θ denotes the neural network parameters of the video quality prediction model, V_t is the predicted average video quality score output by the model for the t-th time period, V̂_t is the actual average video quality score of the t-th time period, N is the total number of time periods in the computation, and γ is an adjustment coefficient of the loss function whose value can be obtained from repeated experiments. In addition, a regularization term is added to the loss function to reduce the probability of overfitting the training set, where λ is its adjustment coefficient.
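Since the equation survives only as an image, a plausible reconstruction from the definitions above (an assumption, not the patent's exact formula) is:

    L(\theta) = \frac{\gamma}{N} \sum_{t=1}^{N} \left( V_t - \hat{V}_t \right)^2 + \lambda \lVert \theta \rVert^2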
(2) Code rate decision model
The bitrate decision model can adopt the A3C (Asynchronous Advantage Actor-Critic) algorithm; compared with the plain Actor-Critic algorithm, A3C provides an asynchronous training framework that accelerates convergence. In the reinforcement learning setting, the agent in some state S_t takes a corresponding action A_t and receives a reward; the goal is to choose actions that maximize reward, and the policy network shifts its parameters in the direction that achieves this. The policy network is trained with the policy gradient method. Specifically:
1) State space S_t
The sender, as the agent of the RL problem, observes a set of metrics including the future video quality and the previous network state as the state space S_t; the neural network then selects an action as output, representing the video bitrate for the next time period.
2) Action space A_t
On receiving the state, the agent takes an action according to its policy. The action space is in general discrete, and the output of the policy network is a probability distribution: f(s_t, a_t) is the probability of selecting action a_t in state s_t. Here the action space contains the candidate bitrates for the next time period t. Because the state space is large, a neural network represents the strategy, and the weights of this network are the policy parameters θ.
3) Reward function
The goal is to achieve higher video quality at as low a bitrate as possible, so the QoE mainly evaluates video quality, transmitted bitrate, loss rate, delay gradient, and the smoothness of video switching. Compared with the original algorithm, the invention redesigns the QoE as follows:
[QoE reward function given as an equation image in the original]
where V_t and V_{t-1} denote the predicted average video quality scores output by the video quality prediction model for the t-th and (t-1)-th time periods, B_t is the bitrate chosen by the sender in the t-th time period, D_t is the delay gradient measured by the receiver in the t-th time period, L_t is the receiver's packet loss rate in the t-th time period, T is the number of time periods in the computation, |V_t - V_{t-1}| represents the smoothness of the video quality, and α, β, δ, γ are weights for the different indexes.
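As the reward also survives only as an image, a plausible reconstruction from the listed terms and weights (an assumption, not the patent's exact formula) is:

    \mathrm{QoE} = \frac{1}{T} \sum_{t=1}^{T} \left( \alpha V_t - \beta B_t - \delta D_t - \gamma L_t - \lvert V_t - V_{t-1} \rvert \right)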
4) Network model for the bitrate decision
The output of the Actor network is a 5-dimensional vector with softmax activation, while the final output of the Critic network is a linear value without an activation function, which evaluates the Actor network's decision. The two networks use the same structure: state features are extracted through three neural network layers, and the multi-dimensional feature vectors are fused through a fully connected layer.
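A sketch of the two networks described above; the layer widths are assumptions, while the three-layer feature extraction, fully connected fusion, 5-way softmax actor, and linear critic follow the text:

    import torch
    import torch.nn as nn

    class ActorCritic(nn.Module):
        """Actor and Critic with the same structure: three feature-extraction
        layers fused through fully connected heads."""

        def __init__(self, state_dim: int, n_bitrates: int = 5, hidden: int = 128):
            super().__init__()
            def trunk():
                return nn.Sequential(              # three feature-extraction layers
                    nn.Linear(state_dim, hidden), nn.ReLU(),
                    nn.Linear(hidden, hidden), nn.ReLU(),
                    nn.Linear(hidden, hidden), nn.ReLU(),
                )
            self.actor = nn.Sequential(trunk(), nn.Linear(hidden, n_bitrates),
                                       nn.Softmax(dim=-1))   # 5-dim softmax policy
            self.critic = nn.Sequential(trunk(), nn.Linear(hidden, 1))  # linear value

        def forward(self, state):
            return self.actor(state), self.critic(state)

    # probs, value = ActorCritic(state_dim=9)(torch.rand(1, 9))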
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A streaming media distribution system in a video conference, characterized by comprising a streaming media server, a service server, and a plurality of user terminals, wherein an MCU device and an SFU device are deployed on the streaming media server; when a video conference is created, the users participating in it comprise conference users and onlooker users, a user who is sharing video being a conference user and a user who is not sharing video being an onlooker user, the terminal used by a conference user being called a conference user terminal for short and the terminal used by an onlooker user being called an onlooker user terminal for short, wherein:
the SFU device forwards the media stream sent by each conference user terminal in the video conference to all other conference user terminals and to the MCU device, and forwards the mixed media stream sent by the MCU device to all onlooker user terminals in the video conference;
the MCU device decodes, mixes, and re-encodes all the media streams sent by the SFU device to synthesize one mixed media stream, and then sends the synthesized mixed media stream to the SFU device;
when its user is a conference user, a user terminal captures its own media stream and sends it to the SFU device, then decodes and mixes the media streams of all other conference user terminals forwarded by the SFU device into one mixed media stream and plays it locally; when its user is an onlooker user, it receives and plays the mixed media stream sent by the SFU device.
2. The system of claim 1, wherein a plurality of streaming media servers are deployed simultaneously according to load balancing to realize distributed deployment, each streaming media server being deployed with a matching SFU device and MCU device.
3. The system of claim 1, wherein the service server further comprises:
a mode control device, which sends notification messages to the SFU device and the MCU device when it detects that all users in the video conference are conference users, and then computes a mode adjustment parameter from the average bandwidth occupancy b and the average CPU occupancy c returned by the SFU device and the MCU device:
[Mode adjustment formula given as an equation image in the original, combining the occupancy rates b and c and the conference size n]
where n is the number of user terminals in the current video conference, and the SFU device is notified accordingly: when the mode adjustment parameter is greater than 1, the SFU device is told to adopt the MCU mode; when it is less than 1, the SFU device is told to adopt the SFU mode,
the SFU device further comprises:
a mode adjustment unit, which on receiving the notification message from the service server computes the average bandwidth occupancy b of its own process over a period of time, returns it to the service server, and then executes the mode indicated by the service server's notification: in MCU mode it forwards each received user terminal stream to the MCU device and then sends the mixed media stream produced by the MCU device to all user terminals in the video conference; in SFU mode it forwards each received user terminal stream to the other user terminals in the video conference,
and the MCU device likewise computes the average CPU occupancy c of its own process over a period of time according to the notification message sent by the service server.
4. The system of claim 1, wherein the user terminal or the SFU device comprises:
a video quality prediction model construction unit, which builds and trains a video quality prediction model that predicts the average video quality of the next time period from the video frames of each time period; its input is the set of video frames within one time period, its output is the predicted average video quality score for each candidate bitrate in the next time period, and the model consists of a CNN that extracts spatial picture features and an RNN that extracts temporal features, the input video frames being processed as follows: the video frames are first fed to the CNN, which outputs video frame sequence features; these are then fed to the RNN, whose output is the output value of the video quality prediction model;
a bitrate decision model construction unit, which uses the A3C algorithm to build and train a bitrate decision model that selects the bitrate for the next time period from the state of each time period; its input is the state space of one time period, S_t = (p_{t-1}, v_t, s_{t-1}, d_{t-1}, l_{t-1}), and its output is the predicted selection probability of each candidate bitrate for the next time period, where S_t is the state space of the t-th time period, p_{t-1} is the average video quality score of the video transmitted in the (t-1)-th time period (the period before the t-th), v_t is the predicted average video quality score for the t-th time period, s_{t-1} is the transmission rate of the video stream in the (t-1)-th time period, d_{t-1} is the delay gradient between sender and receiver in the (t-1)-th time period, and l_{t-1} is the packet loss rate of the video frame sequence in the (t-1)-th time period; the values of p_{t-1} and v_t are the video quality prediction model's results for the (t-1)-th and t-th time periods, the value of s_{t-1} is the bitrate decision model's prediction for the (t-1)-th time period, and the values of d_{t-1} and l_{t-1} are obtained from the RTCP control information exchanged between sender and receiver;
and an adaptive bitrate unit, which, whenever its host device sends a media stream to another device, selects the bitrate of the next time period for the sender and receiver of the media stream using the trained video quality prediction and bitrate decision models: in each time period it feeds the video frames received by its host device into the video quality prediction model and stores the model's output for that period; it then takes the RTCP control information of sender and receiver from the previous period, the quality model's outputs for the previous and current periods, and the decision model's output for the previous period, assembles the state space of the period, and feeds it into the bitrate decision model; from the decision model's output it selects the bitrate with the highest selection probability and sends the media stream to the receiver at that bitrate in the next time period.
5. The system of claim 4, wherein in the video quality prediction model the CNN comprises 5 layers, in order: a convolution layer of 64 5x5 convolution kernels and a 3x3 average pooling layer, another convolution layer of 64 3x3 convolution kernels and a 3x3 average pooling layer, and finally a 2x2 max pooling layer; the RNN comprises 2 GRU layers with 64 hidden units; and the mean square error is used to represent the loss function:
[Loss function given as equation images in the original]
where θ denotes the neural network parameters of the video quality prediction model, V_t is the predicted average video quality score output by the model for the t-th time period, V̂_t is the actual average video quality score of the t-th time period, N is the total number of time periods in the computation, and γ is an adjustment coefficient of the loss function; in addition, a regularization term with adjustment coefficient λ is added to the loss function,
and in the bitrate decision model the reward function QoE is designed as follows:
[QoE reward function given as equation images in the original]
where V_t and V_{t-1} denote the predicted average video quality scores output by the video quality prediction model for the t-th and (t-1)-th time periods, B_t is the bitrate chosen by the sender in the t-th time period, D_t is the delay gradient measured by the receiver in the t-th time period, L_t is the receiver's packet loss rate in the t-th time period, T is the number of time periods in the computation, |V_t - V_{t-1}| represents the smoothness of the video quality, and α, β, δ, γ are weights for the different indexes; and the Actor network and the Critic network extract state features through three neural network layers and fuse the multi-dimensional feature vectors through a fully connected layer.
6. A streaming media distribution method in a video conference, characterized by involving a streaming media server, a service server, and a plurality of user terminals, wherein an MCU device and an SFU device are deployed on the streaming media server; when a video conference is created, the users participating in it comprise conference users and onlooker users, a user who is sharing video being a conference user and a user who is not sharing video being an onlooker user, the terminal used by a conference user being called a conference user terminal for short and the terminal used by an onlooker user being called an onlooker user terminal for short, the method comprising:
step one, each on-meeting user terminal in the video conference collects its own media stream and sends it to the SFU device;
step two, the SFU device forwards the media stream sent by each on-meeting user terminal to all other on-meeting user terminals in the video conference and to the MCU device;
step three, each on-meeting user terminal and the MCU device respectively encode, decode and mix all the media streams sent by the SFU device to synthesize one mixed media stream; each on-meeting user terminal then plays its own synthesized mixed media stream locally, and the MCU device sends its synthesized mixed media stream to the SFU device;
step four, the SFU device forwards the mixed media stream sent by the MCU device to all onlooker user terminals in the video conference, and the onlooker user terminals receive and play the mixed media stream sent by the SFU device.
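Illustrative sketch (not part of the claims): the forwarding rules of steps one through four could be organized as below; the class and method names are assumptions, and the MCU's encode/decode/mix pipeline of step three is abstracted behind mcu.send.

    class SFU:
        """Routes media: on-meeting streams fan out to peers and the MCU;
        the MCU's single mixed stream fans out to onlookers."""
        def __init__(self, mcu):
            self.mcu = mcu
            self.on_meeting = set()  # terminals currently sharing video
            self.onlookers = set()   # terminals only watching

        def on_media(self, sender, stream):
            # Step two: forward to every other on-meeting terminal and the MCU.
            for terminal in self.on_meeting - {sender}:
                terminal.send(stream)
            self.mcu.send(stream)    # the MCU mixes all streams (step three)

        def on_mixed_media(self, mixed_stream):
            # Step four: forward the MCU's mixed stream to all onlookers.
            for terminal in self.onlookers:
                terminal.send(mixed_stream)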
7. The method of claim 6, wherein a plurality of streaming media servers are simultaneously deployed in a distributed manner according to load balancing, and an SFU device and an MCU device are deployed on each streaming media server.
8. The method of claim 6, further comprising:
step A1, when detecting that all users in the video conference are on-meeting users, the service server sends a notification message to the SFU device and the MCU device;
step A2, the SFU device and the MCU device respectively calculate the average bandwidth occupancy rate b and the average CPU occupancy rate c of their own processes over a recent period of time and return them to the service server;
step A3, the service server calculates a mode adjustment parameter as a function of the returned values of b and c and the number n of user terminals in the current video conference, and notifies the SFU device accordingly: when the mode adjustment parameter is greater than 1, the SFU device is notified to adopt the MCU mode; when the mode adjustment parameter is less than 1, the SFU device is notified to adopt the SFU mode;
step A4, the SFU device executes the corresponding mode according to the notification message of the service server: in the MCU mode, it forwards each received channel of user-terminal media stream to the MCU device and then sends the mixed media stream returned by the MCU device to all user terminals in the video conference; in the SFU mode, it forwards each received channel of user-terminal media stream to the other user terminals in the video conference.
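Illustrative sketch (not part of the claims): the published text renders the mode adjustment formula only as an image, so the combination of b, c and n below is purely an assumed stand-in showing where the measurements of steps A2 and A3 enter the decision.

    def choose_mode(b_sfu, c_sfu, b_mcu, c_mcu, n):
        """Hypothetical mode adjustment parameter: compare the SFU's cost,
        which grows with the number of terminals n, against the MCU's."""
        k = n * (b_sfu + c_sfu) / (b_mcu + c_mcu)  # assumed formula, not from the filing
        return "MCU" if k > 1 else "SFU"           # >1: MCU mode; <1: SFU mode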
9. The method of claim 6, wherein the user terminal or the SFU device includes a video quality prediction model and a code rate decision model, and the method further comprises:
step A, constructing and training a video quality prediction model, which predicts the average video quality of the next time period from the video frames of the current time period; its input is the video frames of one time period, and its output is the predicted average video quality score values corresponding to different code rates in the next time period; the model is composed of a CNN that extracts spatial picture features and an RNN that extracts time-domain features, and processes input video frames as follows: the video frames are first input into the CNN, the video-frame sequence features output by the CNN are then input into the RNN, and the output of the RNN is the output value of the video quality prediction model;
step B, constructing and training a code rate decision model using the A3C algorithm; the model selects the code rate of the next time period according to the state of the current time period; its input is the state space of one time period, S_t = (p_{t−1}, v_t, s_{t−1}, d_{t−1}, l_{t−1}), and its output is the predicted selection probability of the different code rates in the next time period, wherein S_t denotes the state space of the t-th time period, p_{t−1} denotes the average video quality score of the video transmitted in the (t−1)-th time period, v_t denotes the predicted average video quality score for the t-th time period, s_{t−1} denotes the transmission rate of the video stream in the (t−1)-th time period, d_{t−1} denotes the delay gradient between the sender and the receiver in the (t−1)-th time period, and l_{t−1} denotes the packet loss rate of the video frame sequence in the (t−1)-th time period; the values of p_{t−1} and v_t are the prediction results of the video quality prediction model for the (t−1)-th and t-th time periods, the value of s_{t−1} is the output of the code rate decision model for the (t−1)-th time period, and d_{t−1} and l_{t−1} are obtained from the RTCP control information exchanged between the sender and the receiver,
before the user terminal sends the media stream to the SFU device in step one, or before the SFU device sends the mixed media stream to each onlooker user terminal in step four, the method further includes:
selecting the code rate of the next time period for the sender and the receiver of the media stream through the trained video quality prediction model and code rate decision model: the video frames received by the device in each time period are input into the video quality prediction model, and the model output for each time period is obtained and stored; the RTCP control information of the sender and the receiver in the previous time period, the outputs of the video quality prediction model for the previous and current time periods, and the output of the code rate decision model for the previous time period are then extracted to construct the state space of each time period; the constructed state space of each time period is input into the code rate decision model to obtain its output for that time period; the code rate with the highest selection probability is selected from that output, and the media stream is sent to the receiver at the selected code rate in the next time period.
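Illustrative sketch (not part of the claims): the per-time-period selection loop of claim 9, with the state space S_t assembled from the stored model outputs and RTCP statistics; the bitrate ladder, the dictionary keys, and the callable model interfaces are assumptions of the sketch.

    import numpy as np

    def select_next_bitrate(quality_model, rate_model, frames, rtcp, history,
                            ladder=(300, 600, 1200, 2400, 4800)):  # kbps, assumed
        """Predict quality from this period's frames, build S_t, and pick the
        code rate with the highest selection probability for the next period."""
        v_t = quality_model(frames)               # predicted quality, period t
        state = np.array([history["p_prev"],      # quality predicted for t-1
                          v_t,
                          history["s_prev"],      # code rate chosen for t-1
                          rtcp["delay_gradient"], # from RTCP of period t-1
                          rtcp["loss_rate"]],     # from RTCP of period t-1
                         dtype=np.float32)
        probs = rate_model(state)                 # selection probability per code rate
        choice = ladder[int(np.argmax(probs))]
        history["p_prev"], history["s_prev"] = v_t, choice  # roll history forward
        return choice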
10. The method of claim 9, wherein, in the video quality prediction model, the CNN comprises 5 layers, in sequence: a convolutional layer containing 64 5×5 convolution kernels followed by a 3×3 average pooling layer, another convolutional layer containing 64 3×3 convolution kernels followed by a 3×3 average pooling layer, and finally a 2×2 maximum pooling layer; the RNN comprises 2 GRU layers with 64 hidden units; and the loss function uses the mean square error:

L(θ) = (1/N) · Σ_{t=1}^{N} (V_t − V̂_t)²

where θ denotes the neural network parameters of the video quality prediction model, V_t is the average video quality score predicted value output by the model for the t-th time period, V̂_t is the actual average video quality score value of the t-th time period, N is the total number of time periods over which the loss is computed, and γ is the adjustment factor of the loss function; furthermore, the loss function is adjusted by an additive term weighted by the adjustment factor λ,
in the code rate decision model, the reward function QoE is designed as follows:

QoE = (1/T) · Σ_{t=1}^{T} ( α·V_t + β·B_t − δ·D_t − γ·L_t ) − (1/T) · Σ_{t=2}^{T} |V_t − V_{t−1}|

wherein V_t and V_{t−1} respectively denote the average video quality score predicted values output by the video quality prediction model in the t-th and (t−1)-th time periods, B_t denotes the code rate selected by the sender in the t-th time period, D_t denotes the delay gradient measured by the receiver in the t-th time period, L_t denotes the packet loss rate at the receiver in the t-th time period, T denotes the number of time periods over which the reward is computed, |V_t − V_{t−1}| measures the smoothness of the video quality, and α, β, δ and γ denote the weights for measuring the different indexes; and the Actor network and the Critic network each extract state features through three layers of neural networks and fuse the multi-dimensional feature vectors in a fully connected manner.
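Illustrative sketch (not part of the claims): the Actor and Critic networks of the A3C model, each with three feature-extraction layers followed by fully connected fusion; the hidden width of 128 and the separate feature networks per head are assumptions of the sketch.

    import torch
    import torch.nn as nn

    class ActorCritic(nn.Module):
        def __init__(self, state_dim=5, num_bitrates=5, hidden=128):
            super().__init__()
            def feature_net():  # three layers of state-feature extraction
                return nn.Sequential(
                    nn.Linear(state_dim, hidden), nn.ReLU(),
                    nn.Linear(hidden, hidden), nn.ReLU(),
                    nn.Linear(hidden, hidden), nn.ReLU(),
                )
            self.actor_feat, self.critic_feat = feature_net(), feature_net()
            self.actor_head = nn.Sequential(nn.Linear(hidden, num_bitrates),
                                            nn.Softmax(dim=-1))  # fused via full connection
            self.critic_head = nn.Linear(hidden, 1)

        def forward(self, state):
            probs = self.actor_head(self.actor_feat(state))    # code rate probabilities
            value = self.critic_head(self.critic_feat(state))  # state value estimate
            return probs, value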
CN202210206909.7A 2022-03-04 2022-03-04 Streaming media distribution system and method in video conference Pending CN114640653A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210206909.7A CN114640653A (en) 2022-03-04 2022-03-04 Streaming media distribution system and method in video conference

Publications (1)

Publication Number Publication Date
CN114640653A true CN114640653A (en) 2022-06-17

Family

ID=81947635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210206909.7A Pending CN114640653A (en) 2022-03-04 2022-03-04 Streaming media distribution system and method in video conference

Country Status (1)

Country Link
CN (1) CN114640653A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017096621A1 (en) * 2015-12-11 2017-06-15 华为技术有限公司 Communication device, communication processing method, communication processing apparatus and communication system
CN110035251A (en) * 2019-04-18 2019-07-19 合肥谐桐科技有限公司 The method for realizing code rate control processing based on video conference service end
CN111541860A (en) * 2019-12-30 2020-08-14 宁波菊风系统软件有限公司 Real-time audio transmission system and using method thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HELENA ATTENEDER et al.: "Under control: Audio/Video Conferencing Systems Feed "Surveillance Capitalism" with Students' Data", IEEE, 26 November 2022 (2022-11-26) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115209163A (en) * 2022-06-28 2022-10-18 深圳市欢太科技有限公司 Data processing method, data processing device, storage medium and electronic equipment
CN115209163B (en) * 2022-06-28 2024-07-02 深圳市欢太科技有限公司 Data processing method and device, storage medium and electronic equipment
CN115334058A (en) * 2022-10-13 2022-11-11 北京云中融信网络科技有限公司 Media file playing system and method thereof
CN115379221A (en) * 2022-10-25 2022-11-22 中诚华隆计算机技术有限公司 Streaming media data transmission method and system
CN115379221B (en) * 2022-10-25 2022-12-20 中诚华隆计算机技术有限公司 Streaming media data transmission method and system

Similar Documents

Publication Publication Date Title
CN114640653A (en) Streaming media distribution system and method in video conference
US20200244969A1 (en) Video compression with generative models
Liang et al. Enhancing video rate adaptation with mobile edge computing and caching in software-defined mobile networks
Istepanian et al. Medical QoS provision based on reinforcement learning in ultrasound streaming over 3.5 G wireless systems
Qian et al. A QoE-driven encoder adaptation scheme for multi-user video streaming in wireless networks
US9027046B2 (en) Method and apparatus for peer-to-peer streaming of layered content
Frossard et al. Media streaming with network diversity
Huang et al. Utility-oriented resource allocation for 360-degree video transmission over heterogeneous networks
Wu et al. vSkyConf: Cloud-assisted multi-party mobile video conferencing
US20230138038A1 (en) Reinforcement learning for jitter buffer control
CN113038064B (en) Mobile terminal conference system
Li et al. Metaabr: A meta-learning approach on adaptative bitrate selection for video streaming
CN111447511A (en) Bandwidth allocation method with user perception experience quality
Yang et al. Delay-optimized multi-user VR streaming via end-edge collaborative neural frame interpolation
Farahani et al. ALIVE: A Latency-and Cost-Aware Hybrid P2P-CDN Framework for Live Video Streaming
Yuan et al. Muabr: Multi-user adaptive bitrate algorithm based multi-agent deep reinforcement learning
Ha et al. Reinforcement learning-based resource allocation for streaming in a multi-modal deep space network
Athanasoulis et al. Optimizing qoe and cost in a 3d immersive media platform: a reinforcement learning approach
Chapagain et al. A novel solution for real time video and path quality, and latency minimization: tele-training in surgical education
Smirnov et al. Real-time rate control of webrtc video streams in 5g networks: Improving quality of experience with deep reinforcement learning
Van Tu et al. Improve Video Conferencing Quality with Deep Reinforcement Learning
Hsu et al. Measuring objective visual quality of real-time communication systems in the wild
Chen et al. Real-time super-resolution: A new mechanism for XR over 5G-advanced
Shen et al. SJA: Server-driven joint adaptation of loss and bitrate for multi-party realtime video streaming
Zhang et al. DSJA: Distributed Server-Driven Joint Route Scheduling and Streaming Adaptation for Multi-Party Realtime Video Streaming

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 310013 4th floor, No.398 Wensan Road, Xihu District, Hangzhou City, Zhejiang Province
Applicant after: Xinxun Digital Technology (Hangzhou) Co.,Ltd.
Address before: 310013 4th floor, No.398 Wensan Road, Xihu District, Hangzhou City, Zhejiang Province
Applicant before: EB Information Technology Ltd.