CN110855908A - Multi-party video screen mixing method and device, network equipment and storage medium


Info

Publication number
CN110855908A
CN110855908A (application CN201911128504.0A)
Authority
CN
China
Prior art keywords
video
frame
frame rate
preset
decoded
Prior art date
Legal status
Granted
Application number
CN201911128504.0A
Other languages
Chinese (zh)
Other versions
CN110855908B (en)
Inventor
周骏华
王乐才
方华
宋钦梅
Current Assignee
Zhongchang (Hangzhou) Information Technology Co Ltd
China Mobile Communications Group Co Ltd
Original Assignee
Zhongchang (Hangzhou) Information Technology Co Ltd
China Mobile Communications Group Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhongchang (Hangzhou) Information Technology Co Ltd and China Mobile Communications Group Co Ltd
Priority to CN201911128504.0A
Publication of CN110855908A
Application granted
Publication of CN110855908B
Legal status: Active
Anticipated expiration

Classifications

    • H04N 7/15 Television systems; systems for two-way working; conference systems
    • H04N 19/132 Adaptive coding: sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • H04N 19/14 Adaptive coding controlled by incoming video signal characteristics: coding unit complexity, e.g. amount of activity or edge presence estimation
    • H04N 19/149 Adaptive coding controlled by data rate or code amount at the encoder output, estimating the code amount by means of a model, e.g. mathematical or statistical model
    • H04N 19/176 Adaptive coding characterised by the coding unit, the unit being an image region, the region being a block, e.g. a macroblock
    • H04N 5/265 Studio circuitry: mixing

Abstract

The embodiment of the invention relates to the field of communications technologies and discloses a multi-party video screen mixing method, which comprises the following steps: acquiring encoded frames of N videos to be mixed and an input frame rate of each video, wherein N is a natural number greater than 1; decoding the encoded frames of each video whose input frame rate differs from a preset frame rate to obtain decoded frames, and acquiring characteristic parameters of that video; inputting the characteristic parameters into a motion complexity model to obtain the motion complexity of the video; performing frame interpolation or frame dropping on the decoded frames according to the preset frame rate and the motion complexity to obtain processed decoded frames; and synthesizing the mixed-screen video from the processed decoded frames. The embodiment of the invention also provides a multi-party video screen mixing device, a network device and a storage medium. The method, device, network device and storage medium can improve the display quality of the mixed-screen communication video.

Description

Multi-party video screen mixing method and device, network equipment and storage medium
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a multi-party video mixing method, apparatus, network device, and storage medium.
Background
In multi-party video communication scenarios, such as a multiparty video conference, the videos of the participants need to be mixed into one screen so that the users can conveniently view them and interact.
However, because the terminals participating in the video communication may be of different types, the frame rates of their videos may differ accordingly; these differing frame rates prevent the mixed videos from staying synchronized and degrade the display quality of the mixed-screen communication video.
Disclosure of Invention
The embodiment of the invention aims to provide a multi-party video screen mixing method, device, network device and storage medium that can improve the display quality of the mixed-screen communication video.
In order to solve the above technical problem, an embodiment of the present invention provides a multi-party video screen mixing method, comprising the following steps: acquiring encoded frames of N videos to be mixed and an input frame rate of each video, wherein N is a natural number greater than 1; decoding the encoded frames of each video whose input frame rate differs from a preset frame rate to obtain decoded frames, and acquiring characteristic parameters of that video; inputting the characteristic parameters into a motion complexity model to obtain the motion complexity of the video; performing frame interpolation or frame dropping on the decoded frames according to the preset frame rate and the motion complexity to obtain processed decoded frames; and synthesizing the mixed-screen video from the processed decoded frames.
The embodiment of the invention also provides a multi-party video screen mixing device, which comprises: an encoded frame acquisition module, configured to acquire encoded frames of N videos to be mixed and an input frame rate of each video, wherein N is a natural number greater than 1; a parameter acquisition module, configured to decode the encoded frames of each video whose input frame rate differs from a preset frame rate to obtain decoded frames, and to acquire characteristic parameters of that video; a complexity acquisition module, configured to input the characteristic parameters into a motion complexity model to obtain the motion complexity of the video; a frame processing module, configured to perform frame interpolation or frame dropping on the decoded frames according to the preset frame rate and the motion complexity to obtain processed decoded frames; and a video synthesis module, configured to synthesize the mixed-screen video from the processed decoded frames.
An embodiment of the present invention further provides a network device, including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the multi-party video mixing method.
The embodiment of the invention also provides a computer readable storage medium, which stores a computer program, and the computer program is executed by a processor to realize the multi-party video screen mixing method.
Compared with the prior art, the embodiment of the invention performs frame interpolation or frame dropping on videos whose input frame rate differs from the preset frame rate, so that the frame rate of each processed video equals or approaches the preset frame rate; the mixed videos therefore run at matching frame rates, which improves the synchronization of the videos after screen mixing. Meanwhile, the motion complexity of each video is obtained through the motion complexity model, and frames are interpolated or dropped in light of that complexity, so that the inserted or discarded frames are chosen more reasonably, the video after screen mixing better matches the real motion, and the display quality of the mixed-screen communication video is improved.
In addition, performing frame interpolation or frame dropping on the decoded frames according to the preset frame rate and the motion complexity comprises the following steps: if the input frame rate is greater than the preset frame rate, performing frame dropping on the decoded frames, wherein decoded frames whose motion complexity is less than or equal to a first preset value are dropped preferentially; and if the input frame rate is less than the preset frame rate, performing frame interpolation on the decoded frames, wherein decoded frames whose motion complexity is greater than or equal to a second preset value are interpolated preferentially.
In addition, acquiring the characteristic parameters of a video whose input frame rate differs from the preset frame rate specifically comprises: acquiring the macroblock motion vectors, macroblock coding types and frame-level global motion vector of that video as the characteristic parameters. Because these quantities reflect the motion information of the video, using them as the characteristic parameters fed into the motion complexity model yields a more accurate motion complexity.
In addition, before inputting the characteristic parameters into the motion complexity model, the method further comprises: acquiring video training samples and extracting their characteristic parameters; inputting the characteristic parameters of the video training samples into a deep learning model for training; and using the trained deep learning model as the motion complexity model. Training the deep learning model on the video training samples yields the motion complexity model, from which the motion complexity of any video whose input frame rate differs from the preset frame rate can be obtained; interpolating or dropping frames in light of that complexity makes the mixed-screen video better match the actual motion and improves its synchronization.
In addition, inputting the characteristic parameters of the video training samples into a deep learning model for training specifically comprises: inputting the characteristic parameters of the video training samples into an open-source deep learning framework for training.
In addition, acquiring the encoded frames of the N videos to be mixed and the input frame rate of each video comprises: acquiring the input frame rate of each video separately by Kalman filtering. Kalman filtering also yields the variation range of a video's input frame rate, so the estimated input frame rate can be fine-tuned within that range to better match the actual conditions, which further helps improve the synchronization and fluency of the mixed-screen video.
In addition, synthesizing the mixed-screen video from the processed decoded frames comprises: synthesizing the mixed-screen video from the processed decoded frames together with the encoded frames of any video whose input frame rate equals the preset frame rate.
Drawings
One or more embodiments are illustrated by the corresponding figures in the drawings, which are not meant to be limiting.
Fig. 1 is a schematic flow chart of a multi-party video mixing method according to a first embodiment of the present invention;
fig. 2 is a schematic flowchart of the refinement of S104 in the multi-party video mixing method according to the first embodiment of the present invention;
fig. 3 is a schematic flowchart of the steps performed before S103 in the multi-party video mixing method according to the first embodiment of the present invention;
fig. 4 is a schematic block diagram of a multi-party video mixing device according to a second embodiment of the present invention;
fig. 5 is a schematic structural diagram of a network device according to a third embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the embodiments to give the reader a better understanding of the present application; the technical solutions claimed in the present application can, however, be implemented without these technical details and with various changes and modifications based on the following embodiments.
The first embodiment of the invention relates to a multi-party video screen mixing method: acquire the encoded frames of N videos to be mixed and the input frame rate of each video; decode the encoded frames of each video whose input frame rate differs from a preset frame rate to obtain decoded frames, and acquire the characteristic parameters of that video; input the characteristic parameters into a motion complexity model to obtain the video's motion complexity; perform frame interpolation or frame dropping on the decoded frames according to the preset frame rate and the motion complexity to obtain processed decoded frames; and synthesize the mixed-screen video from the processed decoded frames. Interpolating or dropping frames for the videos that deviate from the preset frame rate keeps the mixed-screen video well synchronized when displayed; meanwhile, obtaining each video's motion complexity from the motion complexity model and interpolating or dropping frames accordingly makes the inserted or discarded frames more reasonable, so that the mixed-screen video better matches the real motion and plays back more smoothly.
It should be noted that the execution body of the embodiment of the present invention is a server that receives the videos; the server may be an independent server or a cluster of multiple servers, and the following description takes a single server as an example.
The specific flow of the multi-party video mixing method provided by the embodiment of the invention is shown in fig. 1, and comprises the following steps:
S101: acquiring the encoded frames of N videos to be mixed and the input frame rate of each video, wherein N is a natural number greater than 1.
The input frame rate is the frame rate of a video as received by the server. Optionally, the input frame rate of each video may be estimated with a low-pass filtering method; preferably, Kalman filtering is used to estimate the input frame rate of each video separately. A Kalman filter is an efficient recursive (autoregressive) filter that estimates the state of a dynamic system from a series of incomplete and noisy measurements.
It should be understood that, because the network environment is unstable, the input frame rate of a video may fluctuate with network conditions, and the purpose of low-pass filtering is to obtain a relatively stable estimate of the input frame rate. Kalman filtering yields both this stable estimate and the range over which the input frame rate varies under the influence of the network. When the network becomes unstable, for example heavily congested, the estimated input frame rate should be fine-tuned, and the variation range obtained from the Kalman filter bounds that adjustment: if a video's stable frame rate estimate is 25, it may be fine-tuned to 26, 27 or 24 within the variation range. The estimate then better matches the actual conditions, which helps improve the synchronization and fluency of the displayed mixed-screen video.
Specifically, the server receives the RTP packets of the N videos to be mixed, parses them to obtain the encoded frames of each video, and estimates each video's input frame rate with the preset low-pass filtering method. For example, after parsing each video's RTP packets into encoded frames, the server can obtain each input frame rate by invoking Kalman filtering.
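As an illustration only, such an estimator can be sketched as a one-dimensional Kalman filter that tracks the frame rate from the inter-arrival times of the parsed frames; the noise constants and the interface below are assumptions made for the example, not values taken from this patent.

    class FrameRateKalman:
        # 1-D Kalman filter over instantaneous frame-rate observations.
        # State: the scalar frame rate; measurement: 1 / inter-arrival time.
        # Process and measurement noise variances are illustrative assumptions.
        def __init__(self, initial_fps=25.0, process_var=0.01, measure_var=4.0):
            self.x = initial_fps   # state estimate (fps)
            self.p = 1.0           # estimate variance
            self.q = process_var   # process noise variance
            self.r = measure_var   # measurement noise variance

        def observe(self, inter_arrival_s):
            z = 1.0 / inter_arrival_s        # instantaneous fps measurement
            self.p += self.q                 # predict: uncertainty grows
            k = self.p / (self.p + self.r)   # Kalman gain
            self.x += k * (z - self.x)       # correct toward the measurement
            self.p *= (1.0 - k)              # shrink uncertainty
            return self.x

        def variation_range(self, n_sigma=2.0):
            # Range within which the true frame rate plausibly lies; the
            # server can fine-tune its estimate inside this range.
            half = n_sigma * self.p ** 0.5
            return (self.x - half, self.x + half)

    # Usage: feed the inter-arrival times of frames parsed from RTP packets.
    kf = FrameRateKalman()
    for dt in (0.040, 0.042, 0.038, 0.041):   # seconds between frames
        stable_fps = kf.observe(dt)
    print(round(stable_fps, 1), kf.variation_range())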
S102: decoding the encoded frames of each video whose input frame rate differs from the preset frame rate to obtain decoded frames, and acquiring the characteristic parameters of that video.
The preset frame rate may be set according to the actual situation and is not limited here. For example, the input frame rates of the N videos may be averaged and the average used as the preset frame rate; or the most common input frame rate among the N videos may be used; and so on. The characteristic parameters of a video are parameters that characterize its motion complexity. Optionally, for a video whose input frame rate differs from the preset frame rate, the characteristic parameters are its macroblock motion vectors, macroblock coding types and frame-level global motion vector, since these reflect the video's motion complexity; other characteristic parameters of the video may also be used, without specific limitation here.
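For concreteness, the two example strategies just mentioned might be sketched as follows; the function name and the tie-breaking rule for the mode are illustrative assumptions.

    from statistics import mean, multimode

    def preset_frame_rate(input_fps, strategy="mean"):
        # Strategy 1: average of the N input frame rates.
        if strategy == "mean":
            return mean(input_fps)
        # Strategy 2: the most common input frame rate among the N videos;
        # ties are broken by taking the largest mode (an assumption).
        return max(multimode(input_fps))

    print(preset_frame_rate([25, 30, 25, 15]))          # 23.75
    print(preset_frame_rate([25, 30, 25, 15], "mode"))  # 25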
Optionally, the server first checks the input frame rates of the N videos: a video whose input frame rate equals the preset frame rate is left unprocessed, while the encoded frames of a video whose input frame rate differs from the preset frame rate are decoded into decoded frames. Because videos already at the preset frame rate need no subsequent frame interpolation or frame dropping, the server only needs to acquire the characteristic parameters of the videos whose input frame rate differs from the preset frame rate.
Specifically, the server may decode a video whose input frame rate differs from the preset frame rate with a general-purpose open-source video codec library, such as FFmpeg, Xvid, x264 or ffdshow, to obtain the decoded frames. Optionally, the server may also use other video codec tools, without limitation here. While decoding a video, the server can obtain its motion information to serve as the characteristic parameters.
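Codec libraries expose macroblock motion vectors as a by-product of decoding. Purely as a self-contained illustration of what this characteristic parameter captures, a naive full-search block-matching estimator over two decoded grayscale frames might look as follows; this is a sketch, not the motion search a real codec performs.

    import numpy as np

    def macroblock_motion_vectors(prev, curr, block=16, search=8):
        # Naive full-search block matching between two grayscale frames.
        # Returns one (dy, dx) vector per block; the mean of all vectors
        # serves as a crude frame-level global motion vector.
        h, w = curr.shape
        vectors = []
        for y in range(0, h - block + 1, block):
            for x in range(0, w - block + 1, block):
                target = curr[y:y + block, x:x + block].astype(np.int32)
                best_sad, best_mv = None, (0, 0)
                for dy in range(-search, search + 1):
                    for dx in range(-search, search + 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy <= h - block and 0 <= xx <= w - block:
                            cand = prev[yy:yy + block, xx:xx + block].astype(np.int32)
                            sad = int(np.abs(target - cand).sum())  # sum of absolute differences
                            if best_sad is None or sad < best_sad:
                                best_sad, best_mv = sad, (dy, dx)
                vectors.append(best_mv)
        mvs = np.array(vectors)
        return mvs, mvs.mean(axis=0)   # per-block vectors, global motion vector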
S103: inputting the characteristic parameters into the motion complexity model to obtain the motion complexity of the video whose input frame rate differs from the preset frame rate.
The motion complexity model can be obtained by first acquiring the characteristic parameters of a number of video samples and then feeding those parameters into a pre-built neural network for training. The neural network may be, for example, a deep neural network, a convolutional neural network, a deep belief network or a recurrent neural network.
Specifically, the server inputs the acquired characteristic parameters of a video whose input frame rate differs from the preset frame rate into the motion complexity model and thereby obtains that video's motion complexity.
S104: performing frame interpolation or frame dropping on the decoded frames according to the preset frame rate and the motion complexity to obtain the processed decoded frames.
Frame interpolation means inserting decoded frames to raise the input frame rate of the video; frame dropping means discarding decoded frames to lower it.
Specifically, the server compares the input frame rate of the video with the preset frame rate. If the input frame rate is greater than the preset frame rate, the video needs frame dropping, so that the input frame rate after processing equals or approaches the preset frame rate; if the input frame rate is less than the preset frame rate, the video needs frame interpolation, so that the input frame rate after processing equals or approaches the preset frame rate.
When interpolating or dropping a video's decoded frames, the server does so according to the video's motion complexity.
In a specific example, the frame interpolation or frame dropping performed on the decoded frames according to the preset frame rate and the motion complexity in S104 specifically includes the following steps, as shown in fig. 2:
S1041: if the input frame rate is greater than the preset frame rate, performing frame dropping on the decoded frames, wherein decoded frames whose motion complexity is less than or equal to a first preset value are dropped preferentially.
S1042: if the input frame rate is less than the preset frame rate, performing frame interpolation on the decoded frames, wherein decoded frames whose motion complexity is greater than or equal to a second preset value are interpolated preferentially.
Optionally, the motion complexity may be expressed as a normalized value, for example a value between 0 and 1, and then bucketed into categories according to that value: for example, [0, 0.1] for still scenes, (0.1, 0.5] for gentle motion, (0.5, 0.8] for complex motion, and (0.8, 1.0] for highly complex motion. The categories corresponding to the value ranges may be set according to actual needs and are not specifically limited here.
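Such a bucketing might be sketched as follows, using the example boundaries above:

    def complexity_category(c):
        # Example mapping of a normalized motion complexity value to a category.
        if c <= 0.1:
            return "still"
        if c <= 0.5:
            return "gentle motion"
        if c <= 0.8:
            return "complex motion"
        return "highly complex motion"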
In S1041, the first preset value may be set according to the actual situation, for example to 0.5, and is not specifically limited here.
Specifically, the server compares the video's input frame rate with the preset frame rate and, if the input frame rate is greater, performs frame dropping on the video's decoded frames, preferentially discarding those whose motion complexity is less than or equal to the first preset value. It should be understood that preferential dropping means discarding at least some of the decoded frames at or below the first preset value, so that the input frame rate of the video becomes the same as or close to the preset frame rate; if all decoded frames at or below the first preset value have been discarded and the input frame rate is still not the same as or close to the preset frame rate, dropping continues among the decoded frames whose motion complexity exceeds the first preset value. Optionally, when dropping the decoded frames at or below the first preset value, they may be further ranked by motion complexity so that the frames with the lowest complexity are dropped first; alternatively, they may be dropped in sequence order without further subdivision, which may be set according to actual needs and is not limited here. Optionally, when dropping must continue among frames above the first preset value, a third preset value may be set so that frames whose motion complexity is at or above the third preset value are retained as far as possible while frames below it are dropped preferentially. The third preset value may be set according to actual needs, for example to 0.8, and is not specifically limited here.
In S1042, the second preset value may be set according to actual needs, for example to 0.1, and is not limited here.
Specifically, the server compares the video's input frame rate with the preset frame rate and, if the input frame rate is less, performs frame interpolation on the decoded frames, preferentially interpolating around those whose motion complexity is greater than or equal to the second preset value. It should be understood that preferential interpolation means inserting frames around at least some of the decoded frames at or above the second preset value, so that the input frame rate of the video becomes the same as or close to the preset frame rate; if frames have been interpolated around all decoded frames at or above the second preset value and the input frame rate is still not the same as or close to the preset frame rate, interpolation continues around the decoded frames whose motion complexity is below the second preset value. Optionally, when interpolating around the decoded frames at or above the second preset value, they may be further ranked by motion complexity so that the frames with the highest complexity are interpolated around first; alternatively, interpolation may simply follow the sequence order without further subdivision, which may be set according to actual needs and is not limited here. It should be noted that when interpolating around a decoded frame whose motion complexity is at or above the second preset value, the frame inserted between that frame's predecessor and successor is computed from those neighboring frames, and characteristic parameters of the video (such as macroblock motion vectors) can be used in the computation so that playback after insertion is smoother. Optionally, when interpolation continues around decoded frames below the second preset value, the inserted frame may simply be a copy of the decoded frame to reduce computation, or it may be obtained by the computation described above, without specific limitation here.
It should be noted that, when dropping or interpolating frames for a video whose input frame rate differs from the preset frame rate so that its input frame rate becomes equal to or close to the preset frame rate, the number of frames to drop or insert may be determined from the difference between the input frame rate and the preset frame rate. For example, if the input frame rate of the video is 27 and the preset frame rate is 25, the number of frames to drop is 27 - 25 = 2.
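A minimal sketch of this selection policy over one second of video follows. The threshold default and the duplicate-as-interpolation placeholder are simplifying assumptions; as described above, a real implementation would synthesize each inserted frame from its neighboring frames and motion vectors rather than duplicate it.

    def adjust_frames(frames, complexity, input_fps, preset_fps, first_preset=0.5):
        # frames: the decoded frames of one second of video.
        # complexity: per-frame motion complexity in [0, 1], same length.
        n_drop = int(input_fps - preset_fps)
        if n_drop > 0:
            # Drop frames with complexity <= first_preset first, least complex
            # first; fall back to more complex frames only if still over rate.
            order = sorted(range(len(frames)),
                           key=lambda i: (complexity[i] > first_preset, complexity[i]))
            dropped = set(order[:n_drop])
            return [f for i, f in enumerate(frames) if i not in dropped]
        if n_drop < 0:
            # Interpolate around the most complex frames first.
            order = sorted(range(len(frames)), key=lambda i: -complexity[i])
            duplicate = set(order[:-n_drop])
            out = []
            for i, f in enumerate(frames):
                out.append(f)
                if i in duplicate:
                    out.append(f)   # placeholder for a synthesized in-between frame
            return out
        return list(frames)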
S105: synthesizing the mixed-screen video from the processed decoded frames.
Screen mixing means compositing the videos to be mixed into one screen.
Specifically, the server mixes the decoded frames that have undergone frame dropping or frame interpolation, synthesizes the mixed-screen video, and then encodes and outputs it. It should be understood that there may be videos whose input frame rate equals the preset frame rate and which therefore need no frame dropping or interpolation; if such videos exist, the server mixes the processed decoded frames together with those videos, synthesizes the mixed-screen video, and outputs it.
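As an illustration of the composition step only, a 2 x 2 grid over equally sized decoded frames might be sketched like this; the layout and resolutions are assumptions, since the patent does not prescribe a particular arrangement.

    import numpy as np

    def compose_grid(frames, rows=2, cols=2):
        # Tile decoded frames (H x W x 3 uint8 arrays of equal size)
        # into one rows x cols mixed-screen picture.
        assert len(frames) == rows * cols
        h, w, _ = frames[0].shape
        canvas = np.zeros((rows * h, cols * w, 3), dtype=np.uint8)
        for idx, frame in enumerate(frames):
            r, c = divmod(idx, cols)
            canvas[r * h:(r + 1) * h, c * w:(c + 1) * w] = frame
        return canvas

    # One mixed-screen frame per tick of the preset frame rate: take the
    # current processed frame from each of the four videos and tile them.
    tiles = [np.full((360, 640, 3), i * 60, dtype=np.uint8) for i in range(4)]
    mixed = compose_grid(tiles)   # a 720 x 1280 x 3 mixed-screen frame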
Compared with the prior art, the multi-party video screen mixing method provided by the embodiment of the invention performs frame interpolation or frame dropping on videos whose input frame rate differs from the preset frame rate, so that the frame rate of each processed video equals or approaches the preset frame rate; the mixed videos therefore run at matching frame rates, which improves the synchronization of the videos after screen mixing. Meanwhile, the motion complexity of each video is obtained through the motion complexity model, and frames are interpolated or dropped in light of that complexity, so that the inserted or discarded frames are chosen more reasonably, the video after screen mixing better matches the real motion, and the display quality of the mixed-screen communication video is improved.
In a specific example, before S103, that is, before the characteristic parameters are input into the motion complexity model, the method further includes the following steps, as shown in fig. 3:
S201: acquiring video training samples and extracting the characteristic parameters of the video training samples.
S202: inputting the characteristic parameters of the video training samples into a deep learning model for training.
S203: using the trained deep learning model as the motion complexity model.
In S201, the video training samples may consist of video data collected from actual calls in various scenarios, or may be videos recorded specifically for the training requirements; this is not limited here.
In S202, the deep learning model is a learning model built on a deep neural network and is used to learn the motion information of videos so as to obtain their motion complexity. Optionally, the deep learning model may be built with a general open-source deep learning framework, such as TensorFlow, Torch or Caffe; that is, the characteristic parameters of the video training samples are input into the open-source deep learning framework for training.
Specifically, the server acquires the video training samples, performs encoding and decoding operations on them, extracts their characteristic parameters, and inputs those parameters into the deep learning model for training; once the deep learning model reaches a preset accuracy, the trained model is used as the motion complexity model.
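A minimal sketch of such a training setup is given below, using TensorFlow's Keras API. The network shape, the assumed feature layout (statistics of the macroblock motion vectors, the fraction of intra-coded macroblocks, and the global motion vector magnitude) and the random placeholder data are all illustrative assumptions; the patent does not fix an architecture, and real labels would come from annotated training samples.

    import numpy as np
    import tensorflow as tf

    # Placeholder training data: one row of characteristic parameters per
    # frame of the training videos, label = motion complexity in [0, 1].
    x_train = np.random.rand(10000, 4).astype("float32")
    y_train = np.random.rand(10000, 1).astype("float32")

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(4,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # complexity in [0, 1]
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    model.fit(x_train, y_train, epochs=5, batch_size=64, validation_split=0.1)

    # Inference in S103: feed a frame's characteristic parameters.
    features = np.array([[0.3, 0.1, 0.05, 0.2]], dtype="float32")
    complexity = float(model.predict(features, verbose=0)[0, 0])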
Inputting the video training samples into the deep learning model for training yields the motion complexity model, from which the motion complexity of any video whose input frame rate differs from the preset frame rate can be obtained; interpolating or dropping frames in light of that complexity makes the mixed-screen video better match the actual motion and improves the synchronization and smoothness of its display.
The steps of the above methods are divided for clarity of description; in implementation, they may be merged into one step, or a step may be split into several steps, and as long as the same logical relationship is preserved, such variants are within the protection scope of this patent; adding insignificant modifications to the algorithm or process, or introducing insignificant designs, without changing the core design of the algorithm or process also falls within the protection scope of this patent.
A second embodiment of the present invention relates to a multi-party video mixing device, as shown in fig. 4, including: an encoded frame acquisition module 301, a parameter acquisition module 302, a complexity acquisition module 303, a frame processing module 304, and a video composition module 305.
The encoding frame acquiring module 301 is configured to acquire encoding frames of N videos to be mixed and an input frame rate of each of the videos, where N is a natural number greater than 1;
a parameter obtaining module 302, configured to decode the encoded frames of the video with the input frame rate different from the preset frame rate to obtain decoded frames, and obtain characteristic parameters of the video with the input frame rate different from the preset frame rate;
a complexity obtaining module 303, configured to input the feature parameter to a motion complexity model, so as to obtain a motion complexity of a video with the input frame rate different from a preset frame rate;
a frame processing module 304, configured to perform frame interpolation or frame dropping on the decoded frames according to the preset frame rate and the motion complexity to obtain processed decoded frames;
and a video synthesizing module 305, configured to synthesize a mixed-screen video according to the processed decoded frames.
Further, the frame processing module 304 is further configured to:
if the input frame rate is greater than the preset frame rate, perform frame dropping on the decoded frames, preferentially dropping decoded frames whose motion complexity is less than or equal to a first preset value;
and if the input frame rate is less than the preset frame rate, perform frame interpolation on the decoded frames, preferentially interpolating around decoded frames whose motion complexity is greater than or equal to a second preset value.
Further, the parameter acquisition module 302 is further configured to acquire the macroblock motion vectors, macroblock coding types and frame-level global motion vector of a video whose input frame rate differs from the preset frame rate as the characteristic parameters.
Further, the multi-party video mixing device provided by the embodiment of the present invention further includes a model determining module, wherein the model determining module is configured to:
acquiring a video training sample, and extracting characteristic parameters of the video training sample;
inputting the characteristic parameters of the video training samples into a deep learning model for training;
and taking the trained deep learning model as the motion complexity model.
Further, the model determination module is further configured to input the characteristic parameters of the video training samples into an open-source deep learning framework for training.
Further, the encoded frame acquisition module 301 is further configured to acquire the input frame rate of each video separately by Kalman filtering.
Further, the video synthesis module 305 is further configured to synthesize the mixed-screen video from the processed decoded frames together with the encoded frames of any video whose input frame rate equals the preset frame rate.
It should be understood that this embodiment is an example of the apparatus corresponding to the first embodiment, and may be implemented in cooperation with the first embodiment. The related technical details mentioned in the first embodiment are still valid in this embodiment, and are not described herein again in order to reduce repetition. Accordingly, the related-art details mentioned in the present embodiment can also be applied to the first embodiment.
It should be noted that each module in this embodiment is a logical module; in practical applications, a logical unit may be one physical unit, part of one physical unit, or a combination of multiple physical units. In addition, to highlight the innovative part of the present invention, units less closely related to solving the technical problem posed by the present invention are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.
A third embodiment of the invention is directed to a network device, as shown in fig. 5, comprising at least one processor 401; and a memory 402 communicatively coupled to the at least one processor 401; the memory 402 stores instructions executable by the at least one processor 401, and the instructions are executed by the at least one processor 401 to enable the at least one processor 401 to execute the multi-party video mixing method.
The memory 402 and the processor 401 are connected by a bus, which may comprise any number of interconnected buses and bridges linking one or more circuits of the processor 401 and the memory 402. The bus may also connect various other circuits, such as peripherals, voltage regulators and power management circuits, which are well known in the art and therefore not described further here. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as multiple receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. Data processed by the processor 401 may be transmitted over a wireless medium via an antenna, which can also receive incoming data and pass it to the processor 401.
The processor 401 is responsible for managing the bus and general processing and may provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory 402 may be used to store data used by processor 401 in performing operations.
A fourth embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer program realizes the above-described method embodiments when executed by a processor.
That is, those skilled in the art will understand that all or part of the steps of the methods in the above embodiments may be implemented by a program instructing the relevant hardware; the program is stored in a storage medium and includes several instructions for causing a device (such as a single-chip microcomputer or a chip) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the invention, and that various changes in form and details may be made therein without departing from the spirit and scope of the invention in practice.

Claims (10)

1. A multi-party video screen mixing method is characterized by comprising the following steps:
acquiring coding frames of N videos to be mixed and an input frame rate of each video, wherein N is a natural number greater than 1;
decoding the coded frames of the video with the input frame rate different from the preset frame rate to obtain decoded frames, and acquiring characteristic parameters of the video with the input frame rate different from the preset frame rate;
inputting the characteristic parameters into a motion complexity model to obtain the motion complexity of the video with the input frame rate different from a preset frame rate;
performing frame interpolation or frame dropping on the decoded frames according to the preset frame rate and the motion complexity to obtain processed decoded frames;
and synthesizing the mixed screen video according to the processed decoded frame.
2. The multi-party video mixing method according to claim 1, wherein performing frame interpolation or frame dropping on the decoded frames according to the preset frame rate and the motion complexity comprises:
if the input frame rate is greater than the preset frame rate, performing frame dropping on the decoded frames, wherein decoded frames whose motion complexity is less than or equal to a first preset value are dropped preferentially;
and if the input frame rate is less than the preset frame rate, performing frame interpolation on the decoded frames, wherein decoded frames whose motion complexity is greater than or equal to a second preset value are interpolated preferentially.
3. The multi-party video mixing method according to claim 1, wherein the obtaining of the characteristic parameters of the video with the input frame rate different from the preset frame rate specifically comprises:
and acquiring macroblock motion vectors, macroblock coding types and a frame-level global motion vector of the video whose input frame rate differs from the preset frame rate as the characteristic parameters.
4. The multi-party video mixing method of claim 1, wherein before inputting the feature parameters into a motion complexity model, further comprising:
acquiring a video training sample, and extracting characteristic parameters of the video training sample;
inputting the characteristic parameters of the video training samples into a deep learning model for training;
and taking the trained deep learning model as the motion complexity model.
5. The multi-party video mixing method according to claim 4, wherein the feature parameters of the video training samples are input into a deep learning model for training, specifically:
and inputting the characteristic parameters of the video training samples into an open source deep learning framework for training.
6. The multi-party video mixing method according to claim 1, wherein said obtaining encoded frames of N videos to be mixed and an input frame rate of each of the videos comprises:
and respectively acquiring the input frame rate of each video by adopting Kalman filtering.
7. The multi-party video mixing method according to claim 1, wherein said synthesizing of the mixed screen video according to the processed decoded frames comprises:
and synthesizing the mixed screen video according to the processed decoded frame and the coded frame of the video with the input frame rate being the same as the preset frame rate.
8. A multi-party video mixing device, comprising:
an encoded frame acquisition module, configured to acquire encoded frames of N videos to be mixed and an input frame rate of each video, wherein N is a natural number greater than 1;
the parameter acquisition module is used for decoding the coded frames of the video with the input frame rate different from the preset frame rate to obtain decoded frames and acquiring the characteristic parameters of the video with the input frame rate different from the preset frame rate;
the complexity obtaining module is used for inputting the characteristic parameters into a motion complexity model to obtain the motion complexity of the video with the input frame rate different from a preset frame rate;
a frame processing module, configured to perform frame interpolation or frame dropping on the decoded frames according to the preset frame rate and the motion complexity to obtain processed decoded frames;
and the video synthesis module is used for synthesizing the mixed screen video according to the processed decoded frame.
9. A network device, comprising:
at least one processor; and the number of the first and second groups,
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the multi-party video mixing method of any one of claims 1-7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the multi-party video mixing method according to any one of claims 1 to 7.
CN201911128504.0A, filed 2019-11-18: Multi-party video screen mixing method and device, network equipment and storage medium. Granted as CN110855908B (Active).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911128504.0A CN110855908B (en) 2019-11-18 2019-11-18 Multi-party video screen mixing method and device, network equipment and storage medium


Publications (2)

Publication Number Publication Date
CN110855908A 2020-02-28
CN110855908B 2022-09-27

Family ID: 69602021

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911128504.0A Active CN110855908B (en) 2019-11-18 2019-11-18 Multi-party video screen mixing method and device, network equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110855908B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102082896A (en) * 2011-03-04 2011-06-01 中山大学 Method for treating video of liquid crystal display device
CN104660916A (en) * 2013-11-18 2015-05-27 杭州海康威视数字技术股份有限公司 Screen splicing system and processing method of video data streams
CN104092974A (en) * 2014-05-27 2014-10-08 浙江工业大学 Multi-channel video clip splicing system and method based on moving object identity
CN104363508A (en) * 2014-11-21 2015-02-18 浙江宇视科技有限公司 Image stitching method and device for preventing video rollback
US20160295127A1 (en) * 2015-04-02 2016-10-06 Ultracker Technology Co., Ltd. Real-time image stitching apparatus and real-time image stitching method
CN105763919A (en) * 2016-04-14 2016-07-13 福州瑞芯微电子股份有限公司 Method and device for display and video synchronization
US20180198989A1 (en) * 2017-01-12 2018-07-12 Gopro, Inc. Phased Camera Array System For Generation of High Quality Images and Video

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111741303A (en) * 2020-06-09 2020-10-02 Oppo广东移动通信有限公司 Deep video processing method and device, storage medium and electronic equipment
CN112511768A (en) * 2020-11-27 2021-03-16 上海网达软件股份有限公司 Multi-picture synthesis method, device, equipment and storage medium
CN112511768B (en) * 2020-11-27 2024-01-02 上海网达软件股份有限公司 Multi-picture synthesis method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN110855908B (en) 2022-09-27

Similar Documents

Publication Publication Date Title
US9872021B2 (en) Video conversion method and apparatus
CN103826133B (en) Motion compensated frame rate up conversion method and apparatus
CN102625106B (en) Scene self-adaptive screen encoding rate control method and system
CN111147893B (en) Video self-adaption method, related equipment and storage medium
CN110855908B (en) Multi-party video screen mixing method and device, network equipment and storage medium
CN108134918A (en) Method for processing video frequency, device and multipoint video processing unit, conference facility
CN110708570B (en) Video coding rate determining method, device, equipment and storage medium
CN110225340B (en) Control method and device for video coding, computing equipment and storage medium
CN106454348B (en) A kind of video coding-decoding method and device
WO2013085585A1 (en) Syntax extension of adaptive loop filter in hevc
CN101242513A (en) Dual-stream transmission method in video conference and video conference system
WO2021057477A1 (en) Video encoding and decoding method and related device
WO2021057697A1 (en) Video encoding and decoding methods and apparatuses, storage medium, and electronic device
CN111617466B (en) Method and device for determining coding format and method for realizing cloud game
KR20160099977A (en) Data processing apparatus and data processing method for videoconferencing service
CN115529300A (en) System and method for automatically adjusting key frame quantization parameter and frame rate
WO2021057705A1 (en) Video encoding and decoding methods, and related apparatuses
CN111954034A (en) Video coding method and system based on terminal equipment parameters
CN110572672A (en) Video encoding and decoding method and device, storage medium and electronic device
CN110572677A (en) video encoding and decoding method and device, storage medium and electronic device
CN114222169B (en) Video streaming method, communication device and computer readable storage medium
CN110730362A (en) Low-flow video communication transmission system and method
CN114374841A (en) Optimization method and device for video coding rate control and electronic equipment
AU2016201449B2 (en) Encoding and decoding using perceptual representations
CN111327864A (en) Video call control method and device, terminal equipment and storage medium

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant