CN113923398A - Video conference realization method and device - Google Patents

Video conference realization method and device

Info

Publication number
CN113923398A
Authority
CN
China
Prior art keywords
frame image
code rate
video
video conference
salient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111160016.5A
Other languages
Chinese (zh)
Inventor
Wang Letian (王乐天)
Tang Zhongzhe (汤仲喆)
Duan Xiaoyan (段小燕)
Sun Menglei (孙孟雷)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202111160016.5A priority Critical patent/CN113923398A/en
Publication of CN113923398A publication Critical patent/CN113923398A/en
Pending legal-status Critical Current


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/15 Conference systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/061 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/194 Segmentation; Edge detection involving foreground-background segmentation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/25 Flow control; Congestion control with rate being modified by the source upon detecting a change of network conditions
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/48 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using compressed domain processing techniques other than decoding, e.g. modification of transform coefficients, variable length coding [VLC] data or run-length data
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51 Motion estimation or motion compensation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51 Motion estimation or motion compensation
    • H04N19/537 Motion estimation other than block-based
    • H04N19/543 Motion estimation other than block-based using regions

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Neurology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application provides a video conference realization method and device, which relate to the field of artificial intelligence and can also be used in the field of finance. The method comprises the following steps: determining a salient region of each video frame image by using a pre-trained salient object detection model; grouping the video frame images and selecting, from each group, the video frame image with the largest salient region as a key frame image; and transmitting each key frame image according to the current maximum actual code rate of the network to complete the video conference. The method and device can improve the video conference experience under low-bandwidth conditions, adjust the video data transmission amount in real time, reduce network congestion, and highlight key content.

Description

Video conference realization method and device
Technical Field
The application relates to the field of artificial intelligence and can be used in the field of finance; in particular, it relates to a video conference implementation method and device.
Background
With the development of the internet, video conference technology has matured. Compared with an offline conference, a video conference is free from spatial limitations and allows more efficient real-time communication. However, video conferencing still has technical problems that affect the actual communication effect: for example, when the network request volume is too large and the bandwidth is insufficient, unsynchronized sound and picture and choppy video easily occur.
In view of the above phenomena, the skilled person realizes that only some foreground-region images in the conference video, such as human faces and slide contents, must be displayed. Most background-region images bring no useful information to the participants and waste a large amount of bandwidth. Worse still, the display proportion occupied by the background-region image is often much larger than that occupied by the salient foreground-region image. How to segment the salient foreground-region image from the background-region image, and thereby reduce the unnecessary data transmission volume, is a technical problem to be solved.
Disclosure of Invention
Aiming at the problems in the prior art, the application provides a video conference implementation method and device that can improve the video conference experience under low-bandwidth conditions, adjust the video data transmission amount in real time, reduce network congestion, and highlight key content.
In order to solve the technical problem, the application provides the following technical scheme:
in a first aspect, the present application provides a method for implementing a video conference, including:
determining a salient region of each video frame image by using a pre-trained salient object detection model;
grouping the video frame images, and selecting from each group the video frame image with the largest salient region as a key frame image;
and transmitting each key frame image according to the current maximum actual code rate of the network to complete the video conference.
Further, the step of training the salient object detection model in advance includes:
learning image features of historical video frame images by using a residual network model; the image features comprise contrast features, brightness features and pixel features;
updating neuron parameters in the residual network model according to the learned image features;
and performing semantic-dimension learning according to the updated neuron parameters to improve the detection precision of the residual network model, obtaining the salient object detection model.
Further, determining the salient region of each video frame image by using the pre-trained salient object detection model includes:
performing binarization processing on the pixel points of each video frame image by using the salient object detection model to obtain the salient region.
Further, grouping the video frame images and selecting from each group the video frame image with the largest salient region as a key frame image includes:
determining the number of salient pixel points in the salient region of each video frame image;
and selecting from the group the video frame image containing the most salient pixel points as the key frame image.
Further, transmitting each key frame image according to the current maximum actual code rate of the network to complete the video conference includes:
determining the current maximum actual code rate by using a congestion control algorithm;
determining the transmission code rate of the salient region and the transmission code rate of the non-salient region in each key frame image according to the current maximum actual code rate;
and transmitting the salient region according to the transmission code rate of the salient region, and transmitting the non-salient region according to the transmission code rate of the non-salient region, to complete the video conference.
In a second aspect, the present application provides a video conference implementation apparatus, including:
a salient region determining unit, configured to determine the salient region of each video frame image by using a pre-trained salient object detection model;
a key frame selecting unit, configured to group the video frame images and select from each group the video frame image with the largest salient region as a key frame image;
and a video conference implementation unit, configured to transmit each key frame image according to the current maximum actual code rate of the network to complete the video conference.
Further, the video conference implementation apparatus further includes:
an image feature learning unit, configured to learn image features of historical video frame images by using a residual network model; the image features comprise contrast features, brightness features and pixel features;
a parameter updating unit, configured to update neuron parameters in the residual network model according to the learned image features;
and a model generation unit, configured to perform semantic-dimension learning according to the updated neuron parameters and improve the detection precision of the residual network model, obtaining the salient object detection model.
Further, the salient region determining unit is specifically configured to perform binarization processing on pixel points of each video frame image by using the salient object detection model to obtain the salient region.
Further, the key frame selecting unit includes:
a pixel point number determining module, configured to determine the number of salient pixel points in the salient region of each video frame image;
and a key frame selecting module, configured to select from the group the video frame image containing the most salient pixel points as the key frame image.
Further, the video conference implementation unit includes:
an actual code rate determining module, configured to determine the current maximum actual code rate by using a congestion control algorithm;
a transmission code rate determining module, configured to determine, according to the current maximum actual code rate, the transmission code rate of the salient region and the transmission code rate of the non-salient region in each key frame image;
and a transmission module, configured to transmit the salient region according to the transmission code rate of the salient region and transmit the non-salient region according to the transmission code rate of the non-salient region, to complete the video conference.
In a third aspect, the present application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the video conference implementation method when executing the program.
In a fourth aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the video conference implementation method.
Aiming at the problems in the prior art, the video conference implementation method and device provided by the application process the video to be transmitted in real time using a salient object detection algorithm and extract key frames. They can still ensure the smooth progress of the video conference under low bandwidth and network fluctuation, improve the video conference experience under those conditions, adjust the video data transmission amount in real time, reduce network congestion, and highlight key content.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of a video conference implementation method in an embodiment of the present application;
FIG. 2 is a flowchart of the steps for training a salient object detection model in an embodiment of the present application;
FIG. 3 is a flow chart of determining a key frame image in an embodiment of the present application;
FIG. 4 is a flow chart of completing a video conference in an embodiment of the present application;
fig. 5 is one of the structural diagrams of the video conference realization apparatus in the embodiment of the present application;
fig. 6 is a second block diagram of a video conference realizing apparatus in the embodiment of the present application;
FIG. 7 is a block diagram of a key frame selection unit according to an embodiment of the present disclosure;
fig. 8 is a structural diagram of a video conference implementing unit in the embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device in an embodiment of the present application;
fig. 10 is a schematic diagram of a salient object detection algorithm model in the embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
When a video conference is carried out and the network resource request volume is too large while the actual bandwidth is insufficient, unsynchronized sound and picture and choppy video easily occur. In view of the above phenomena, the skilled person realizes that only some foreground-region images in the conference video, such as human faces and slide contents, must be displayed. Most background-region images bring no useful information to the participants and waste a large amount of bandwidth. Worse still, the display proportion occupied by the background-region image is often much larger than that occupied by the salient foreground-region image. How to segment the salient foreground-region image from the background-region image, and thereby reduce the unnecessary data transmission volume, is a technical problem to be solved.
The human visual system possesses an important mechanism for processing visual information: the visual attention mechanism, by which the human eye usually locates only the regions of interest to the viewer. When people observe a scene, the visual system subconsciously guides the eyeballs and preferentially processes the region of interest; this is the visual attention mechanism at work. Using it, the human visual system can process the regions of different scenes separately and extract the foreground and the background, greatly reducing the amount of visual data to be processed and the brain's computational consumption. A salient object detection algorithm simulates the human visual attention mechanism and computes how strongly each part of an image attracts visual attention, so that image information can be processed efficiently. With salient object detection, an image can be locally compressed, the data transmission amount adjusted, and the problem of choppy video alleviated.
It should be noted that the video conference implementation method and apparatus provided by the present application may be used in the financial field, and may also be used in any field other than the financial field.
In an embodiment, referring to fig. 1, in order to improve the video conference experience under the low bandwidth condition, adjust the transmission amount of video data in real time, reduce network congestion, and highlight important content, the present application provides a video conference implementation method, including:
s101: determining a salient region of each video frame image by using a pre-trained salient object detection model;
s102: grouping the video frame images, and respectively selecting the video frame image with the largest salient region from each group as a key frame image;
s103: and transmitting each key frame image according to the current maximum actual code rate of the network to complete the video conference.
It can be understood that the embodiment of the present application may divide the video conference data into a plurality of video frame images, and then determine, by using a pre-trained salient object detection model, the region in each video frame image that interests people most and most needs to be transmitted, that is, the salient region. The role of the salient object detection model at least includes determining the salient regions of the video frame images. The model is obtained by pre-training on historical video frame images; the specific training process is explained in detail below.
After the salient regions of the video frame images are determined, the video frame images may be grouped, that is, the video conference data stream may be divided into equal portions. For example, every five frames or every ten frames may form a group, but the application is not limited thereto. The video frame image with the largest salient region is then selected from each group as the key frame image. The purpose is to select, from several adjacent video frame images, the one frame that interests people most and most needs to be transmitted, thereby reducing the transmission data volume and improving transmission efficiency.
Finally, the current maximum actual code rate of the network for transmitting the video frame images is determined by using a congestion control algorithm, so that each key frame image is transmitted according to the current maximum actual code rate of the network to complete the video conference.
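For illustration, the three steps can be sketched as follows in Python; the detector, bandwidth estimator, transmit function, and the group size of 5 are placeholders assumed for the example, not this application's actual code.

```python
# Minimal sketch of steps S101-S103; all callables are illustrative placeholders.

def video_conference_pipeline(frames, detect_salient_region,
                              estimate_max_bitrate, transmit, group_size=5):
    """frames: list of video frame images (e.g. numpy arrays)."""
    # S101: salient region (binary mask) of each video frame image
    masks = [detect_salient_region(f) for f in frames]

    # S102: group the frames; per group, keep the frame whose salient
    # region is largest (most salient pixel points)
    key_frames = []
    for start in range(0, len(frames), group_size):
        group = list(zip(frames[start:start + group_size],
                         masks[start:start + group_size]))
        key_frames.append(max(group, key=lambda fm: int(fm[1].sum())))

    # S103: transmit each key frame image at the current maximum
    # actual code rate of the network
    for frame, mask in key_frames:
        transmit(frame, mask, max_rate=estimate_max_bitrate())
```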
From the above description, the video conference implementation method provided by the application processes the video to be transmitted in real time using a salient object detection algorithm and extracts key frames. It can still ensure the smooth progress of the video conference under low bandwidth and network fluctuation, improve the video conference experience under those conditions, adjust the video data transmission amount in real time, reduce network congestion, and highlight key content.
In one embodiment, referring to fig. 2, the step of training the salient object detection model in advance includes:
s201: learning image features of historical video frame images by using a residual network model; the image features comprise contrast features, brightness features and pixel features;
s202: updating neuron parameters in the residual network model according to the learned image features;
s203: and performing semantic-dimension learning according to the updated neuron parameters to improve the detection precision of the residual network model, obtaining the salient object detection model.
It can be understood that, in a video conference, the video conference data needs to be compressed for efficient transmission. In the embodiment of the application, compression is achieved by selecting key frame images, and selecting a key frame image can be understood as extracting a video frame image. Conventionally, image extraction takes 1 frame out of every N frames in the order in which the video frame images appear in the video conference. However, this extraction method only achieves basic data compression and cannot guarantee that the extracted image is the most meaningful frame of the N frames. Therefore, the embodiment of the present application trains a salient object detection model so that the 1 frame of most interest can be selected from the N video frame images as the key frame image.
The specific training steps are as follows:
Firstly, a residual network model is used to learn the image features of historical video frame images in advance, which improves the training speed.
A convolutional neural network mainly comprises convolutional layers, pooling layers, activation function layers, and fully-connected layers. The convolutional layer performs convolution operations with convolution kernels to extract data features. The pooling layer compresses the input data, reducing the parameter count. The activation function layer increases the ability of the whole convolutional neural network to fit complex non-linear cases. The fully-connected layer maps the input data into different categories and finally outputs the network's predicted values. A residual network model passes the front layer to the rear layer through identity mappings in shortcut connections, which is equivalent to composing a shallow network with an identity mapping function; each block therefore only has to learn the residual function to achieve its part of the training goal. Because this adds no extra computational burden to networks with many layers, residual networks are often used in image processing and similar scenarios.
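As an illustration of this idea only (the patent does not give network code; the layer sizes here are assumed), a residual block can be sketched in PyTorch as follows:

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Shortcut connection: the identity mapping carries the front layer's
    output to the rear layer, so the block only learns the residual F(x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        residual = self.conv2(F.relu(self.conv1(x)))
        return F.relu(x + residual)  # identity mapping + learned residual
```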
To learn the image features of historical video frame images in advance and reduce the time required for model training, the embodiment of the present application uses parameters pre-trained on large data sets as the initial parameters for model training.
Secondly, the features of different layers are combined through a multi-scale fusion module, driving the neuron parameters in the neural network to vary over a larger range.
The skip connections of the multi-scale fusion module reuse the features of previous layers and improve gradient back-propagation; previous layers receive additional supervision from the loss function over a short path, making the convolutional neural network easier to train. To enlarge the receptive field, the embodiment of the application adopts four asymmetric convolution kernels of different scales; the effect of asymmetric convolution approximates that of a square convolution while greatly reducing the parameter count. It should be noted that not all parameters in a square convolution kernel are equally significant: the parameters at the center-cross positions matter more, while those at the corner positions matter less. The parameters at the center-cross positions can be enhanced using horizontal and vertical one-dimensional convolutions. The neuron parameters in the residual network model can then be updated according to the learned image features.
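A sketch of such an asymmetric multi-scale fusion module is given below; the four kernel scales and the channel counts are illustrative assumptions rather than the values used by the application.

```python
import torch.nn as nn

class AsymmetricConv(nn.Module):
    """A 1 x k convolution followed by a k x 1 convolution: approximates a
    k x k square convolution with 2k instead of k*k weights per channel
    pair, while the horizontal and vertical passes reinforce the more
    important center-cross positions of the kernel."""
    def __init__(self, channels, k):
        super().__init__()
        self.horizontal = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2))
        self.vertical = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0))

    def forward(self, x):
        return self.vertical(self.horizontal(x))

class MultiScaleFusion(nn.Module):
    """Four asymmetric branches of different scales plus a skip connection
    that reuses the previous layer's features, shortening the path from
    the loss function back to earlier layers."""
    def __init__(self, channels, scales=(3, 5, 7, 9)):
        super().__init__()
        self.branches = nn.ModuleList(AsymmetricConv(channels, k) for k in scales)

    def forward(self, x):
        return x + sum(branch(x) for branch in self.branches)
```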
Thirdly, high-dimensional spatial feature vectors of the semantic dimension are learned by decomposing visual features.
The quality of an image can be judged along various dimensions. Among them, the dimensions preferred for modeling human visual perception are the luminance features, the contrast features, and the structural features.
The luminance feature can be expressed as:

$$l(x, y) = \frac{2\mu_x\mu_y + c_1}{\mu_x^2 + \mu_y^2 + c_1}$$

The contrast feature can be expressed as:

$$c(x, y) = \frac{2\sigma_x\sigma_y + c_2}{\sigma_x^2 + \sigma_y^2 + c_2}$$

The structural feature can be expressed as:

$$s(x, y) = \frac{\sigma_{xy} + c_3}{\sigma_x\sigma_y + c_3}$$

The embodiment of the application introduces the structural similarity integrating the three features:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$

where $\mu_x$ is the mean of $x$, $\mu_y$ is the mean of $y$, $\sigma_x^2$ is the variance of $x$, $\sigma_y^2$ is the variance of $y$, $\sigma_{xy}$ is the covariance of $x$ and $y$, and $c_1$, $c_2$, $c_3$ are constants that keep the denominators from being 0.
To make salient object detection more accurate, the embodiment of the application also adopts an intersection-over-union (IoU) loss function. This function is commonly used in object detection scenes to measure the accuracy of detection boxes; it compares the degree of overlap between the predicted object and the real object and penalizes wrong boundaries.
The intersection-over-union loss function is expressed as follows:

$$L_{\mathrm{IoU}} = 1 - \frac{|P \cap G|}{|P \cup G|}$$

where $P$ is the predicted salient region and $G$ is the ground-truth salient region.
the objective function of the model is defined as the sum of the structural similarity loss function and the intersection-to-intersection ratio loss function. Finally, the salient object detection algorithm model is shown in fig. 10.
From the above description, the video conference implementation method provided by the application can train a salient object detection model.
In one embodiment, determining the salient region of each video frame image by using the pre-trained salient object detection model includes: performing binarization processing on the pixel points of each video frame image by using the salient object detection model to obtain the salient region.
In an embodiment, referring to fig. 3, grouping the video frame images and selecting from each group the video frame image with the largest salient region as a key frame image includes:
s301: determining the number of salient pixel points in the salient region of each video frame image;
s302: and selecting from the group the video frame image containing the most salient pixel points as the key frame image.
It can be understood that the embodiment of the present application may use the salient object detection algorithm to binarize the video frame images pixel by pixel and then count the number of salient pixel points in each video frame image. The image containing the most salient pixel points among adjacent frames is selected as the key frame image.
It should be noted that adjacent frames may be understood as the frames around a given frame; assuming the grouping rule is that every N frames form a group, a frame and its adjacent frames belong to the same group. In addition, the more salient pixel points a video frame image contains, the larger its salient region and the more important the image. Therefore, the embodiment of the present application takes the video frame image with the largest number of salient pixel points as the key frame image.
In the embodiment of the application, the image can first be binarized pixel by pixel using the salient object detection algorithm, that is, the pixel value of a salient pixel point is set to 255 and the pixel value of a non-salient pixel point is set to 0; the number of salient pixel points in each video frame image is then counted. Assuming N = 5, the image containing the most salient pixel points among the 5 video frame images can be selected as the key frame image of the group.
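A sketch of this selection rule (the 0.5 binarization threshold is an illustrative assumption):

```python
import numpy as np

def binarize(saliency_map, threshold=0.5):
    """Salient pixel points get value 255, non-salient ones get value 0."""
    return np.where(saliency_map > threshold, 255, 0).astype(np.uint8)

def select_key_frame(group_saliency_maps):
    """Index of the frame in the group with the most salient pixel points."""
    counts = [int((binarize(m) == 255).sum()) for m in group_saliency_maps]
    return int(np.argmax(counts))
```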
As can be seen from the above description, the video conference implementation method provided by the present application can group the video frame images and select from each group the video frame image with the largest salient region as the key frame image.
In an embodiment, referring to fig. 4, transmitting each key frame image according to the current maximum actual code rate of the network to complete the video conference includes:
s401: determining the current maximum actual code rate by using a congestion control algorithm;
s402: determining the transmission code rate of the salient region and the transmission code rate of the non-salient region in each key frame image according to the current maximum actual code rate;
s403: and transmitting the salient region according to the transmission code rate of the salient region, and transmitting the non-salient region according to the transmission code rate of the non-salient region, to complete the video conference.
It can be understood that, in the embodiment of the present application, a network traffic monitoring system may be used to acquire the actual code rate of the current network.
The network traffic monitoring system estimates the bandwidth with the mainstream congestion control algorithm Sendside-BWE: all code rate calculation modules are moved to the sending end, and a Trendline filter replaces the Kalman filter of the prior art. Actual measurements show that this algorithm estimates the code rate better and faster.
Specifically, with the Sendside-BWE congestion control algorithm adopted by mainstream implementations such as Google's: 1) when the sending end sends an RTP data packet, a transport-wide sequence number is set in the RTP header extension; 2) after the data packet arrives at the receiving end, the sequence number and the packet arrival time are recorded, and the receiving end returns the constructed feedback message to the sending end; 3) the sending end parses the message, executes the Sendside-BWE algorithm, and calculates the delay-based code rate Ar; 4) finally, Ar is compared with the packet-loss-based code rate As to obtain the final target code rate, forming a complete loop.
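The application does not spell out the loss-based update or the comparison rule, so the sketch below assumes the usual send-side BWE conventions:

```python
def loss_based_bitrate(prev_as, loss_fraction):
    """Update the packet-loss-based estimate As; the thresholds follow the
    common GCC heuristic and are an assumption, not taken from this patent."""
    if loss_fraction > 0.10:            # heavy loss: back off
        return prev_as * (1 - 0.5 * loss_fraction)
    if loss_fraction < 0.02:            # negligible loss: probe upward
        return prev_as * 1.05
    return prev_as                      # moderate loss: hold steady

def target_bitrate(ar_delay_based, as_loss_based):
    """Final target code rate from the delay-based estimate Ar and the
    loss-based estimate As; taking the minimum is the usual convention."""
    return min(ar_delay_based, as_loss_based)
```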
Specifically, in the embodiment of the present application, after the current maximum actual code rate is determined using the congestion control algorithm, the amount of data the current network can carry can be determined from that code rate. Different encoding parameters and different encoding methods are then set for the different regions (salient region and non-salient region) of the key frame image.
It can be understood that, during the transmission of video conference data, the usable data volume (network bandwidth) is limited. Considering the different significance of different regions, low-code-rate coding can be applied to the non-salient region and high-code-rate coding to the salient region, so that a clear video image can be transmitted as far as possible within the limited data volume.
The low-code-rate coding combines global motion compensation and local motion compensation, since global motion compensation suits background-region coding and local motion compensation suits locally complex regions. First, global motion estimation is performed on the input video frame image to obtain global motion model parameters; the reference point positions are then calculated, and the reference point motion vectors are differentially encoded to obtain the reference point code stream. At the encoder end, the encoded current frame is decoded, the reconstructed image is stored in a buffer, and the current decoded image is transformed according to the obtained global motion estimation parameters for the global motion compensation of the next frame.
The high-code-rate coding uses the H.265 coding mode, which adds quad-tree-based block division to traditional coding technology and adopts variable-size DCT transforms and other techniques, achieving a higher frame rate, a higher compression rate, and higher definition.
If the available data volume (network bandwidth) is insufficient for the data volume of the current key frame image, the salient region is encoded at a high code rate and allocated a certain proportion (for example, 0.8 times) of the total data volume, while the non-salient region is encoded at a low code rate and allocated a certain proportion (for example, 0.2 times) of the total data volume. If the available data volume is larger than the data volume of the current key frame image, the normal code rate is used for encoding. The encoded data can be placed into the sending buffer, and the video conference data sending thread is started to send the video conference data; the video conference terminal receives the video data, decodes it, and plays the video conference images to complete the video conference.
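A sketch of this allocation rule, with the 0.8/0.2 split mirroring the example proportions above:

```python
def region_code_rates(available_rate, frame_data_rate, salient_share=0.8):
    """Per-region encoding rates for one key frame image.

    When the network cannot carry the whole frame, the salient region is
    encoded at a high code rate with the salient share (e.g. 0.8x) of the
    available data volume and the non-salient region at a low code rate
    with the rest (e.g. 0.2x); otherwise both use the normal code rate.
    """
    if available_rate < frame_data_rate:
        return {"salient": salient_share * available_rate,
                "non_salient": (1 - salient_share) * available_rate}
    return {"salient": frame_data_rate, "non_salient": frame_data_rate}
```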
From the above description, the video conference implementation method provided by the application can transmit each key frame image according to the current maximum actual code rate of the network to complete the video conference.
Based on the same inventive concept, the embodiment of the present application further provides a video conference implementation apparatus, which can be used to implement the methods described in the foregoing embodiments, as described in the following embodiments. Because the principle on which the video conference implementation apparatus solves the problems is similar to that of the video conference implementation method, the implementation of the apparatus can refer to the implementation of the method, and repeated parts are not described again. As used hereinafter, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. While the system described in the embodiments below is preferably implemented in software, implementations in hardware, or a combination of software and hardware, are also possible and contemplated.
In an embodiment, referring to fig. 5, in order to improve the video conference experience under low-bandwidth conditions, adjust the transmission amount of video data in real time, reduce network congestion, and highlight important content, the present application provides a video conference implementation apparatus, including: a salient region determining unit 501, a key frame selecting unit 502, and a video conference implementation unit 503.
A salient region determining unit 501, configured to determine the salient region of each video frame image by using a pre-trained salient object detection model;
a key frame selecting unit 502, configured to group the video frame images and select from each group the video frame image with the largest salient region as a key frame image;
and a video conference implementation unit 503, configured to transmit each key frame image according to the current maximum actual code rate of the network to complete the video conference.
In an embodiment, referring to fig. 6, the video conference implementation apparatus further includes: an image feature learning unit 601, a parameter updating unit 602, and a model generating unit 603.
An image feature learning unit 601, configured to learn image features of historical video frame images using a residual network model; the image features comprise contrast features, brightness features and pixel features;
a parameter updating unit 602, configured to update neuron parameters in the residual network model according to the learned image features;
and a model generating unit 603, configured to perform semantic-dimension learning according to the updated neuron parameters and improve the detection precision of the residual network model, obtaining the salient object detection model.
In an embodiment, the salient region determining unit is specifically configured to perform binarization processing on pixel points of each video frame image by using the salient object detection model to obtain the salient region.
In an embodiment, referring to fig. 7, the key frame selecting unit 502 includes: a pixel point number determining module 701 and a key frame selecting module 702.
A pixel point number determining module 701, configured to determine the number of salient pixel points in a salient region of each video frame image respectively;
a key frame selecting module 702, configured to select from the group the video frame image with the largest number of salient pixel points as the key frame image.
In an embodiment, referring to fig. 8, the video conference implementation unit 503 includes:
an actual code rate determining module 801, configured to determine the current maximum actual code rate by using a congestion control algorithm;
a transmission code rate determining module 802, configured to determine, according to the current maximum actual code rate, the transmission code rate of the salient region and the transmission code rate of the non-salient region in each key frame image;
a transmission module 803, configured to transmit the salient region according to the transmission code rate of the salient region and transmit the non-salient region according to the transmission code rate of the non-salient region, to complete the video conference.
In terms of hardware, in order to improve the video conference experience under a low bandwidth condition, adjust the transmission amount of video data in real time, reduce network congestion, and highlight important contents, the present application provides an embodiment of an electronic device for implementing all or part of the contents in the video conference implementation method, where the electronic device specifically includes the following contents:
a processor, a memory, a communication interface, and a bus. The processor, the memory, and the communication interface communicate with one another through the bus. The communication interface is used to realize information transmission between the video conference implementation apparatus and related equipment such as a core service system, user terminals, and related databases. The electronic device may be a desktop computer, a tablet computer, a mobile terminal, or the like, but the embodiment is not limited thereto. In this embodiment, the electronic device may be implemented with reference to the embodiment of the video conference implementation method and the embodiment of the video conference implementation apparatus, the contents of which are incorporated herein; repeated descriptions are omitted.
It is understood that the user terminal may include a smart phone, a tablet electronic device, a network set-top box, a portable computer, a desktop computer, a Personal Digital Assistant (PDA), an in-vehicle device, a smart wearable device, and the like. Wherein, intelligence wearing equipment can include intelligent glasses, intelligent wrist-watch, intelligent bracelet etc..
In practical applications, part of the video conference implementation method may be executed on the electronic device side as described in the above, or all operations may be completed in the client device. The selection may be specifically performed according to the processing capability of the client device, the limitation of the user usage scenario, and the like. This is not a limitation of the present application. The client device may further include a processor if all operations are performed in the client device.
The client device may have a communication module (i.e., a communication unit), and may be in communication connection with a remote server to implement data transmission with the server. The server may include a server on the side of the task scheduling center, and in other implementation scenarios, the server may also include a server on an intermediate platform, for example, a server on a third-party server platform that is communicatively linked to the task scheduling center server. The server may include a single computer device, or may include a server cluster formed by a plurality of servers, or a server structure of a distributed apparatus.
Fig. 9 is a schematic block diagram of a system configuration of an electronic device 9600 according to an embodiment of the present application. As shown in fig. 9, the electronic device 9600 can include a central processor 9100 and a memory 9140; the memory 9140 is coupled to the central processor 9100. Notably, this fig. 9 is exemplary; other types of structures may also be used in addition to or in place of the structure to implement telecommunications or other functions.
In one embodiment, the video conference implementation functionality may be integrated into the central processor 9100. The central processor 9100 may be configured to control as follows:
s101: determining a salient region of each video frame image by using a pre-trained salient object detection model;
s102: grouping the video frame images, and respectively selecting the video frame image with the largest salient region from each group as a key frame image;
s103: and transmitting each key frame image according to the current maximum actual code rate of the network to complete the video conference.
From the above description, the video conference implementation method provided by the application processes the video to be transmitted in real time using a salient object detection algorithm and extracts key frames. It can still ensure the smooth progress of the video conference under low bandwidth and network fluctuation, improve the video conference experience under those conditions, adjust the video data transmission amount in real time, reduce network congestion, and highlight key content.
In another embodiment, the video conference implementation apparatus may be configured separately from the central processor 9100; for example, it may be configured as a chip connected to the central processor 9100, with the function of the video conference implementation method realized under the control of the central processor.
As shown in fig. 9, the electronic device 9600 may further include: a communication module 9110, an input unit 9120, an audio processor 9130, a display 9160, and a power supply 9170. It is noted that the electronic device 9600 also does not necessarily include all of the components shown in fig. 9; in addition, the electronic device 9600 may further include components not shown in fig. 9, which may be referred to in the prior art.
As shown in fig. 9, a central processor 9100, sometimes referred to as a controller or operational control, can include a microprocessor or other processor device and/or logic device, which central processor 9100 receives input and controls the operation of the various components of the electronic device 9600.
The memory 9140 can be, for example, one or more of a buffer, a flash memory, a hard drive, a removable medium, a volatile memory, a non-volatile memory, or another suitable device. It may store information as well as the programs that process that information, and the central processor 9100 can execute a program stored in the memory 9140 to realize information storage or processing.
The input unit 9120 provides input to the central processor 9100. The input unit 9120 is, for example, a key or a touch input device. Power supply 9170 is used to provide power to electronic device 9600. The display 9160 is used for displaying display objects such as images and characters. The display may be, for example, an LCD display, but is not limited thereto.
The memory 9140 can be a solid-state memory, e.g., read-only memory (ROM), random access memory (RAM), a SIM card, or the like. It may also be a memory that holds information even when powered off, that can be selectively erased, and that can be provided with more data; an example of such a memory is sometimes called an EPROM or the like. The memory 9140 could also be some other type of device. The memory 9140 includes a buffer memory 9141 (sometimes referred to as a buffer) and may include an application/function storage portion 9142, used for storing application programs and function programs or for executing the operation flow of the electronic device 9600 through the central processor 9100.
The memory 9140 can also include a data store 9143, the data store 9143 being used to store data, such as contacts, digital data, pictures, sounds, and/or any other data used by an electronic device. The driver storage portion 9144 of the memory 9140 may include various drivers for the electronic device for communication functions and/or for performing other functions of the electronic device (e.g., messaging applications, contact book applications, etc.).
The communication module 9110 is a transmitter/receiver 9110 that transmits and receives signals via an antenna 9111. The communication module (transmitter/receiver) 9110 is coupled to the central processor 9100 to provide input signals and receive output signals, which may be the same as in the case of a conventional mobile communication terminal.
Based on different communication technologies, a plurality of communication modules 9110, such as a cellular network module, a bluetooth module, and/or a wireless lan module, may be disposed in the same electronic device. The communication module (transmitter/receiver) 9110 is also coupled to a speaker 9131 and a microphone 9132 via an audio processor 9130 to provide audio output via the speaker 9131 and receive audio input from the microphone 9132, thereby implementing ordinary telecommunications functions. The audio processor 9130 may include any suitable buffers, decoders, amplifiers and so forth. In addition, the audio processor 9130 is also coupled to the central processor 9100, thereby enabling recording locally through the microphone 9132 and enabling locally stored sounds to be played through the speaker 9131.
An embodiment of the present application further provides a computer-readable storage medium capable of implementing all the steps of the video conference implementation method whose execution subject is the server or the client in the foregoing embodiments. The computer-readable storage medium stores a computer program which, when executed by a processor, implements all the steps of that method; for example, when the processor executes the computer program, the following steps are implemented:
s101: determining a salient region of each video frame image by using a pre-trained salient object detection model;
s102: grouping the video frame images, and respectively selecting the video frame image with the largest salient region from each group as a key frame image;
s103: and transmitting each key frame image according to the current maximum actual code rate of the network to complete the video conference.
From the above description, the video conference implementation method provided by the application processes the video to be transmitted in real time using a salient object detection algorithm and extracts key frames. It can still ensure the smooth progress of the video conference under low bandwidth and network fluctuation, improve the video conference experience under those conditions, adjust the video data transmission amount in real time, reduce network congestion, and highlight key content.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The principle and implementation of the invention are explained above using specific embodiments, and the description of the embodiments is only intended to help understand the method and core idea of the invention. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as a limitation of the present invention.

Claims (12)

1. A video conference implementation method is characterized by comprising the following steps:
determining a salient region of each video frame image by using a pre-trained salient object detection model;
grouping the video frame images, and respectively selecting the video frame image with the largest salient region from each group as a key frame image;
and transmitting each key frame image according to the current maximum actual code rate of the network to complete the video conference.
2. The method of claim 1, wherein the step of pre-training the salient object detection model comprises:
learning image features of historical video frame images by using a residual network model; the image features comprise contrast features, brightness features and pixel features;
updating neuron parameters in the residual network model according to the learned image features;
and performing semantic-dimension learning according to the updated neuron parameters to improve the detection precision of the residual network model, obtaining the salient object detection model.
3. The method of claim 1, wherein the determining the salient region of each video frame image by using a pre-trained salient object detection model comprises:
and carrying out binarization processing on pixel points of each video frame image by using the salient object detection model to obtain the salient region.
4. The method of claim 1, wherein the grouping the video frame images and selecting the video frame image with the largest salient region from the groups as a key frame image comprises:
determining the number of salient pixel points in the salient region of each video frame image;
and selecting from the group the video frame image containing the most salient pixel points as the key frame image.
5. The method of claim 1, wherein transmitting each key frame image according to the current maximum actual code rate of the network to complete the video conference comprises:
determining the current maximum actual code rate by using a congestion control algorithm;
determining the transmission code rate of the salient region and the transmission code rate of the non-salient region in each key frame image according to the current maximum actual code rate;
and transmitting the salient region according to the transmission code rate of the salient region, and transmitting the non-salient region according to the transmission code rate of the non-salient region, to complete the video conference.
6. A video conference realization apparatus, comprising:
the salient region determining unit is used for determining the salient region of each video frame image by utilizing a pre-trained salient object detection model;
the key frame selecting unit is used for grouping the video frame images and respectively selecting the video frame image with the largest salient region from each group as a key frame image;
and the video conference realization unit is used for transmitting each key frame image according to the current maximum actual code rate of the network, thereby realizing the video conference.
7. The video conference realization apparatus according to claim 6, further comprising:
the image feature learning unit is used for learning image features of historical video frame images by using a residual network model, wherein the image features comprise contrast features, brightness features and pixel features;
the parameter updating unit is used for updating the neuron parameters of the residual network model according to the learned image features;
and the model generation unit is used for performing semantic-dimension learning according to the updated neuron parameters to improve the detection precision of the residual network model, thereby obtaining the salient object detection model.
8. The apparatus according to claim 6, wherein the salient region determining unit is specifically configured to perform binarization processing on pixel points of each video frame image by using the salient object detection model to obtain the salient region.
9. The apparatus as claimed in claim 6, wherein the key frame selecting unit comprises:
the pixel point number determining module is used for determining, for each video frame image, the number of salient pixel points in its salient region;
and the key frame selecting module is used for selecting from each group the video frame image containing the most salient pixel points as the key frame image.
10. The apparatus of claim 6, wherein the video conference realization unit comprises:
an actual code rate determining module, configured to determine the current maximum actual code rate by using a congestion control algorithm;
a transmission code rate determining module, configured to determine, according to the current maximum actual code rate, the transmission code rate of the salient region and the transmission code rate of the non-salient region in each key frame image;
and the transmission module is used for transmitting the salient region according to the transmission code rate of the salient region and transmitting the non-salient region according to the transmission code rate of the non-salient region, thereby realizing the video conference.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the video conference realization method of any one of claims 1 to 5 when executing the program.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the video conference realization method of any one of claims 1 to 5.
CN202111160016.5A (priority date 2021-09-30; filing date 2021-09-30): Video conference realization method and device. Status: Pending. Published as CN113923398A (en).

Priority Applications (1)

CN202111160016.5A (priority date 2021-09-30; filing date 2021-09-30): Video conference realization method and device

Publications (1)

CN113923398A (en), published 2022-01-11

Family

ID=79237459

Family Applications (1)

CN202111160016.5A (pending; published as CN113923398A (en)): Video conference realization method and device; priority date 2021-09-30; filing date 2021-09-30

Country Status (1)

CN: CN113923398A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103458238A (en) * 2012-11-14 2013-12-18 深圳信息职业技术学院 Scalable video code rate controlling method and device combined with visual perception
CN103873876A (en) * 2014-03-17 2014-06-18 天津大学 Conspicuousness-based multi-viewpoint color plus depth video coding method
CN105049850A (en) * 2015-03-24 2015-11-11 上海大学 HEVC (High Efficiency Video Coding) code rate control method based on region-of-interest
CN106604031A (en) * 2016-11-22 2017-04-26 金华就约我吧网络科技有限公司 Region of interest-based H. 265 video quality improvement method
CN107396108A (en) * 2017-08-15 2017-11-24 西安万像电子科技有限公司 Code rate allocation method and device
CN110602548A (en) * 2019-09-20 2019-12-20 北京市博汇科技股份有限公司 Method and system for high-quality wireless transmission of ultra-high-definition video
CN110708507A (en) * 2019-09-23 2020-01-17 深圳市景阳信息技术有限公司 Monitoring video data transmission method and device and terminal equipment
WO2020112452A1 (en) * 2018-11-30 2020-06-04 Motorola Solutions, Inc. Device, system and method for providing audio summarization data from video
CN113111782A (en) * 2021-04-14 2021-07-13 中国工商银行股份有限公司 Video monitoring method and device based on salient object detection

Similar Documents

Publication Title
CN111340711B (en) Super-resolution reconstruction method, device, equipment and storage medium
CN111479112B (en) Video coding method, device, equipment and storage medium
US11025959B2 (en) Probabilistic model to compress images for three-dimensional video
US10491711B2 (en) Adaptive streaming of virtual reality data
US20190246096A1 (en) Behavioral Directional Encoding of Three-Dimensional Video
JP2020010331A (en) Method for improving image quality
CN106576158A (en) Immersive video
CN106170979A (en) Constant Quality video encodes
CN110996131B (en) Video encoding method, video encoding device, computer equipment and storage medium
KR102480709B1 (en) Method and apparatus for determining quality of experience of vr multi-media
CN112040222B (en) Visual saliency prediction method and equipment
CN110969572B (en) Face changing model training method, face exchange device and electronic equipment
WO2022000298A1 (en) Reinforcement learning based rate control
CN115037962A (en) Video adaptive transmission method, device, terminal equipment and storage medium
CN113747160B (en) Video coding configuration method, device, equipment and computer readable storage medium
JP2024511103A (en) Method and apparatus for evaluating the quality of an image or video based on approximate values, method and apparatus for training a first model, electronic equipment, storage medium, and computer program
CN112492323B (en) Live broadcast mask generation method, readable storage medium and computer equipment
CN112165598A (en) Data processing method, device, terminal and storage medium
CN113923398A (en) Video conference realization method and device
CN115499666A (en) Video compression method, video decompression method, video compression device, video decompression device, and storage medium
CN112200816A (en) Method, device and equipment for segmenting region of video image and replacing hair
CN116708793B (en) Video transmission method, device, equipment and storage medium
CN116939254A (en) Video stream transmission method, device, computer equipment and storage medium
CN116010697A (en) Data processing method, electronic device and storage medium
CN116419032A (en) Video playing method, device, equipment and computer readable storage medium

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20220111)