CN113923398A - Video conference realization method and device - Google Patents

Video conference realization method and device

Info

Publication number
CN113923398A
Authority
CN
China
Prior art keywords
frame image
code rate
video
video conference
salient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111160016.5A
Other languages
Chinese (zh)
Inventor
Wang Letian (王乐天)
Tang Zhongzhe (汤仲喆)
Duan Xiaoyan (段小燕)
Sun Menglei (孙孟雷)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202111160016.5A priority Critical patent/CN113923398A/en
Publication of CN113923398A publication Critical patent/CN113923398A/en
Pending legal-status Critical Current


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/15 Conference systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/061 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/194 Segmentation; Edge detection involving foreground-background segmentation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/25 Flow control; Congestion control with rate being modified by the source upon detecting a change of network conditions
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/48 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using compressed domain processing techniques other than decoding, e.g. modification of transform coefficients, variable length coding [VLC] data or run-length data
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51 Motion estimation or motion compensation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51 Motion estimation or motion compensation
    • H04N19/537 Motion estimation other than block-based
    • H04N19/543 Motion estimation other than block-based using regions

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Neurology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application provides a video conference realization method and device, which relate to the field of artificial intelligence and can also be used in the field of finance. The method comprises the following steps: determining a salient region of each video frame image by using a pre-trained salient object detection model; grouping the video frame images and selecting, from each group, the video frame image with the largest salient region as a key frame image; and transmitting each key frame image according to the current maximum actual code rate of the network to complete the video conference. The method and device can improve the video conference experience under low-bandwidth conditions, adjust the video data transmission amount in real time, reduce network congestion, and highlight key content.

Description

Video conference realization method and device
Technical Field
The application relates to the field of artificial intelligence and can be used in the field of finance; in particular, it relates to a video conference implementation method and device.
Background
With the development of the internet, video conference technology has matured. Compared with an offline conference, a video conference is free from spatial limitations and allows more efficient real-time communication. However, video conferencing still has technical problems that affect the actual communication effect: for example, when the network request volume is too large and the bandwidth is insufficient, unsynchronized sound and picture and choppy video easily occur.
In view of the above phenomena, the skilled person realizes that only some foreground-region images in the conference video, such as human faces and slide contents, must be displayed. Most background-region images bring no useful information to the participants and waste a large amount of bandwidth. Worse still, the display proportion occupied by the background-region image is often much larger than that occupied by the salient foreground-region image. How to segment the salient foreground-region image from the background-region image, and thereby reduce the unnecessary data transmission volume, is a technical problem to be solved.
Disclosure of Invention
Aiming at the problems in the prior art, the application provides a video conference implementation method and device that can improve the video conference experience under low-bandwidth conditions, adjust the video data transmission amount in real time, reduce network congestion, and highlight key content.
In order to solve the technical problem, the application provides the following technical scheme:
in a first aspect, the present application provides a method for implementing a video conference, including:
determining a salient region of each video frame image by using a pre-trained salient object detection model;
grouping the video frame images, and selecting from each group the video frame image with the largest salient region as a key frame image;
and transmitting each key frame image according to the current maximum actual code rate of the network to complete the video conference.
Further, the step of training the salient object detection model in advance includes:
learning image features of historical video frame images by using a residual network model; the image features comprise contrast features, brightness features and pixel features;
updating neuron parameters in the residual network model according to the learned image features;
and performing semantic-dimension learning according to the updated neuron parameters to improve the detection precision of the residual network model, obtaining the salient object detection model.
Further, determining the salient region of each video frame image by using the pre-trained salient object detection model includes:
performing binarization processing on the pixel points of each video frame image by using the salient object detection model to obtain the salient region.
Further, grouping the video frame images and selecting from each group the video frame image with the largest salient region as a key frame image includes:
determining the number of salient pixel points in the salient region of each video frame image;
and selecting from the group the video frame image containing the most salient pixel points as the key frame image.
Further, transmitting each key frame image according to the current maximum actual code rate of the network to complete the video conference includes:
determining the current maximum actual code rate by using a congestion control algorithm;
determining the transmission code rate of the salient region and the transmission code rate of the non-salient region in each key frame image according to the current maximum actual code rate;
and transmitting the salient region according to the transmission code rate of the salient region, and transmitting the non-salient region according to the transmission code rate of the non-salient region, to complete the video conference.
In a second aspect, the present application provides a video conference implementation apparatus, including:
a salient region determining unit, configured to determine the salient region of each video frame image by using a pre-trained salient object detection model;
a key frame selecting unit, configured to group the video frame images and select from each group the video frame image with the largest salient region as a key frame image;
and a video conference implementation unit, configured to transmit each key frame image according to the current maximum actual code rate of the network to complete the video conference.
Further, the video conference implementation apparatus further includes:
an image feature learning unit, configured to learn image features of historical video frame images by using a residual network model; the image features comprise contrast features, brightness features and pixel features;
a parameter updating unit, configured to update neuron parameters in the residual network model according to the learned image features;
and a model generation unit, configured to perform semantic-dimension learning according to the updated neuron parameters and improve the detection precision of the residual network model, obtaining the salient object detection model.
Further, the salient region determining unit is specifically configured to perform binarization processing on pixel points of each video frame image by using the salient object detection model to obtain the salient region.
Further, the key frame selecting unit includes:
a pixel point number determining module, configured to determine the number of salient pixel points in the salient region of each video frame image;
and a key frame selecting module, configured to select from the group the video frame image containing the most salient pixel points as the key frame image.
Further, the video conference implementation unit includes:
an actual code rate determining module, configured to determine the current maximum actual code rate by using a congestion control algorithm;
a transmission code rate determining module, configured to determine, according to the current maximum actual code rate, the transmission code rate of the salient region and the transmission code rate of the non-salient region in each key frame image;
and a transmission module, configured to transmit the salient region according to the transmission code rate of the salient region and transmit the non-salient region according to the transmission code rate of the non-salient region, to complete the video conference.
In a third aspect, the present application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the video conference implementation method when executing the program.
In a fourth aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the video conference implementation method.
Aiming at the problems in the prior art, the video conference implementation method and device provided by the application process the video to be transmitted in real time using a salient object detection algorithm and extract key frames. They can still ensure the smooth progress of the video conference under low bandwidth and network fluctuation, improve the video conference experience under those conditions, adjust the video data transmission amount in real time, reduce network congestion, and highlight key content.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of a video conference implementation method in an embodiment of the present application;
FIG. 2 is a flowchart of the steps for training a salient object detection model in an embodiment of the present application;
FIG. 3 is a flow chart of determining a key frame image in an embodiment of the present application;
FIG. 4 is a flow chart of completing a video conference in an embodiment of the present application;
fig. 5 is one of the structural diagrams of the video conference realization apparatus in the embodiment of the present application;
fig. 6 is a second block diagram of a video conference realizing apparatus in the embodiment of the present application;
FIG. 7 is a block diagram of a key frame selection unit according to an embodiment of the present disclosure;
fig. 8 is a structural diagram of a video conference implementing unit in the embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device in an embodiment of the present application;
fig. 10 is a schematic diagram of a salient object detection algorithm model in the embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
When a video conference is carried out and the network resource request volume is too large while the actual bandwidth is insufficient, unsynchronized sound and picture and choppy video easily occur. In view of the above phenomena, the skilled person realizes that only some foreground-region images in the conference video, such as human faces and slide contents, must be displayed. Most background-region images bring no useful information to the participants and waste a large amount of bandwidth. Worse still, the display proportion occupied by the background-region image is often much larger than that occupied by the salient foreground-region image. How to segment the salient foreground-region image from the background-region image, and thereby reduce the unnecessary data transmission volume, is a technical problem to be solved.
The human visual system possesses an important mechanism for processing visual information: the visual attention mechanism, by which the human eye usually locates only the regions of interest to the viewer. When people observe a scene, the visual system subconsciously guides the eyeballs and preferentially processes the region of interest; this is the visual attention mechanism at work. Using it, the human visual system can process the regions of different scenes separately and extract the foreground and the background, greatly reducing the amount of visual data to be processed and the brain's computational consumption. A salient object detection algorithm simulates the human visual attention mechanism and computes how strongly each part of an image attracts visual attention, so that image information can be processed efficiently. With salient object detection, an image can be locally compressed, the data transmission amount adjusted, and the problem of choppy video alleviated.
It should be noted that the video conference implementation method and apparatus provided by the present application may be used in the financial field, and may also be used in any field other than the financial field.
In an embodiment, referring to fig. 1, in order to improve the video conference experience under the low bandwidth condition, adjust the transmission amount of video data in real time, reduce network congestion, and highlight important content, the present application provides a video conference implementation method, including:
s101: determining a salient region of each video frame image by using a pre-trained salient object detection model;
s102: grouping the video frame images, and respectively selecting the video frame image with the largest salient region from each group as a key frame image;
s103: and transmitting each key frame image according to the current maximum actual code rate of the network to complete the video conference.
It can be understood that the embodiment of the present application may divide the video conference data into a plurality of video frame images, and then determine, by using a pre-trained salient object detection model, the region in each video frame image that interests people most and most needs to be transmitted, that is, the salient region. The role of the salient object detection model at least includes determining the salient regions of the video frame images. The model is obtained by pre-training on historical video frame images; the specific training process is explained in detail below.
After the salient regions of the video frame images are determined, the video frame images may be grouped, that is, the video conference data stream may be divided into equal portions. For example, every five frames or every ten frames may form a group, but the application is not limited thereto. The video frame image with the largest salient region is then selected from each group as the key frame image. The purpose is to select, from several adjacent video frame images, the one frame that interests people most and most needs to be transmitted, thereby reducing the transmission data volume and improving transmission efficiency.
Finally, the current maximum actual code rate of the network for transmitting the video frame images is determined by using a congestion control algorithm, so that each key frame image is transmitted according to the current maximum actual code rate of the network to complete the video conference.
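For illustration, the three steps can be sketched as follows in Python; the detector, bandwidth estimator, transmit function, and the group size of 5 are placeholders assumed for the example, not this application's actual code.

```python
# Minimal sketch of steps S101-S103; all callables are illustrative placeholders.

def video_conference_pipeline(frames, detect_salient_region,
                              estimate_max_bitrate, transmit, group_size=5):
    """frames: list of video frame images (e.g. numpy arrays)."""
    # S101: salient region (binary mask) of each video frame image
    masks = [detect_salient_region(f) for f in frames]

    # S102: group the frames; per group, keep the frame whose salient
    # region is largest (most salient pixel points)
    key_frames = []
    for start in range(0, len(frames), group_size):
        group = list(zip(frames[start:start + group_size],
                         masks[start:start + group_size]))
        key_frames.append(max(group, key=lambda fm: int(fm[1].sum())))

    # S103: transmit each key frame image at the current maximum
    # actual code rate of the network
    for frame, mask in key_frames:
        transmit(frame, mask, max_rate=estimate_max_bitrate())
```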
From the above description, the video conference implementation method provided by the application processes the video to be transmitted in real time using a salient object detection algorithm and extracts key frames. It can still ensure the smooth progress of the video conference under low bandwidth and network fluctuation, improve the video conference experience under those conditions, adjust the video data transmission amount in real time, reduce network congestion, and highlight key content.
In one embodiment, referring to fig. 2, the step of training the salient object detection model in advance includes:
s201: learning image features of historical video frame images by using a residual network model; the image features comprise contrast features, brightness features and pixel features;
s202: updating neuron parameters in the residual network model according to the learned image features;
s203: and performing semantic-dimension learning according to the updated neuron parameters to improve the detection precision of the residual network model, obtaining the salient object detection model.
It can be understood that, in a video conference, the video conference data needs to be compressed for efficient transmission. In the embodiment of the application, compression is achieved by selecting key frame images, and selecting a key frame image can be understood as extracting a video frame image. Conventionally, image extraction takes 1 frame out of every N frames in the order in which the video frame images appear in the video conference. However, this extraction method only achieves basic data compression and cannot guarantee that the extracted image is the most meaningful frame of the N frames. Therefore, the embodiment of the present application trains a salient object detection model so that the 1 frame of most interest can be selected from the N video frame images as the key frame image.
The specific training steps are as follows:
Firstly, a residual network model is used to learn the image features of historical video frame images in advance, which improves the training speed.
A convolutional neural network mainly comprises convolutional layers, pooling layers, activation function layers, and fully-connected layers. The convolutional layer performs convolution operations with convolution kernels to extract data features. The pooling layer compresses the input data, reducing the parameter count. The activation function layer increases the ability of the whole convolutional neural network to fit complex non-linear cases. The fully-connected layer maps the input data into different categories and finally outputs the network's predicted values. A residual network model passes the front layer to the rear layer through identity mappings in shortcut connections, which is equivalent to composing a shallow network with an identity mapping function; each block therefore only has to learn the residual function to achieve its part of the training goal. Because this adds no extra computational burden to networks with many layers, residual networks are often used in image processing and similar scenarios.
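As an illustration of this idea only (the patent does not give network code; the layer sizes here are assumed), a residual block can be sketched in PyTorch as follows:

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Shortcut connection: the identity mapping carries the front layer's
    output to the rear layer, so the block only learns the residual F(x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        residual = self.conv2(F.relu(self.conv1(x)))
        return F.relu(x + residual)  # identity mapping + learned residual
```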
To learn the image features of historical video frame images in advance and reduce the time required for model training, the embodiment of the present application uses parameters pre-trained on large data sets as the initial parameters for model training.
Secondly, the features of different layers are combined through a multi-scale fusion module, driving the neuron parameters in the neural network to vary over a larger range.
The skip connections of the multi-scale fusion module reuse the features of previous layers and improve gradient back-propagation; previous layers receive additional supervision from the loss function over a short path, making the convolutional neural network easier to train. To enlarge the receptive field, the embodiment of the application adopts four asymmetric convolution kernels of different scales; the effect of asymmetric convolution approximates that of a square convolution while greatly reducing the parameter count. It should be noted that not all parameters in a square convolution kernel are equally significant: the parameters at the center-cross positions matter more, while those at the corner positions matter less. The parameters at the center-cross positions can be enhanced using horizontal and vertical one-dimensional convolutions. The neuron parameters in the residual network model can then be updated according to the learned image features.
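A sketch of such an asymmetric multi-scale fusion module is given below; the four kernel scales and the channel counts are illustrative assumptions rather than the values used by the application.

```python
import torch.nn as nn

class AsymmetricConv(nn.Module):
    """A 1 x k convolution followed by a k x 1 convolution: approximates a
    k x k square convolution with 2k instead of k*k weights per channel
    pair, while the horizontal and vertical passes reinforce the more
    important center-cross positions of the kernel."""
    def __init__(self, channels, k):
        super().__init__()
        self.horizontal = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2))
        self.vertical = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0))

    def forward(self, x):
        return self.vertical(self.horizontal(x))

class MultiScaleFusion(nn.Module):
    """Four asymmetric branches of different scales plus a skip connection
    that reuses the previous layer's features, shortening the path from
    the loss function back to earlier layers."""
    def __init__(self, channels, scales=(3, 5, 7, 9)):
        super().__init__()
        self.branches = nn.ModuleList(AsymmetricConv(channels, k) for k in scales)

    def forward(self, x):
        return x + sum(branch(x) for branch in self.branches)
```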
Thirdly, high-dimensional spatial feature vectors of the semantic dimension are learned by decomposing visual features.
The quality of an image can be judged along various dimensions. Among them, the dimensions preferred for modeling human visual perception are the luminance features, the contrast features, and the structural features.
The luminance feature can be expressed as:

$$l(x, y) = \frac{2\mu_x\mu_y + c_1}{\mu_x^2 + \mu_y^2 + c_1}$$

The contrast feature can be expressed as:

$$c(x, y) = \frac{2\sigma_x\sigma_y + c_2}{\sigma_x^2 + \sigma_y^2 + c_2}$$

The structural feature can be expressed as:

$$s(x, y) = \frac{\sigma_{xy} + c_3}{\sigma_x\sigma_y + c_3}$$

The embodiment of the application introduces the structural similarity integrating the three features:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$

where $\mu_x$ is the mean of $x$, $\mu_y$ is the mean of $y$, $\sigma_x^2$ is the variance of $x$, $\sigma_y^2$ is the variance of $y$, $\sigma_{xy}$ is the covariance of $x$ and $y$, and $c_1$, $c_2$, $c_3$ are constants that keep the denominators from being 0.
To make salient object detection more accurate, the embodiment of the application also adopts an intersection-over-union (IoU) loss function. This function is commonly used in object detection scenes to measure the accuracy of detection boxes; it compares the degree of overlap between the predicted object and the real object and penalizes wrong boundaries.
The intersection-over-union loss function is expressed as follows:

$$L_{\mathrm{IoU}} = 1 - \frac{|P \cap G|}{|P \cup G|}$$

where $P$ is the predicted salient region and $G$ is the ground-truth salient region.
the objective function of the model is defined as the sum of the structural similarity loss function and the intersection-to-intersection ratio loss function. Finally, the salient object detection algorithm model is shown in fig. 10.
From the above description, the video conference implementation method provided by the application can train a salient object detection model.
In one embodiment, determining the salient region of each video frame image by using the pre-trained salient object detection model includes: performing binarization processing on the pixel points of each video frame image by using the salient object detection model to obtain the salient region.
In an embodiment, referring to fig. 3, grouping the video frame images and selecting from each group the video frame image with the largest salient region as a key frame image includes:
s301: determining the number of salient pixel points in the salient region of each video frame image;
s302: and selecting from the group the video frame image containing the most salient pixel points as the key frame image.
It can be understood that the embodiment of the present application may use the salient object detection algorithm to binarize the video frame images pixel by pixel and then count the number of salient pixel points in each video frame image. The image containing the most salient pixel points among adjacent frames is selected as the key frame image.
It should be noted that adjacent frames may be understood as the frames around a given frame; assuming the grouping rule is that every N frames form a group, a frame and its adjacent frames belong to the same group. In addition, the more salient pixel points a video frame image contains, the larger its salient region and the more important the image. Therefore, the embodiment of the present application takes the video frame image with the largest number of salient pixel points as the key frame image.
In the embodiment of the application, the image can first be binarized pixel by pixel using the salient object detection algorithm, that is, the pixel value of a salient pixel point is set to 255 and the pixel value of a non-salient pixel point is set to 0; the number of salient pixel points in each video frame image is then counted. Assuming N = 5, the image containing the most salient pixel points among the 5 video frame images can be selected as the key frame image of the group.
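A sketch of this selection rule (the 0.5 binarization threshold is an illustrative assumption):

```python
import numpy as np

def binarize(saliency_map, threshold=0.5):
    """Salient pixel points get value 255, non-salient ones get value 0."""
    return np.where(saliency_map > threshold, 255, 0).astype(np.uint8)

def select_key_frame(group_saliency_maps):
    """Index of the frame in the group with the most salient pixel points."""
    counts = [int((binarize(m) == 255).sum()) for m in group_saliency_maps]
    return int(np.argmax(counts))
```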
As can be seen from the above description, the video conference implementation method provided by the present application can group the video frame images and select from each group the video frame image with the largest salient region as the key frame image.
In an embodiment, referring to fig. 4, transmitting each key frame image according to the current maximum actual code rate of the network to complete the video conference includes:
s401: determining the current maximum actual code rate by using a congestion control algorithm;
s402: determining the transmission code rate of the salient region and the transmission code rate of the non-salient region in each key frame image according to the current maximum actual code rate;
s403: and transmitting the salient region according to the transmission code rate of the salient region, and transmitting the non-salient region according to the transmission code rate of the non-salient region, to complete the video conference.
It can be understood that, in the embodiment of the present application, a network traffic monitoring system may be used to acquire the actual code rate of the current network.
The network traffic monitoring system estimates the bandwidth with the mainstream congestion control algorithm Sendside-BWE: all code rate calculation modules are moved to the sending end, and a Trendline filter replaces the Kalman filter of the prior art. Actual measurements show that this algorithm estimates the code rate better and faster.
Specifically, with the Sendside-BWE congestion control algorithm adopted by mainstream implementations such as Google's: 1) when the sending end sends an RTP data packet, a transport-wide sequence number is set in the RTP header extension; 2) after the data packet arrives at the receiving end, the sequence number and the packet arrival time are recorded, and the receiving end returns the constructed feedback message to the sending end; 3) the sending end parses the message, executes the Sendside-BWE algorithm, and calculates the delay-based code rate Ar; 4) finally, Ar is compared with the packet-loss-based code rate As to obtain the final target code rate, forming a complete loop.
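The application does not spell out the loss-based update or the comparison rule, so the sketch below assumes the usual send-side BWE conventions:

```python
def loss_based_bitrate(prev_as, loss_fraction):
    """Update the packet-loss-based estimate As; the thresholds follow the
    common GCC heuristic and are an assumption, not taken from this patent."""
    if loss_fraction > 0.10:            # heavy loss: back off
        return prev_as * (1 - 0.5 * loss_fraction)
    if loss_fraction < 0.02:            # negligible loss: probe upward
        return prev_as * 1.05
    return prev_as                      # moderate loss: hold steady

def target_bitrate(ar_delay_based, as_loss_based):
    """Final target code rate from the delay-based estimate Ar and the
    loss-based estimate As; taking the minimum is the usual convention."""
    return min(ar_delay_based, as_loss_based)
```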
Specifically, in the embodiment of the present application, after the current maximum actual code rate is determined using the congestion control algorithm, the amount of data the current network can carry can be determined from that code rate. Different encoding parameters and different encoding methods are then set for the different regions (salient region and non-salient region) of the key frame image.
It can be understood that, during the transmission of video conference data, the usable data volume (network bandwidth) is limited. Considering the different significance of different regions, low-code-rate coding can be applied to the non-salient region and high-code-rate coding to the salient region, so that a clear video image can be transmitted as far as possible within the limited data volume.
The low-code-rate coding combines global motion compensation and local motion compensation, since global motion compensation suits background-region coding and local motion compensation suits locally complex regions. First, global motion estimation is performed on the input video frame image to obtain global motion model parameters; the reference point positions are then calculated, and the reference point motion vectors are differentially encoded to obtain the reference point code stream. At the encoder end, the encoded current frame is decoded, the reconstructed image is stored in a buffer, and the current decoded image is transformed according to the obtained global motion estimation parameters for the global motion compensation of the next frame.
The high-code-rate coding uses the H.265 coding mode, which adds quad-tree-based block division to traditional coding technology and adopts variable-size DCT transforms and other techniques, achieving a higher frame rate, a higher compression rate, and higher definition.
If the available data volume (network bandwidth) is insufficient for the data volume of the current key frame image, the salient region is encoded at a high code rate and allocated a certain proportion (for example, 0.8 times) of the total data volume, while the non-salient region is encoded at a low code rate and allocated a certain proportion (for example, 0.2 times) of the total data volume. If the available data volume is larger than the data volume of the current key frame image, the normal code rate is used for encoding. The encoded data can be placed into the sending buffer, and the video conference data sending thread is started to send the video conference data; the video conference terminal receives the video data, decodes it, and plays the video conference images to complete the video conference.
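A sketch of this allocation rule, with the 0.8/0.2 split mirroring the example proportions above:

```python
def region_code_rates(available_rate, frame_data_rate, salient_share=0.8):
    """Per-region encoding rates for one key frame image.

    When the network cannot carry the whole frame, the salient region is
    encoded at a high code rate with the salient share (e.g. 0.8x) of the
    available data volume and the non-salient region at a low code rate
    with the rest (e.g. 0.2x); otherwise both use the normal code rate.
    """
    if available_rate < frame_data_rate:
        return {"salient": salient_share * available_rate,
                "non_salient": (1 - salient_share) * available_rate}
    return {"salient": frame_data_rate, "non_salient": frame_data_rate}
```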
From the above description, the video conference implementation method provided by the application can transmit each key frame image according to the current maximum actual code rate of the network to complete the video conference.
Based on the same inventive concept, the embodiment of the present application further provides a video conference implementation apparatus, which can be used to implement the methods described in the foregoing embodiments, as described in the following embodiments. Because the principle on which the video conference implementation apparatus solves the problems is similar to that of the video conference implementation method, the implementation of the apparatus can refer to the implementation of the method, and repeated parts are not described again. As used hereinafter, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. While the system described in the embodiments below is preferably implemented in software, implementations in hardware, or a combination of software and hardware, are also possible and contemplated.
In an embodiment, referring to fig. 5, in order to improve the video conference experience under low-bandwidth conditions, adjust the transmission amount of video data in real time, reduce network congestion, and highlight important content, the present application provides a video conference implementation apparatus, including: a salient region determining unit 501, a key frame selecting unit 502, and a video conference implementation unit 503.
A salient region determining unit 501, configured to determine the salient region of each video frame image by using a pre-trained salient object detection model;
a key frame selecting unit 502, configured to group the video frame images and select from each group the video frame image with the largest salient region as a key frame image;
and a video conference implementation unit 503, configured to transmit each key frame image according to the current maximum actual code rate of the network to complete the video conference.
In an embodiment, referring to fig. 6, the video conference implementation apparatus further includes: an image feature learning unit 601, a parameter updating unit 602, and a model generating unit 603.
An image feature learning unit 601, configured to learn image features of historical video frame images using a residual network model; the image features comprise contrast features, brightness features and pixel features;
a parameter updating unit 602, configured to update neuron parameters in the residual network model according to the learned image features;
and a model generating unit 603, configured to perform semantic-dimension learning according to the updated neuron parameters and improve the detection precision of the residual network model, obtaining the salient object detection model.
In an embodiment, the salient region determining unit is specifically configured to perform binarization processing on pixel points of each video frame image by using the salient object detection model to obtain the salient region.
In an embodiment, referring to fig. 7, the key frame selecting unit 502 includes: a pixel point number determining module 701 and a key frame selecting module 702.
A pixel point number determining module 701, configured to determine the number of salient pixel points in a salient region of each video frame image respectively;
a key frame selecting module 702, configured to select from the group the video frame image with the largest number of salient pixel points as the key frame image.
In an embodiment, referring to fig. 8, the video conference implementation unit 503 includes:
an actual code rate determining module 801, configured to determine the current maximum actual code rate by using a congestion control algorithm;
a transmission code rate determining module 802, configured to determine, according to the current maximum actual code rate, the transmission code rate of the salient region and the transmission code rate of the non-salient region in each key frame image;
a transmission module 803, configured to transmit the salient region according to the transmission code rate of the salient region and transmit the non-salient region according to the transmission code rate of the non-salient region, to complete the video conference.
In terms of hardware, in order to improve the video conference experience under a low bandwidth condition, adjust the transmission amount of video data in real time, reduce network congestion, and highlight important contents, the present application provides an embodiment of an electronic device for implementing all or part of the contents in the video conference implementation method, where the electronic device specifically includes the following contents:
a processor, a memory, a communication interface, and a bus. The processor, the memory, and the communication interface communicate with one another through the bus. The communication interface is used to realize information transmission between the video conference implementation apparatus and related equipment such as a core service system, user terminals, and related databases. The electronic device may be a desktop computer, a tablet computer, a mobile terminal, or the like, but the embodiment is not limited thereto. In this embodiment, the electronic device may be implemented with reference to the embodiment of the video conference implementation method and the embodiment of the video conference implementation apparatus, the contents of which are incorporated herein; repeated descriptions are omitted.
It is understood that the user terminal may include a smart phone, a tablet electronic device, a network set-top box, a portable computer, a desktop computer, a Personal Digital Assistant (PDA), an in-vehicle device, a smart wearable device, and the like. Wherein, intelligence wearing equipment can include intelligent glasses, intelligent wrist-watch, intelligent bracelet etc..
In practical applications, part of the video conference implementation method may be executed on the electronic device side as described in the above, or all operations may be completed in the client device. The selection may be specifically performed according to the processing capability of the client device, the limitation of the user usage scenario, and the like. This is not a limitation of the present application. The client device may further include a processor if all operations are performed in the client device.
The client device may have a communication module (i.e., a communication unit), and may be in communication connection with a remote server to implement data transmission with the server. The server may include a server on the side of the task scheduling center, and in other implementation scenarios, the server may also include a server on an intermediate platform, for example, a server on a third-party server platform that is communicatively linked to the task scheduling center server. The server may include a single computer device, or may include a server cluster formed by a plurality of servers, or a server structure of a distributed apparatus.
Fig. 9 is a schematic block diagram of a system configuration of an electronic device 9600 according to an embodiment of the present application. As shown in fig. 9, the electronic device 9600 can include a central processor 9100 and a memory 9140; the memory 9140 is coupled to the central processor 9100. Notably, this fig. 9 is exemplary; other types of structures may also be used in addition to or in place of the structure to implement telecommunications or other functions.
In one embodiment, the video conference implementation functionality may be integrated into the central processor 9100. The central processor 9100 may be configured to control as follows:
s101: determining a salient region of each video frame image by using a pre-trained salient object detection model;
s102: grouping the video frame images, and respectively selecting the video frame image with the largest salient region from each group as a key frame image;
s103: and transmitting each key frame image according to the current maximum actual code rate of the network to complete the video conference.
From the above description, the video conference implementation method provided by the application processes the video to be transmitted in real time using a salient object detection algorithm and extracts key frames. It can still ensure the smooth progress of the video conference under low bandwidth and network fluctuation, improve the video conference experience under those conditions, adjust the video data transmission amount in real time, reduce network congestion, and highlight key content.
In another embodiment, the video conference implementation apparatus may be configured separately from the central processor 9100; for example, it may be configured as a chip connected to the central processor 9100, with the function of the video conference implementation method realized under the control of the central processor.
As shown in fig. 9, the electronic device 9600 may further include: a communication module 9110, an input unit 9120, an audio processor 9130, a display 9160, and a power supply 9170. It is noted that the electronic device 9600 also does not necessarily include all of the components shown in fig. 9; in addition, the electronic device 9600 may further include components not shown in fig. 9, which may be referred to in the prior art.
As shown in fig. 9, a central processor 9100, sometimes referred to as a controller or operational control, can include a microprocessor or other processor device and/or logic device, which central processor 9100 receives input and controls the operation of the various components of the electronic device 9600.
The memory 9140 can be, for example, one or more of a buffer, a flash memory, a hard drive, a removable medium, a volatile memory, a non-volatile memory, or another suitable device. It may store information as well as the programs that process that information, and the central processor 9100 can execute a program stored in the memory 9140 to realize information storage or processing.
The input unit 9120 provides input to the central processor 9100. The input unit 9120 is, for example, a key or a touch input device. Power supply 9170 is used to provide power to electronic device 9600. The display 9160 is used for displaying display objects such as images and characters. The display may be, for example, an LCD display, but is not limited thereto.
The memory 9140 can be a solid-state memory, e.g., read-only memory (ROM), random access memory (RAM), a SIM card, or the like. It may also be a memory that holds information even when powered off, that can be selectively erased, and that can be provided with more data; an example of such a memory is sometimes called an EPROM or the like. The memory 9140 could also be some other type of device. The memory 9140 includes a buffer memory 9141 (sometimes referred to as a buffer) and may include an application/function storage portion 9142, used for storing application programs and function programs or for executing the operation flow of the electronic device 9600 through the central processor 9100.
The memory 9140 can also include a data store 9143, the data store 9143 being used to store data, such as contacts, digital data, pictures, sounds, and/or any other data used by an electronic device. The driver storage portion 9144 of the memory 9140 may include various drivers for the electronic device for communication functions and/or for performing other functions of the electronic device (e.g., messaging applications, contact book applications, etc.).
The communication module 9110 is a transmitter/receiver 9110 that transmits and receives signals via an antenna 9111. The communication module (transmitter/receiver) 9110 is coupled to the central processor 9100 to provide input signals and receive output signals, which may be the same as in the case of a conventional mobile communication terminal.
Based on different communication technologies, a plurality of communication modules 9110, such as a cellular network module, a bluetooth module, and/or a wireless lan module, may be disposed in the same electronic device. The communication module (transmitter/receiver) 9110 is also coupled to a speaker 9131 and a microphone 9132 via an audio processor 9130 to provide audio output via the speaker 9131 and receive audio input from the microphone 9132, thereby implementing ordinary telecommunications functions. The audio processor 9130 may include any suitable buffers, decoders, amplifiers and so forth. In addition, the audio processor 9130 is also coupled to the central processor 9100, thereby enabling recording locally through the microphone 9132 and enabling locally stored sounds to be played through the speaker 9131.
An embodiment of the present application further provides a computer-readable storage medium capable of implementing all the steps of the video conference implementation method whose execution subject is the server or the client in the foregoing embodiments. The computer-readable storage medium stores a computer program which, when executed by a processor, implements all the steps of that method; for example, when the processor executes the computer program, the following steps are implemented:
s101: determining a salient region of each video frame image by using a pre-trained salient object detection model;
s102: grouping the video frame images, and respectively selecting the video frame image with the largest salient region from each group as a key frame image;
s103: and transmitting each key frame image according to the current maximum actual code rate of the network to complete the video conference.
From the above description, the video conference implementation method provided by the application processes the video to be transmitted in real time using a salient object detection algorithm and extracts key frames. It can still ensure the smooth progress of the video conference under low bandwidth and network fluctuation, improve the video conference experience under those conditions, adjust the video data transmission amount in real time, reduce network congestion, and highlight key content.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The principle and implementation of the invention are explained above using specific embodiments, and the description of the embodiments is only intended to help understand the method and core idea of the invention. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as a limitation of the present invention.

Claims (12)

1. A video conference implementation method is characterized by comprising the following steps:
determining a salient region of each video frame image by using a pre-trained salient object detection model;
grouping the video frame images, and respectively selecting the video frame image with the largest salient region from each group as a key frame image;
and transmitting each key frame image according to the current maximum actual code rate of the network to complete the video conference.
2. The method of claim 1, wherein the step of pre-training the salient object detection model comprises:
learning image features of historical video frame images by using a residual network model; the image features comprise contrast features, brightness features and pixel features;
updating neuron parameters in the residual network model according to the learned image features;
and performing semantic-dimension learning according to the updated neuron parameters to improve the detection precision of the residual network model, obtaining the salient object detection model.
3. The method of claim 1, wherein the determining the salient region of each video frame image by using a pre-trained salient object detection model comprises:
and carrying out binarization processing on pixel points of each video frame image by using the salient object detection model to obtain the salient region.
4. The method of claim 1, wherein the grouping the video frame images and selecting the video frame image with the largest salient region from the groups as a key frame image comprises:
determining the number of salient pixel points in the salient region of each video frame image;
and selecting from the group the video frame image containing the most salient pixel points as the key frame image.
5. The method of claim 1, wherein transmitting each key frame image according to the current maximum actual code rate of the network to complete the video conference comprises:
determining the current maximum actual code rate by using a congestion control algorithm;
determining the transmission code rate of the salient region and the transmission code rate of the non-salient region in each key frame image according to the current maximum actual code rate;
and transmitting the salient region according to the transmission code rate of the salient region, and transmitting the non-salient region according to the transmission code rate of the non-salient region, to complete the video conference.
6. A video conference realization apparatus, comprising:
the salient region determining unit is used for determining the salient region of each video frame image by utilizing a pre-trained salient object detection model;
the key frame selecting unit is used for grouping the video frame images and respectively selecting the video frame image with the largest salient region from each group as a key frame image;
and the video conference realization unit is used for transmitting each key frame image according to the current maximum actual code rate of the network, thereby realizing the video conference.
7. The video conference realization apparatus according to claim 6, further comprising:
the image feature learning unit is used for learning image features of historical video frame images by using a residual network model, wherein the image features comprise contrast features, brightness features and pixel features;
the parameter updating unit is used for updating the neuron parameters of the residual network model according to the learned image features;
and the model generation unit is used for performing semantic-dimension learning according to the updated neuron parameters to improve the detection precision of the residual network model, thereby obtaining the salient object detection model.
8. The apparatus according to claim 6, wherein the salient region determining unit is specifically configured to perform binarization processing on pixel points of each video frame image by using the salient object detection model to obtain the salient region.
9. The apparatus as claimed in claim 6, wherein the key frame selecting unit comprises:
the pixel point number determining module is used for determining, for each video frame image, the number of salient pixel points in its salient region;
and the key frame selecting module is used for selecting from each group the video frame image containing the most salient pixel points as the key frame image.
10. The apparatus of claim 6, wherein the video conference realization unit comprises:
an actual code rate determining module, configured to determine the current maximum actual code rate by using a congestion control algorithm;
a transmission code rate determining module, configured to determine, according to the current maximum actual code rate, the transmission code rate of the salient region and the transmission code rate of the non-salient region in each key frame image;
and the transmission module is used for transmitting the salient region according to the transmission code rate of the salient region and transmitting the non-salient region according to the transmission code rate of the non-salient region, thereby realizing the video conference.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the video conference realization method of any one of claims 1 to 5 when executing the program.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the video conference realization method of any one of claims 1 to 5.
CN202111160016.5A (priority date 2021-09-30; filing date 2021-09-30): Video conference realization method and device. Status: Pending. Published as CN113923398A (en).

Priority Applications (1)

CN202111160016.5A (priority date 2021-09-30; filing date 2021-09-30): Video conference realization method and device

Publications (1)

CN113923398A (en), published 2022-01-11

Family

ID=79237459

Family Applications (1)

CN202111160016.5A (pending; published as CN113923398A (en)): Video conference realization method and device; priority date 2021-09-30; filing date 2021-09-30

Country Status (1)

CN: CN113923398A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103458238A (en) * 2012-11-14 2013-12-18 深圳信息职业技术学院 Scalable video code rate controlling method and device combined with visual perception
CN103873876A (en) * 2014-03-17 2014-06-18 天津大学 Conspicuousness-based multi-viewpoint color plus depth video coding method
CN105049850A (en) * 2015-03-24 2015-11-11 上海大学 HEVC (High Efficiency Video Coding) code rate control method based on region-of-interest
CN106604031A (en) * 2016-11-22 2017-04-26 金华就约我吧网络科技有限公司 Region of interest-based H. 265 video quality improvement method
CN107396108A (en) * 2017-08-15 2017-11-24 西安万像电子科技有限公司 Code rate allocation method and device
CN110602548A (en) * 2019-09-20 2019-12-20 北京市博汇科技股份有限公司 Method and system for high-quality wireless transmission of ultra-high-definition video
CN110708507A (en) * 2019-09-23 2020-01-17 深圳市景阳信息技术有限公司 Monitoring video data transmission method and device and terminal equipment
WO2020112452A1 (en) * 2018-11-30 2020-06-04 Motorola Solutions, Inc. Device, system and method for providing audio summarization data from video
CN113111782A (en) * 2021-04-14 2021-07-13 中国工商银行股份有限公司 Video monitoring method and device based on salient object detection

Similar Documents

Publication Title
CN111340711B (en) Super-resolution reconstruction method, device, equipment and storage medium
CN111479112B (en) Video coding method, device, equipment and storage medium
US11025959B2 (en) Probabilistic model to compress images for three-dimensional video
US10491711B2 (en) Adaptive streaming of virtual reality data
US20190246096A1 (en) Behavioral Directional Encoding of Three-Dimensional Video
JP2020010331A (en) Method for improving image quality
CN106576158A (en) Immersive video
CN106170979A (en) Constant Quality video encodes
CN110996131B (en) Video encoding method, video encoding device, computer equipment and storage medium
KR102480709B1 (en) Method and apparatus for determining quality of experience of vr multi-media
CN112040222B (en) Visual saliency prediction method and equipment
CN110969572B (en) Face changing model training method, face exchange device and electronic equipment
WO2022000298A1 (en) Reinforcement learning based rate control
CN115037962A (en) Video adaptive transmission method, device, terminal equipment and storage medium
CN113747160B (en) Video coding configuration method, device, equipment and computer readable storage medium
JP2024511103A (en) Method and apparatus for evaluating the quality of an image or video based on approximate values, method and apparatus for training a first model, electronic equipment, storage medium, and computer program
CN112492323B (en) Live broadcast mask generation method, readable storage medium and computer equipment
CN112165598A (en) Data processing method, device, terminal and storage medium
CN113923398A (en) Video conference realization method and device
CN115499666A (en) Video compression method, video decompression method, video compression device, video decompression device, and storage medium
CN112200816A (en) Method, device and equipment for segmenting region of video image and replacing hair
CN116708793B (en) Video transmission method, device, equipment and storage medium
CN116939254A (en) Video stream transmission method, device, computer equipment and storage medium
CN116010697A (en) Data processing method, electronic device and storage medium
CN116419032A (en) Video playing method, device, equipment and computer readable storage medium

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20220111)