CN114596514A - Video conference system and method for eliminating disturbance thereof - Google Patents

Video conference system and method for eliminating disturbance thereof

Info

Publication number
CN114596514A
Authority
CN
China
Prior art keywords
image
video
disturbing
image object
processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011314658.1A
Other languages
Chinese (zh)
Inventor
阮钰珊
曹凌帆
廖述群
范圣欣
黄宇杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Acer Inc
Original Assignee
Acer Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Acer Inc filed Critical Acer Inc
Priority to CN202011314658.1A priority Critical patent/CN114596514A/en
Publication of CN114596514A publication Critical patent/CN114596514A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/451Execution arrangements for user interfaces
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention provides a video conference system and a method for eliminating disturbance in the system. The method includes the following steps. A video conference is initiated, and a video stream is acquired by an image capture device. A deep learning model detects at least one first image object within a first video frame of the video stream. Whether the at least one first image object is a disturbing object is determined. In response to determining that the at least one first image object is a disturbing object, the disturbing object is removed from the first video frame.

Description

Video conference system and method for eliminating disturbance thereof
Technical Field
The present invention relates to video conferencing systems, and more particularly, to a video conferencing system and a method for eliminating disturbance thereof.
Background
With the worldwide outbreak of the novel coronavirus epidemic, the need to hold video conferences at home for remote work or online courses has increased dramatically. During a video conference, unexpected disturbances in the user's surroundings may affect the conference. In a home setting, for example, a family member or a pet may wander into the background of the video without noticing the ongoing conference, causing image disturbance, or a child or a pet may suddenly make noise that disrupts the conference.
Disclosure of Invention
The disclosure provides a video conference system and a method for eliminating disturbance that can remove disturbances from a video conference in real time without affecting other conference participants.
An embodiment of the invention provides a method for eliminating disturbance, suitable for a video conference system, which includes the following steps. A video conference is initiated, and a video stream is acquired by an image capture device. A deep learning model detects at least one first image object within a first video frame of the video stream. Whether the at least one first image object is a disturbing object is determined. In response to determining that the at least one first image object is a disturbing object, the disturbing object is removed from the first video frame.
An embodiment of the present invention provides a video conference system, which includes a display, an image capture device, a storage device, and a processor. The processor is coupled to the display, the image capture device, and the storage device, and is configured to perform the following steps. A video conference is initiated, and a video stream is acquired by the image capture device. A deep learning model detects at least one first image object within a first video frame of the video stream. Whether the at least one first image object is a disturbing object is determined. In response to determining that the at least one first image object is a disturbing object, the disturbing object is removed from the first video frame.
Based on the above, when a disturbance occurs in the user's surroundings, the video conference system of the embodiments can automatically detect the disturbing object in the video frame and eliminate it. Disturbances to the video conference can therefore be removed in real time, improving the fluency of the conference.
Drawings
FIG. 1 is a block diagram of a video conferencing system in accordance with one embodiment of the present invention;
FIG. 2 is a flow diagram of a method of eliminating disturbance according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of disturbing object removal according to an embodiment of the present invention;
FIG. 4 is a flow diagram of a method of eliminating disturbance according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating comparison of video frames to detect a disturbing object according to an embodiment of the present invention;
FIG. 6 is a flow diagram of a method of eliminating disturbance according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of detecting a disturbing object by using a deep learning model according to an embodiment of the invention;
FIG. 8 is a block diagram of a video conferencing system in accordance with another embodiment of the present invention;
FIG. 9 is a flow chart of a method of eliminating disturbance according to an embodiment of the present invention.
Description of the reference numerals
10: video conference system;
110: display;
120: storage device;
130: processor;
140: image capture device;
150: microphone;
310: first region;
320: second region;
Img_b: background picture;
Img_1, Img_c: first video frames;
Img_1': processed first video frame;
Obj_in: disturbing object;
Img_r: reference video frame;
Obj_c1, Obj_c2: first image objects;
Obj_r1: reference image object;
Img_t1 to Img_tn: second video frames;
Img_f1 to Img_fn: face images;
M1: deep learning model;
S201 to S204, S401 to S406, S601 to S608, S901 to S907: steps.
Detailed Description
Reference will now be made in detail to exemplary embodiments of the invention, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings and the description to refer to the same or like parts.
Fig. 1 is a block diagram of a video conferencing system in accordance with an embodiment of the present invention. Referring to fig. 1, the video conference system 10 includes a display 110, a storage device 120, a processor 130, and an image capture device 140. The processor 130 is coupled to the display 110, the storage device 120, and the image capture device 140. In some embodiments, the video conference system 10 may be implemented as a computer system having the display 110, the storage device 120, and the processor 130, together with an image capture device 140 external to the computer system. For example, the video conference system 10 may consist of a notebook or desktop computer and an external camera, but the invention is not limited thereto. In other embodiments, the video conference system 10 may be implemented by integrating the display 110, the storage device 120, the processor 130, and the image capture device 140 into a single electronic device. For example, the video conference system 10 may be implemented as an electronic device with an image capture function, such as a smart phone, a tablet computer, or a notebook computer, but the invention is not limited thereto.
The display 110 may be a liquid crystal display (LCD), a light-emitting diode (LED) display, an organic light-emitting diode (OLED) display, or another type of display; the invention is not limited in this respect.
The storage device 120 is used for storing data such as files, images, instructions, program code, and software components, and may be any type of fixed or removable random access memory (RAM), read-only memory (ROM), flash memory, hard disk or other similar device, integrated circuit, or a combination thereof.
The image capture device 140 is used to capture images to generate a video stream and includes a camera module having a lens and a photosensitive element. The photosensitive element senses the intensity of light entering the lens to generate an image. The photosensitive element may be, for example, a charge-coupled device (CCD), a complementary metal-oxide-semiconductor (CMOS) device, or another device; the invention is not limited in this respect.
The processor 130 is coupled to the display 110, the storage device 120, and the image capture device 140 to control the overall operation of the video conference system 10. The processor 130 may be a central processing unit (CPU), another programmable general-purpose or special-purpose microprocessor, a digital signal processor (DSP), a programmable controller, an application-specific integrated circuit (ASIC), a programmable logic device (PLD), another similar device, or a combination of these devices. The processor 130 may execute program code, software modules, and instructions recorded in the storage device 120 to implement the method of eliminating disturbance of the embodiments of the present invention.
FIG. 2 is a flow chart of a method of eliminating disturbance according to an embodiment of the present invention. Referring to fig. 2, the method of the present embodiment is applied to the video conference system 10 of the above embodiment, and the detailed steps of the present embodiment are described below with reference to various components in the video conference system 10.
In step S201, the processor 130 initiates a video conference and obtains a video stream through the image capture device 140. The processor 130 may initiate the video conference by executing video conferencing software, and the display 110 may display a user interface of the video conferencing software. During the video conference, the image capture device 140 continuously captures images to generate a video stream and may provide the video stream to the computer system composed of the processor 130 and the storage device 120. The video stream includes a plurality of video frames corresponding to different points in time. The processor 130 may continuously provide video frames that include the user to the other conference participants via the network.
In step S202, the processor 130 detects at least one first image object in a first video frame of the video stream by using a deep learning model. The deep learning model is used for object detection. It may be a convolutional neural network (CNN) model for object detection, such as R-CNN, Fast R-CNN, Faster R-CNN, YOLO, or SSD; the invention does not limit the network architecture used by the deep learning model. In detail, after the image capture device 140 acquires the current video frame (i.e., the first video frame), the processor 130 may utilize the deep learning model to detect and identify one or more first image objects corresponding to at least one object classification result. For example, the processor 130 may utilize the deep learning model to detect one or more first image objects classified as "person" within the current video frame. The processor 130 may also detect a plurality of first image objects corresponding to different object classification results in the current video frame through the deep learning model.
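To make the detection step concrete, the following is a minimal sketch of steps S201 and S202, assuming a Python environment with OpenCV and torchvision. The patent does not mandate any particular framework, and the pretrained Faster R-CNN used here is only one of the candidate architectures named above; the score threshold and webcam index are illustrative assumptions.

```python
# Minimal sketch of steps S201-S202: acquire a video frame and detect
# image objects with an off-the-shelf detector (illustrative choice).
import cv2
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(pretrained=True).eval()

def detect_objects(frame_bgr, score_threshold=0.7):
    """Return (box, label, score) tuples for objects found in one frame."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    tensor = torch.from_numpy(rgb).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        out = model([tensor])[0]
    keep = out["scores"] > score_threshold
    return list(zip(out["boxes"][keep].tolist(),
                    out["labels"][keep].tolist(),
                    out["scores"][keep].tolist()))

cap = cv2.VideoCapture(0)      # image capture device 140 (assumed webcam)
ok, first_frame = cap.read()   # one video frame of the video stream
if ok:
    first_image_objects = detect_objects(first_frame)
```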
In step S203, the processor 130 determines whether the at least one first image object is a disturbing object. In one embodiment, after acquiring the first image objects in the current video frame (i.e., the first video frame), the processor 130 may directly use another deep learning model to identify whether each first image object is a disturbing object, so as to determine whether the current video frame includes a disturbing object. In another embodiment, after acquiring the first image objects in the current video frame, the processor 130 may determine whether the current video frame includes a disturbing object by comparing the image objects of the current video frame with the image objects of a previous video frame.
In step S204, in response to determining that at least one of the first image objects is a disturbing object, the processor 130 removes the disturbing object from the first video frame. The processor 130 may then provide the first video frame without the disturbing object to the other conference participants so that the disturbing object is not seen by the other conference participants of the video conference. It should be noted that, in some embodiments, after determining that at least one first image object is a disturbing object, the processor 130 may further remove the disturbing object in other video frames acquired after the first video frame according to the position information of the first image object.
In one embodiment, the processor 130 may replace the first video frame with a video frame without the disturbing object, thereby removing the disturbing object. For example, in response to determining that the first video frame includes the disturbing object, the processor 130 may replace the first video frame with a video frame acquired 3 seconds ago. Alternatively, in one embodiment, the processor 130 may remove the disturbing object from the first video frame by various image processing techniques, such as covering the disturbing object with a predetermined pattern or blurring the disturbing object. Alternatively, in an embodiment, the processor 130 may replace the second area of the first video frame including the disturbing object with the first area of the third video frame, so as to achieve the purpose of removing the disturbing object. The third video picture may be a background picture taken before the video conference starts. Alternatively, the third video frame may be a video frame acquired before the first video frame.
For example, fig. 3 is a schematic diagram of removing a disturbing object according to an embodiment of the invention. Referring to fig. 3, the processor 130 may replace the second region 320 of the first video frame Img_1, which includes the disturbing object Obj_in, with the first region 310 of the background picture Img_b to generate a processed first video frame Img_1'. In some embodiments, the size and position of the second region 320 including the disturbing object Obj_in may be determined by a bounding box provided by the deep learning model, from which the size and position of the first region 310 are obtained.
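As an illustration of the region replacement in Fig. 3, the sketch below overwrites the bounding box of the disturbing object with the co-located pixels of the background picture. It assumes the background picture and the video frame share the same resolution; the padding margin is an added assumption, not something the patent specifies.

```python
# Minimal sketch of Fig. 3: replace the second region (containing Obj_in)
# of frame Img_1 with the first region of background picture Img_b.
import numpy as np

def remove_disturbing_object(img_1, img_b, box, margin=10):
    """Return a copy of img_1 with the boxed region taken from img_b."""
    h, w = img_1.shape[:2]
    x1, y1, x2, y2 = [int(v) for v in box]
    # Expand the box slightly so the object's edges are fully covered.
    x1, y1 = max(0, x1 - margin), max(0, y1 - margin)
    x2, y2 = min(w, x2 + margin), min(h, y2 + margin)
    img_1p = img_1.copy()
    img_1p[y1:y2, x1:x2] = img_b[y1:y2, x1:x2]   # processed frame Img_1'
    return img_1p
```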
Based on the foregoing, the processor 130 can determine whether the first image object in the first video frame is the disturbing object by different determination mechanisms. Examples will be individually listed below for illustration.
FIG. 4 is a flow chart of a method of eliminating disturbance according to an embodiment of the present invention. Referring to fig. 4, the method of the present embodiment is applied to the video conference system 10 of the above embodiment, and the detailed steps of the present embodiment are described below with reference to various components in the video conference system 10.
In step S401, the processor 130 initiates a video conference and obtains a video stream through the image capture device 140. In step S402, in response to a trigger operation, the processor 130 selects a reference video frame from the video stream. Next, in step S403, the processor 130 detects at least one reference image object in the reference video frame by using the deep learning model. In detail, the trigger operation may be a user input operation through which the user activates the disturbance-elimination function of the video conference system 10. The user input operation may be a voice input, a touch input, a mouse input, a keyboard input, or the like, which is not limited in the present invention. For example, the user may activate the disturbance-elimination function by pressing a particular function key. In response to receiving the trigger operation, the processor 130 may set a previous video frame as the reference video frame and classify the reference image objects in the reference video frame as non-disturbing objects. The operation of detecting the reference image objects in the reference video frame with the deep learning model is similar to that of detecting the first image objects in the first video frame of the video stream, as described in the foregoing embodiments.
In step S404, the processor 130 detects at least one first image object in a first video frame of the video stream by using the deep learning model. In step S405, the processor 130 determines whether the at least one first image object is a disturbing object. In this embodiment, step S405 can be implemented as steps S4051 to S4053.
In step S4051, the processor 130 may determine whether the at least one first image object is a disturbing object by comparing the at least one first image object in the first video frame with the at least one reference image object in the reference video frame. In one embodiment, the processor 130 may determine whether a first image object in the first video frame corresponds to a reference image object in the reference video frame according to the object classification results and image positions of both, that is, whether the first image object and the reference image object correspond to the same real-scene object. Based on the criterion that the reference image objects in the reference video frame are classified as non-disturbing objects, if the processor 130 finds that a first image object does not correspond to any reference image object, it can determine that the first image object is a new disturbing object.
Then, in step S4052, in response to the at least one first image object not corresponding to the at least one reference image object, the processor 130 determines that the at least one first image object is a disturbing object. In step S4053, in response to the at least one first image object corresponding to the at least one reference image object, the processor 130 determines that the at least one first image object is not a disturbing object. For example, if reference image objects of two conference participants are included in the reference video frame acquired before the trigger operation is received, the first image objects corresponding to these two conference participants in the first video frame will not be determined as disturbing objects by the processor 130. In step S406, in response to determining that the at least one first image object is a disturbing object, the processor 130 removes the disturbing object from the first video frame.
FIG. 5 is a diagram illustrating comparison of video frames to detect a disturbing object according to an embodiment of the present invention. Referring to fig. 5, in response to receiving the trigger operation at time t2, the processor 130 may select the reference video frame Img_r acquired at time t1. For example, the reference video frame Img_r may be a video frame acquired 2 seconds before the trigger operation. The processor 130 may detect the reference image object Obj_r1 from the reference video frame Img_r. Then, the processor 130 may take the first video frame Img_c acquired at time t3 and detect two first image objects Obj_c1 and Obj_c2 in it. In response to determining that the first image object Obj_c1 corresponds to the reference image object Obj_r1, the processor 130 may determine that the first image object Obj_c1 is not a disturbing object. In response to determining that the first image object Obj_c2 does not correspond to any reference image object, the processor 130 may determine that the first image object Obj_c2 is a disturbing object. Thus, the processor 130 may remove the first image object Obj_c2 from the first video frame Img_c and provide the processed video frame, which does not include the first image object Obj_c2, to the other conference participants.
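A minimal sketch of the comparison in steps S4051 to S4053 follows. The patent compares object classification results and image positions; reading the position comparison as an intersection-over-union (IoU) test with a threshold is an assumption of this sketch, and the tuple layout matches the detector sketch above.

```python
# Minimal sketch of steps S4051-S4053: a first image object corresponds to a
# reference image object when both share a classification and overlap enough.
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def find_disturbing_objects(first_objects, reference_objects, iou_thr=0.3):
    """Return current-frame objects with no counterpart in the reference frame."""
    disturbing = []
    for box, label, _ in first_objects:                 # e.g. Obj_c1, Obj_c2
        matched = any(label == r_label and iou(box, r_box) >= iou_thr
                      for r_box, r_label, _ in reference_objects)  # e.g. Obj_r1
        if not matched:                                 # no corresponding object
            disturbing.append((box, label))
    return disturbing
```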
FIG. 6 is a flow chart of a method of eliminating disturbance according to an embodiment of the present invention. Referring to fig. 6, the method of the present embodiment is applied to the video conference system 10 of the above embodiment, and the detailed steps of the present embodiment are described below with reference to various components in the video conference system 10. In addition, for better clarity, please refer to fig. 6 and fig. 7 together, and fig. 7 is a schematic diagram illustrating a method for detecting a disturbing object by using a deep learning model according to an embodiment of the invention.
In step S601, the processor 130 acquires a background picture Img_b by using the image capture device 140. The background picture Img_b includes the background of the user in the video conference. In some embodiments, the background picture Img_b may be a picture taken in advance when the video conference starts. For example, the background picture Img_b can be a video frame captured 5 seconds before the video conference is started; the invention is not limited in this respect.
In step S602, the processor 130 acquires a plurality of second video frames Img_t1 to Img_tn of the user by using the image capture device 140. The second video frames Img_t1 to Img_tn may be captured before the video conference starts or during the video conference, and they include images of the user.
In step S603, the processor 130 performs an image subtraction operation on the second video frames Img_t1 to Img_tn according to the background picture Img_b to obtain a plurality of face images Img_f1 to Img_fn. The processor 130 may subtract the background picture Img_b from each of the second video frames Img_t1 to Img_tn according to a background subtraction method to obtain the face images Img_f1 to Img_fn.
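The subtraction in step S603 could look like the sketch below, assuming OpenCV. Thresholding the grayscale difference to build a mask is an illustrative choice; the patent only specifies a background subtraction operation.

```python
# Minimal sketch of step S603: subtract background Img_b from each second
# video frame Img_t to isolate the user as a face image Img_f.
import cv2

def extract_face_image(img_t, img_b, diff_threshold=30):
    """Keep only the pixels of img_t that differ from the background img_b."""
    diff = cv2.absdiff(img_t, img_b)
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, diff_threshold, 255, cv2.THRESH_BINARY)
    return cv2.bitwise_and(img_t, img_t, mask=mask)

# second_video_frames is an assumed list holding Img_t1 ... Img_tn, and
# img_b is the background picture from step S601.
face_images = [extract_face_image(img_t, img_b) for img_t in second_video_frames]
```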
In step S604, the processor 130 trains another deep learning model M1 by using the face images Img_f1 to Img_fn as a training data set. Here, the processor 130 uses the training data set to train an image classifier, which is an image recognition model based on a deep learning algorithm. The deep learning model M1 trained in step S604 is used to classify a model input image object as a disturbing object or a non-disturbing object. In some embodiments, this image classifier may be based on a convolutional neural network (CNN) or another deep learning algorithm. More specifically, after the convolutional neural network architecture of the deep learning model M1 is planned, the weight information in the deep learning model M1 is determined by using the face images Img_f1 to Img_fn together with their ground-truth classification labels, so as to train the deep learning model M1. For example, the face images Img_f1 to Img_fn may each be labeled with the classification result "1", representing a non-disturbing object.
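A minimal sketch of the training in step S604 follows, assuming PyTorch. The network architecture, input size, and hyperparameters are illustrative assumptions; the patent only requires a CNN-based image classifier whose outputs are "1" (the user's face, non-disturbing) and "0" (otherwise).

```python
# Minimal sketch of step S604: train a small CNN (model M1) to classify a
# cropped image object as the user's face (1) or not (0).
import torch
import torch.nn as nn

class FaceClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # For 64x64 inputs, two 2x pools leave a 32-channel 16x16 feature map.
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(32 * 16 * 16, 2))

    def forward(self, x):            # x: (N, 3, 64, 64) cropped image objects
        return self.head(self.features(x))

model_m1 = FaceClassifier()
optimizer = torch.optim.Adam(model_m1.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(batch_images, batch_labels):
    """One optimization step; batch_labels is a LongTensor of 1s and 0s."""
    optimizer.zero_grad()
    loss = criterion(model_m1(batch_images), batch_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```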
In step S605, the processor 130 initiates a video conference and obtains a video stream through the image capture device 140. In step S606, the processor 130 detects at least one first image object Obj_c1, Obj_c2 in the first video frame Img_c of the video stream by using the deep learning model. The deep learning model used in step S606 is different from the deep learning model M1; it is a deep learning model for object detection. In step S607, the processor 130 determines whether at least one of the first image objects Obj_c1 and Obj_c2 is a disturbing object. In this embodiment, step S607 can be implemented as steps S6071 to S6073.
In step S6071, the processor 130 determines whether the first image objects Obj_c1 and Obj_c2 are disturbing objects by using another deep learning model M1 to classify each of them as a disturbing object or a non-disturbing object. In other words, the processor 130 may utilize the deep learning model M1 trained in step S604 to identify whether each of the first image objects Obj_c1 and Obj_c2 is a disturbing object. For example, in some embodiments, the processor 130 may utilize the deep learning model M1 to classify the first image objects Obj_c1 and Obj_c2 into one of two classification results, "1" or "0". A classification result of "1" represents that the model input image object is a face image of the user; a classification result of "0" represents that the model input image object is not a face image of the user.
Then, in step S6072, in response to the deep learning model M1 classifying the at least one first image object Obj_c2 as a disturbing object, the processor 130 determines that the first image object Obj_c2 is a disturbing object. In step S6073, in response to the deep learning model M1 classifying the at least one first image object Obj_c1 as a non-disturbing object, the processor 130 determines that the first image object Obj_c1 is not a disturbing object.
In step S608, in response to determining that the at least one first image object Obj_c2 is a disturbing object, the processor 130 removes the disturbing object from the first video frame. The processor 130 may remove the first image object Obj_c2 from the first video frame Img_c and provide the processed video frame, which does not include the first image object Obj_c2, to the other conference participants.
Fig. 8 is a block diagram of a video conferencing system in accordance with another embodiment of the present invention. Referring to fig. 8, in an embodiment, the video conference system 10 further includes a microphone 150 coupled to the processor 130. The microphone 150 is used for receiving sound signals. In some embodiments, the microphone 150 may be a built-in microphone embedded in an electronic device such as a notebook computer, a desktop computer, a smart phone, a tablet computer, or the like. In other embodiments, microphone 150 may be an external microphone that is separate from the computer system, as the present invention is not limited in this respect.
FIG. 9 is a flow chart of a method of eliminating disturbance according to an embodiment of the present invention. Referring to fig. 9, the method of the present embodiment is applied to the video conference system 10 of the above embodiment, and the detailed steps of the present embodiment will be described below with reference to various components in the video conference system 10 of fig. 8.
In step S901, the processor 130 initiates a video conference and obtains a video stream through the image capture device 140. In step S902, the processor 130 detects at least one first image object in a first video frame of the video stream by using the deep learning model. In step S903, the processor 130 determines whether the at least one first image object is a disturbing object. In step S904, in response to determining that the at least one first image object is a disturbing object, the processor 130 removes the disturbing object from the first video frame. The detailed implementation of these steps is described in the foregoing embodiments and is not repeated here.
In step S905, the processor 130 acquires a sound signal by using the microphone 150 during the video conference. Specifically, the sound signal received by the microphone 150 may include the user's speech and sounds from the user's surroundings, such as a pet's noise, other people speaking, or other sudden sounds, which is not limited by the invention. Under normal conditions, the volume of the sound signal that the user inputs to the microphone does not change dramatically and stays below a volume threshold. Therefore, if the processor 130 determines that the volume of the sound signal exceeds the volume threshold, it can infer that a disturbing sound is present.
In step S906, in response to the volume of the sound signal being greater than the volume threshold, the processor 130 adjusts the microphone 150 to a mute mode. In step S907, in response to the volume of the sound signal not being greater than the volume threshold, the processor 130 adjusts the microphone 150 to a normal sound-receiving mode. That is, the processor 130 continuously determines whether the volume of the sound signal received by the microphone 150 exceeds the volume threshold, which may be a predetermined value or a statistical value determined by the processor 130 from a volume record. For example, the processor 130 may determine whether the volume decibel (dB) value of the sound signal is greater than the volume threshold. In one embodiment, the processor 130 can switch the microphone 150 from the mute mode back to the normal sound-receiving mode in response to the volume of the sound signal changing from greater than the volume threshold to less than the volume threshold.
In one embodiment, the volume threshold may be determined according to the volume record within a predetermined time period. The processor 130 may record a volume record of the sound signal received by the microphone 150 in a predetermined time period, and determine a volume threshold according to the volume record. In one embodiment, the processor 130 may perform a statistical calculation on the volume records within a predetermined time period to obtain a statistical value, and use the statistical value as the volume threshold. The statistical value may be a quartile or the like. In addition, the length of the preset time period is not limited in the embodiments of the present invention, and may be set as required.
For example, the processor 130 samples and records the volume decibel (dB) value of the sound signal once per second and keeps the most recent 600 volume records, covering the last 10 minutes. The volume records may then be as shown in Table 1 below:
TABLE 1

Time (hour:minute:second)    Volume (dB)
10:43:21                     61.2
10:43:22                     59.8
...                          ...
11:43:21                     62.4
The processor 130 may then determine the volume threshold based on the volume records in Table 1. For example, the processor 130 may compute the third quartile of the volume records in Table 1, which is 61.9 dB, and use this third quartile as the volume threshold.
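Putting steps S905 to S907 and the Table 1 example together, a minimal sketch might look like this. The RMS-to-decibel conversion and the rolling 600-sample window are assumptions about the audio front end; the third-quartile threshold follows the example above.

```python
# Minimal sketch of steps S905-S907: keep a rolling volume record, derive the
# threshold as its third quartile, and decide whether to mute the microphone.
import numpy as np

def rms_db(samples):
    """Volume of one second of PCM samples, in decibels."""
    rms = np.sqrt(np.mean(np.square(samples.astype(np.float64))))
    return 20.0 * np.log10(max(rms, 1e-12))

volume_records = []          # rolling log of per-second dB values

def update_and_check(db_value, window=600):
    """Record one sample; return True when the microphone should be muted."""
    volume_records.append(db_value)
    del volume_records[:-window]                     # keep the last 10 minutes
    threshold = np.percentile(volume_records, 75)    # third quartile, e.g. 61.9 dB
    return db_value > threshold
```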
In summary, in the embodiments of the invention, while the user is in a video conference, the video conference system can automatically detect disturbing sounds and disturbing objects in the video frames, and automatically filter out the disturbing sounds and the image objects identified as disturbing objects. The other conference participants are therefore not seriously affected by these disturbing sounds or disturbing objects, and the video conference is not interrupted. The embodiments of the invention can thus eliminate disturbances of the video conference in real time, allowing the conference to proceed smoothly.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and such modifications or substitutions do not depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (16)

1. A method of eliminating disturbance, applicable to a video conference system, the method comprising:
initiating a video conference, and acquiring a video stream through an image capture device;
detecting at least one first image object within a first video frame of the video stream by using a deep learning model;
determining whether the at least one first image object is a disturbing object; and
in response to determining that the at least one first image object is the disturbing object, removing the disturbing object from the first video frame.
2. The method of claim 1, wherein determining whether the at least one first image object is the disturbing object comprises:
determining whether the at least one first image object is the disturbing object by comparing the at least one first image object in the first video frame with at least one reference image object in a reference video frame;
in response to the at least one first image object not corresponding to the at least one reference image object, determining that the at least one first image object is the disturbing object; and
in response to the at least one first image object corresponding to the at least one reference image object, determining that the at least one first image object is not the disturbing object.
3. The method of claim 2, wherein before the step of determining whether the at least one first image object is the disturbing object, the method further comprises:
selecting the reference video frame from the video stream in response to a trigger operation; and
detecting the at least one reference image object in the reference video frame by using the deep learning model.
4. The method of claim 1, wherein determining whether the at least one first image object is the disturbing object comprises:
determining whether the at least one first image object is the disturbing object by using another deep learning model to classify the at least one first image object as the disturbing object or a non-disturbing object;
in response to the another deep learning model classifying the at least one first image object as the disturbing object, determining that the at least one first image object is the disturbing object; and
in response to the another deep learning model classifying the at least one first image object as the non-disturbing object, determining that the at least one first image object is not the disturbing object.
5. The method of claim 4, wherein before the step of determining whether the at least one first image object is the disturbing object, the method further comprises:
acquiring a background picture by using the image capture device;
acquiring a plurality of second video frames of a user by using the image capture device;
performing an image subtraction operation on the second video frames according to the background picture to obtain a plurality of face images; and
training the another deep learning model by using the face images as a training data set.
6. The method of claim 1, wherein removing the disturbing object from the first video frame comprises:
replacing a second region in the first video frame with a first region of a third video frame, wherein the second region includes the disturbing object.
7. The method of eliminating disturbance according to claim 1, further comprising:
acquiring a sound signal by using a microphone during the video conference;
in response to the volume of the sound signal being greater than a volume threshold, adjusting the microphone to a mute mode; and
in response to the volume of the sound signal not being greater than the volume threshold, adjusting the microphone to a normal sound-receiving mode.
8. The method of eliminating disturbance according to claim 7, further comprising:
recording a volume record of the sound signal within a preset time period; and
determining the volume threshold according to the volume record.
9. A video conferencing system, comprising:
a display;
an image capture device;
a storage device, in which a plurality of instructions are recorded; and
a processor, coupled to the display, the image capture device, and the storage device, configured to:
initiate a video conference and obtain a video stream through the image capture device;
detect at least one first image object within a first video frame of the video stream by using a deep learning model;
determine whether the at least one first image object is a disturbing object; and
in response to determining that the at least one first image object is the disturbing object, remove the disturbing object from the first video frame.
10. The video conferencing system of claim 9, wherein the processor is further configured to:
determine whether the at least one first image object is the disturbing object by comparing the at least one first image object in the first video frame with at least one reference image object in a reference video frame;
in response to the at least one first image object not corresponding to the at least one reference image object, determine that the at least one first image object is the disturbing object; and
in response to the at least one first image object corresponding to the at least one reference image object, determine that the at least one first image object is not the disturbing object.
11. The video conferencing system of claim 10, wherein the processor is further configured to:
select the reference video frame from the video stream in response to a trigger operation; and
detect the at least one reference image object in the reference video frame by using the deep learning model.
12. The video conferencing system of claim 9, wherein the processor is further configured to:
determine whether the at least one first image object is the disturbing object by using another deep learning model to classify the at least one first image object as the disturbing object or a non-disturbing object;
in response to the another deep learning model classifying the at least one first image object as the disturbing object, determine that the at least one first image object is the disturbing object; and
in response to the another deep learning model classifying the at least one first image object as the non-disturbing object, determine that the at least one first image object is not the disturbing object.
13. The video conferencing system of claim 12, wherein the processor is further configured to:
acquire a background picture by using the image capture device;
acquire a plurality of second video frames of a user by using the image capture device;
perform an image subtraction operation on the second video frames according to the background picture to obtain a plurality of face images; and
train the another deep learning model by using the face images as a training data set.
14. The video conferencing system of claim 9, wherein the processor is further configured to:
replace a second region of the first video frame with a first region of a background picture, wherein the second region includes the disturbing object.
15. The video conferencing system of claim 9, further comprising a microphone coupled to the processor, wherein the processor is further configured to:
acquire a sound signal by using the microphone during the video conference;
in response to the volume of the sound signal being greater than a volume threshold, adjust the microphone to a mute mode; and
in response to the volume of the sound signal not being greater than the volume threshold, adjust the microphone to a normal sound-receiving mode.
16. The video conferencing system of claim 15, wherein the processor is further configured to:
record a volume record of the sound signal within a preset time period; and
determine the volume threshold according to the volume record.
CN202011314658.1A 2020-11-20 2020-11-20 Video conference system and method for eliminating disturbance thereof Pending CN114596514A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011314658.1A CN114596514A (en) 2020-11-20 2020-11-20 Video conference system and method for eliminating disturbance thereof


Publications (1)

Publication Number Publication Date
CN114596514A 2022-06-07

Family

ID=81812072

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011314658.1A Pending CN114596514A (en) 2020-11-20 2020-11-20 Video conference system and method for eliminating disturbance thereof

Country Status (1)

Country Link
CN (1) CN114596514A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination