CN116582637A - Screen splitting method of video conference picture and related equipment

Info

Publication number
CN116582637A
Authority
CN
China
Prior art keywords
target object
sub-picture
target
speaking
Prior art date
Legal status
Pending
Application number
CN202310611376.5A
Other languages
Chinese (zh)
Inventor
王曌
刘冀洋
张才荣
李尚霖
Current Assignee
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd filed Critical Beijing Zitiao Network Technology Co Ltd
Priority to CN202310611376.5A
Publication of CN116582637A
Legal status: Pending


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H04N 7/15 Conference systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N 5/268 Signal distribution or switching

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The disclosure provides a split screen method of a video conference picture and related equipment. The method comprises the following steps: acquiring a target image acquired by an acquisition unit; detecting a target object in the target image; dividing a video conference picture into at least two sub-pictures according to the target object; and correspondingly displaying the target object in the target image in the at least two sub-pictures.

Description

Screen splitting method of video conference picture and related equipment
Technical Field
The disclosure relates to the field of computer technology, and in particular, to a method for splitting a video conference picture and related equipment.
Background
In video conferencing, face-to-face communication and interaction are critical. However, when several people sit in a conference room to join a video conference, most conference rooms have only one camera, so only a single video conference picture is captured. It is then difficult to distinguish the individual participants and their number from the video conference picture, and in particular the current speaker cannot be identified.
Disclosure of Invention
The disclosure provides a split-screen method of a video conference picture and related equipment, so as to solve or partially solve the above problems.
In a first aspect of the present disclosure, a method for splitting a video conference frame is provided, including:
acquiring a target image acquired by an acquisition unit;
detecting a target object in the target image;
dividing a video conference picture into at least two sub-pictures according to the target object;
and correspondingly displaying the target object in the target image in the at least two sub-pictures.
In a second aspect of the present disclosure, a split-screen device for a video conference screen is provided, including:
an acquisition module configured to: acquiring a target image acquired by an acquisition unit;
a detection module configured to: detecting a target object in the target image;
a partitioning module configured to: dividing a video conference picture into at least two sub-pictures according to the target object;
a display module configured to: and correspondingly displaying the target object in the target image in the at least two sub-pictures.
In a third aspect of the disclosure, a computer device is provided that includes one or more processors, memory; and one or more programs, wherein the one or more programs are stored in the memory and executed by the one or more processors, the programs comprising instructions for performing the method of the first aspect.
In a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium containing a computer program which, when executed by one or more processors, causes the processors to perform the method of the first aspect.
In a fifth aspect of the present disclosure, there is provided a computer program product comprising computer program instructions which, when run on a computer, cause the computer to perform the method of the first aspect.
According to the screen splitting method for a video conference picture and the related equipment provided by the disclosure, target objects are detected in the target image acquired by the acquisition unit, the video conference picture is divided into at least two sub-pictures according to the target objects, and the corresponding target objects are displayed in the sub-pictures. In this way, the video conference picture can be split automatically when the acquired target image contains a plurality of participants, which helps improve the sense of interaction between conference room participants and other online participants during the video conference and thereby improves user experience.
Drawings
In order to more clearly illustrate the technical solutions of the present disclosure or the related art, the drawings needed for describing the embodiments or the related art are briefly introduced below. It is apparent that the drawings in the following description are merely embodiments of the present disclosure, and other drawings may be obtained from them by those of ordinary skill in the art without inventive effort.
Fig. 1A shows a schematic diagram of an exemplary system provided by an embodiment of the present disclosure.
Fig. 1B shows a schematic diagram of a video conference screen captured in the scene shown in fig. 1A.
Fig. 2 shows a schematic diagram of an exemplary flow according to an embodiment of the present disclosure.
Fig. 3A shows a schematic diagram of an exemplary target image, according to an embodiment of the present disclosure.
Fig. 3B shows a schematic diagram of displaying a detection frame in a target image according to an embodiment of the present disclosure.
Fig. 3C shows a schematic diagram of one exemplary video conference screen according to an embodiment of the present disclosure.
Fig. 3D shows a schematic diagram of another exemplary video conference screen according to an embodiment of the present disclosure.
Fig. 3E shows a schematic diagram of a split screen mode according to an embodiment of the present disclosure.
Fig. 3F shows a schematic diagram of another split screen mode according to an embodiment of the present disclosure.
Fig. 3G shows a schematic diagram of yet another exemplary video conference screen according to an embodiment of the present disclosure.
Fig. 3H shows a schematic diagram of face key point detection.
Fig. 3I shows a schematic view of a rotated face.
Fig. 4 shows a schematic diagram of an exemplary method provided by an embodiment of the present disclosure.
Fig. 5 shows a hardware architecture diagram of an exemplary computer device provided by an embodiment of the present disclosure.
Fig. 6 shows a schematic diagram of an exemplary apparatus provided by an embodiment of the present disclosure.
Detailed Description
For the purposes of promoting an understanding of the principles and advantages of the disclosure, reference will now be made to the embodiments illustrated in the drawings and specific language will be used to describe the same.
It should be noted that unless otherwise defined, technical or scientific terms used in the embodiments of the present disclosure should be given the ordinary meaning understood by one of ordinary skill in the art to which the present disclosure pertains. The terms "first," "second," and the like used in embodiments of the present disclosure do not denote any order, quantity, or importance, but are used to distinguish one element from another. The word "comprising" or "comprises" and the like means that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. Words such as "connected" or "coupled" are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, which may change when the absolute position of the described object changes.
Fig. 1A shows a schematic diagram of an exemplary system 100 provided by an embodiment of the present disclosure.
As shown in fig. 1A, the system 100 may include at least one terminal device (e.g., terminal devices 102, 104), a server 106, and a database server 108. The communication link between the terminal devices 102 and 104 and the server 106 and database server 108 may comprise a medium, such as a network, which may include various connection types, such as wired, wireless communication links, or fiber optic cables, etc.
Users 110A-110C may interact with the server 106 over the network using the terminal device 102 to receive or send messages, and similarly, the user 112 may interact with the server 106 over the network using the terminal device 104. Various applications (APPs) may be installed on the terminal devices 102 and 104, such as video conference applications, reading applications, video applications, social applications, payment applications, web browsers, instant messaging tools, and the like. In some embodiments, the users 110A-110C and 112 may use the video conference services provided by the server 106 through video conference applications installed on the terminal devices 102 and 104, respectively. The terminal devices 102 and 104 may capture images 1022 and 1042 via cameras (e.g., cameras provided on the terminal devices 102 and 104), collect live audio via microphones (e.g., microphones provided on the terminal devices 102 and 104), and upload them to the server 106, so that the users 110A-110C and 112 can see each other's pictures and hear each other's voices through the video conference applications on the terminal devices 102 and 104, respectively.
The terminal devices 102 and 104 may be hardware or software. When the terminal devices 102 and 104 are hardware, they may be various electronic devices with display screens, including but not limited to smartphones, tablets, e-book readers, MP3 players, laptop portable computers, desktop computers (PCs), and the like. When the terminal devices 102 and 104 are software, they may be installed in the electronic devices listed above and implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module, which is not specifically limited here.
The server 106 may be a server that provides various services, such as a background server that provides support for the applications displayed on the terminal devices 102 and 104. The database server 108 may also be a database server that provides various services. It will be appreciated that, in cases where the server 106 can implement the relevant functions of the database server 108, the database server 108 may be omitted from the system 100.
The server 106 and database server 108 may be hardware or software. When they are hardware, they may be implemented as a distributed server cluster composed of a plurality of servers, or as a single server. When they are software, they may be implemented as a plurality of software or software modules (e.g., to provide distributed services), or as a single software or software module. The present invention is not particularly limited herein.
It should be noted that, the method for splitting the video conference screen provided in the embodiment of the present application is generally performed by the terminal devices 102 and 104.
It should be understood that the number of terminal devices, users, servers and database servers in fig. 1A is merely illustrative. There may be any number of terminal devices, users, servers, and database servers, as desired for implementation.
Fig. 1B shows a schematic diagram of a video conference screen 120 acquired in the scene shown in fig. 1A.
As shown in fig. 1B, in the scenario shown in fig. 1A, the video conference screen 120 may display the pictures captured by the two cameras in two sub-pictures 1202 and 1204. The camera on the terminal device 102 side captures the entire conference room, so the sub-picture 1202 displays a picture containing the target objects (for example, facial images) corresponding to the users 110A to 110C.
It can be seen that, because the camera installed in the conference room has a fixed position, the sizes and orientations of the target objects of the users 110A to 110C in the sub-picture 1202 differ according to each user's position relative to the camera; moreover, because the conference room camera is usually far from the seats, it is difficult to see each participant clearly in the sub-picture 1202. Therefore, a scheme for automatically splitting the picture captured by a single lens is of great practical significance for improving interaction between the conference room and online participants in a video conference.
In view of this, the embodiments of the present disclosure provide a method for splitting a video conference picture: a target object is detected in the target image acquired by the acquisition unit, the video conference picture is then divided into at least two sub-pictures according to the target object, and the target object in the target image is correspondingly displayed in the sub-pictures. In this way, the video conference picture can be split automatically when the acquired target image contains a plurality of participants, which helps improve the sense of interaction between conference room participants and other online participants during the video conference and thereby improves user experience.
Fig. 2 shows a flow diagram of an exemplary method 200 provided by an embodiment of the present disclosure. The method 200 may be used to automatically split the screen of a video conference picture. Alternatively, the method 200 may be implemented by the terminal devices 102, 104 of fig. 1A, or by the server 106 of fig. 1A. The method 200 is described below as being implemented by the server 106.
As shown in fig. 2, the method 200 may further include the following steps.
In step 202, the server 106 may acquire a target image acquired by an acquisition unit. Taking fig. 1A as an example, the acquisition unit may be a camera provided in the terminal device 102, 104, and the target image may be images 1022 and 1042 acquired by the camera. After the camera captures an image, the terminal device 102, 104 may upload the captured image to the server 106 for processing.
Fig. 3A shows a schematic diagram of an exemplary target image 300, according to an embodiment of the present disclosure.
The target image 300 may be an image acquired by any terminal device of the system 100 participating in the video conference. As shown in fig. 3A, the target image 300 may include target objects 302A-302C corresponding to a plurality of participants.
Next, at step 204, the server 106 may detect a target object in the target image 300. As an optional embodiment, object detection techniques may be employed to detect the target object in the target image 300. Optionally, the target objects 302A to 302C in the target image 300 may be detected using a pre-trained target detection model, obtaining detection frames 304A to 304C corresponding to the target objects 302A to 302C, as shown in fig. 3B.
Further, the position of a participant in the conference room may change at any time; if the detection frame remained fixed after being obtained, it might not follow the participant's face in time when the participant moves. Therefore, in some embodiments, a target tracking technique may be used to track the position change of the target object in real time, with the detection frame changing accordingly, so that the target object can still be tracked even if the participant moves during the video conference. Optionally, a pre-trained target tracking model may be used to track the target objects 302A-302C in the target image 300; this may be a deep-learning-based real-time face box and face key point detection and tracking model, whose structure includes, but is not limited to, various forms of convolutional neural networks and various forms of Transformer networks. In this way, through face detection and tracking on the image, each sub-picture of the split screen can follow the face in the picture in real time.
In some embodiments, after detecting the target object in the target image 300, step 206 may be entered, and the server 106 may determine a split screen layout directly from the target object, thereby dividing the video conference screen based on the split screen layout.
Fig. 3C shows a schematic diagram of an exemplary video conference screen 310 according to an embodiment of the present disclosure. As shown in fig. 3C, the video conference screen 310 is divided into a plurality of sub-pictures. When dividing the sub-pictures, in some embodiments, considering that a video conference scene requires multiparty interaction, the video conference picture may be divided according to the total number of target objects in the target images acquired by the respective terminal devices currently participating in the video conference. For example, taking the scenario shown in fig. 1A as an example, the number of sub-pictures may be determined from the total number of target objects in the images acquired by the terminal devices 102 and 104 respectively, which is 4 in this example. After the number of sub-pictures is determined, the split-screen layout may be determined. Many different split-screen layouts are possible. For example, in order to keep the basic split-screen layout of an existing video conference, the video conference screen 310 may first be divided into n sub-pictures according to the number n of terminal devices, with each terminal device corresponding to one of the n sub-pictures. The sub-picture corresponding to each terminal device is then further divided according to the number of target objects in the picture acquired by that terminal device, yielding the split-screen layout. As shown in fig. 3C, there are left and right sub-pictures corresponding to the images acquired by the two terminal devices 102 and 104 respectively, where the left sub-picture further includes three sub-pictures corresponding to the target objects 302A to 302C respectively, and the right sub-picture displays the picture acquired by the terminal device 104 and may include the target object 312. A complete video conference screen 310 is thus formed. Moreover, since the sub-pictures corresponding to different terminal devices are divided within the screen 310 (e.g., into equal sizes), a user can tell from the split-screen layout how many terminals are participating in the video conference.
It will be appreciated that other split-screen layouts are possible besides the one described in the previous embodiment; for example, the sub-pictures may be divided directly according to the number of all target objects. Fig. 3D shows a schematic diagram of another exemplary video conference screen 320 according to an embodiment of the present disclosure. As shown in fig. 3D, the video conference screen 320 is divided into 4 sub-pictures of equal size according to the total number of target objects, corresponding to the target objects 302A to 302C and the target object 312, respectively. In this way, every participant, whether one of several people in a conference room or someone joining alone, occupies a sub-picture of the same size as everyone else's, so that each participant can clearly see and interact with the others.
In some embodiments, when the split-screen layout is performed, the layout may also be chosen differently according to the number of target objects; for example, the screen may be divided into a square sub-picture array (e.g., n×n sub-pictures) or a non-square sub-picture array (e.g., n×m sub-pictures). For example, as shown in fig. 3D, when the number is 4, 2×2 sub-pictures may be used. For another example, when the number is 3, the picture may be divided into two rows, with one sub-picture in the first row and two sub-pictures in the second row, as in the left part of fig. 3C.
It will be appreciated that the split-screen layout may also be performed in the manner described above when the number of target objects is larger. For example, when the number of target objects is 7, a 3×3 array of sub-pictures may be used, with two sub-pictures left blank, as shown in fig. 3E.
In some embodiments, considering that the aspect ratio of a typical display is not square but a rectangle wider than it is tall (e.g., 16:9), the number of sub-pictures in the horizontal direction may be increased when laying out the split screen. For example, with support for up to 12 participants, when N < 5 the video conference picture may be divided into a single row of N columns; when 5 ≤ N ≤ 8, into two rows, the first row having N/2 columns (rounded up or down) and the second row having N - N/2 columns (rounded down or up); and when 9 ≤ N ≤ 12, into three rows, the first two rows each having N/3 columns (rounded up or down) and the last row having N - (N/3)×2 columns (rounded down or up). Each sub-picture in the last row has the same width as the sub-pictures in the preceding rows and may be centered. Fig. 3F shows a schematic diagram of yet another exemplary video conference screen according to an embodiment of the present disclosure. As shown in fig. 3F, when the number of target objects is 7, a 4+3 layout can be obtained in this manner, as sketched below.
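The row and column counts described above can be computed directly from the number of target objects. The following is a minimal sketch, not part of the claimed method itself; the function name and the choice of rounding up for the first row(s) are assumptions made for illustration:

```python
import math

def split_screen_rows(n: int) -> list[int]:
    """Return the number of sub-pictures per row for n target objects (n <= 12)."""
    if n <= 0:
        return []
    if n < 5:                        # single row of n columns
        return [n]
    if n <= 8:                       # two rows
        first = math.ceil(n / 2)     # rounding up here; rounding down is also allowed
        return [first, n - first]
    if n <= 12:                      # three rows
        per_row = math.ceil(n / 3)
        return [per_row, per_row, n - 2 * per_row]
    raise ValueError("layouts with more than 12 participants are not covered here")

# For 7 target objects this gives [4, 3], matching the 4+3 layout of Fig. 3F.
print(split_screen_rows(7))  # [4, 3]
```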
In this way, by detecting the target objects in the target image and dividing the video conference picture into at least two sub-pictures according to their number, with the corresponding target objects displayed in the sub-pictures, the video conference picture can be split automatically when the acquired target image contains a plurality of participants. This helps conference room participants and online participants interact during the video conference and further improves user experience.
In some embodiments, in addition to the number of target objects, whether a participant is speaking may also be detected when performing the split-screen layout, and corresponding processing may then be applied to the sub-picture of the participant detected to be speaking. Thus, as shown in fig. 2, the method 200 further includes a step 208 of performing speaker detection on the target objects. Optionally, this step may be processed in parallel with target object detection, thereby increasing the processing speed.
Optionally, when it is determined that a target object in the target image is speaking (i.e., the participant corresponding to the target object is speaking), an indication identifier, for example an icon indicating that the participant is speaking, may be displayed in the sub-picture corresponding to the speaking participant. As shown in fig. 3D, a microphone-shaped icon may be displayed in the sub-picture corresponding to the speaking participant, thereby reminding others that this participant is speaking.
As an optional embodiment, the split-screen layout may also be changed according to the result of speaker detection.
Optionally, if it is determined from the detection result that a target object in the target image is speaking, the video conference picture may be divided into at least two sub-pictures according to a first split-screen mode; if it is determined from the detection result that none of the participants of the video conference is speaking, the video conference picture is divided into at least two sub-pictures according to a second split-screen mode. The second split-screen mode may be any of the split-screen layouts in the foregoing embodiments.
Further, dividing the video conference picture into at least two sub-pictures according to the first split-screen mode includes: enlarging and displaying a first sub-picture of the at least two sub-pictures, and displaying the other sub-pictures in parallel on at least one side of the first sub-picture, where the first sub-picture may be used to display the speaking target object.
Fig. 3G shows a schematic diagram of yet another exemplary video conference screen 330 according to an embodiment of the present disclosure. As shown in fig. 3G, the screen 330 includes four sub-pictures corresponding to the target objects 302A to 302C and the target object 312, respectively; the first sub-picture 3302 is enlarged and corresponds to the target object 302A of the speaking participant 110A, and the other sub-pictures are displayed in parallel on one side of the first sub-picture 3302. In this way, according to the speaker detection result, the sub-picture of the speaking participant is placed in the middle of the screen and occupies a larger area, while the sub-pictures of the non-speaking participants are placed at the side and occupy smaller areas, which better improves interactivity.
Thus, when someone is speaking, a speaker mode (the first split-screen mode) is adopted, in which the person currently speaking is placed in the largest sub-picture and the remaining participants are arranged side by side on at least one side of it (they may be placed on two or more sides when their number is large). When no one is speaking, the ordinary split-screen mode (the second split-screen mode) is adopted, and every sub-picture has the same size. By selecting different split-screen modes according to the speaker detection result, video conference interactivity can be increased and user experience improved.
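As an illustrative sketch of how the two modes could be laid out in normalized screen coordinates: the proportions below (a speaker tile taking roughly three quarters of the width, the remaining participants stacked in a single column on the right) are assumptions for illustration and are not specified by the disclosure.

```python
def layout_sub_pictures(num_objects, speaker_index=None):
    """Return (x, y, w, h) rectangles in [0, 1] coordinates, one per sub-picture.

    speaker_index is None when no one is speaking (second split-screen mode:
    an equal-size layout, here a single row for brevity); otherwise the speaker
    gets the enlarged first sub-picture and the others are placed at its side
    (first split-screen mode).
    """
    if speaker_index is None:
        w = 1.0 / num_objects
        return [(i * w, 0.0, w, 1.0) for i in range(num_objects)]

    rects = [None] * num_objects
    rects[speaker_index] = (0.0, 0.0, 0.75, 1.0)       # enlarged first sub-picture
    others = [i for i in range(num_objects) if i != speaker_index]
    h = 1.0 / max(len(others), 1)
    for row, i in enumerate(others):                   # side strip, stacked vertically
        rects[i] = (0.75, row * h, 0.25, h)
    return rects

# With 4 participants and participant 0 speaking (as in Fig. 3G):
for rect in layout_sub_pictures(4, speaker_index=0):
    print(rect)
```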
Generally, the video stream and audio stream of each online participant are independent, so in the related art video conference software can confirm from the audio stream whether the person in the corresponding video is speaking. However, the people participating from a conference room share one video stream and one audio stream: from the audio stream collected by the terminal device on the conference room side, it cannot be determined whether anyone in the conference room is speaking, let alone who is currently speaking, which reduces the interactivity of the conference.
In view of this, in some embodiments, speaker detection is performed by way of image processing, which may avoid the problem of being unable to discern, via the audio stream, who is currently speaking.
As an alternative embodiment, the server 106 may perform keypoint detection on each detected target object, and then determine whether the participant corresponding to the target object in the target image is speaking according to the result of the keypoint detection.
Fig. 3H shows a schematic diagram of face key point detection.
As shown in fig. 3H, face key point detection may employ a 68-key-point scheme, in which the key points are distributed over the face: points 0-16 correspond to the chin, points 17-21 to the right eyebrow (mirrored here, i.e., the right eyebrow of the person in the figure), points 22-26 to the left eyebrow, points 27-35 to the nose, points 36-41 to the right eye, points 42-47 to the left eye, and points 48-67 to the lips. The face can be identified by detecting these key points, and whether the corresponding participant is speaking can be determined from the change of the key points across consecutive frames of the target object.
It should be noted that the face key point detection method of 68 key points is only an example, and it is understood that the face key point detection may also have other key point numbers, for example, 21 key points, 29 key points, and so on.
As an alternative embodiment, 106 keypoints may be used to implement keypoint detection, so that a more accurate detection result can be obtained.
In some embodiments, after the key points of the target object are detected, the lip height and the lip width may be determined based on the key points of the target object; the lip height-to-width ratio of the target object is then obtained from the lip height and the lip width, and whether the target object is speaking is determined based on the change information of this lip aspect ratio. In this way, speaker detection is implemented by image processing, which avoids the problem that the audio stream cannot distinguish who is currently speaking.
Further, in some embodiments, considering that rotation of the face may change the positions of the lip key points, the detected key points may be corrected according to the rotation angle of the face when key point detection is performed.
Fig. 3I shows a schematic view of a rotated face.
As shown in fig. 3I, for a human face, there are three rotation angles, yaw angle (Yaw), roll angle (Roll), pitch angle (Pitch), in three-dimensional space.
As an alternative embodiment, the effect of Roll rotation can be cancelled by affine transformation, and then the effect of Pitch and Yaw rotation can be cancelled by the Pitch and Yaw information of face detection.
Specifically, a plurality of key points (e.g., the coordinates of 106 key points) may be detected from the target object. The key points of the target object are then put into correspondence with the key points of a standard (average) face (i.e., standard key points), from which an affine transformation matrix (a mapping relationship) can be obtained. Applying this affine transformation matrix to the key points detected from the target object corrects their roll angle (Roll), yielding corrected key points, that is, the key point coordinates of the currently detected target object when Roll = 0.
Further, from the corrected key points, a plurality of first key points corresponding to the lip height and a plurality of second key points corresponding to the lip width may be selected; pitch angle correction is performed on the first key points to obtain the corrected lip height, and yaw angle correction is performed on the second key points to obtain the corrected lip width, so that the corrected lip height and width cancel the influence of Pitch and Yaw rotation. As an optional embodiment, taking 106 key points as an example, the length of the line segment between points 98 and 102 may be calculated to represent the lip height and divided by cos(Pitch) to cancel the influence of Pitch, giving the corrected lip height. Similarly, the length of the line segment between points 96 and 100 is calculated to represent the lip width and divided by cos(Yaw) to cancel the influence of Yaw, giving the corrected lip width. The Pitch and Yaw angle information may be provided by the face detection module.
Then, the key point detection result (the detection result for the lip height and the lip width) can be obtained from the corrected lip height and the corrected lip width, so that how wide the mouth is open is represented by the corrected lip height-to-width ratio.
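A minimal sketch of the corrected lip aspect ratio computation described above. The 106-point indices (98 and 102 for lip height, 96 and 100 for lip width) follow the description; the affine-alignment helper, the degree-based angle convention, and the clamping of small cosines are assumptions made for illustration:

```python
import math
import numpy as np
import cv2

def corrected_lip_aspect_ratio(points, standard_points, pitch_deg, yaw_deg):
    """points, standard_points: (106, 2) float32 arrays of face key points.

    pitch_deg / yaw_deg: head pose angles reported by the face detection module.
    Returns the lip height-to-width ratio with Roll, Pitch and Yaw compensated.
    """
    # Roll correction: estimate a similarity/affine transform onto the
    # standard (average) face and apply it, giving key points with Roll = 0.
    m, _ = cv2.estimateAffinePartial2D(points, standard_points)
    corrected = points @ m[:, :2].T + m[:, 2]

    # Lip height: segment between points 98 and 102, divided by cos(Pitch).
    lip_height = np.linalg.norm(corrected[98] - corrected[102])
    lip_height /= max(math.cos(math.radians(pitch_deg)), 1e-3)

    # Lip width: segment between points 96 and 100, divided by cos(Yaw).
    lip_width = np.linalg.norm(corrected[96] - corrected[100])
    lip_width /= max(math.cos(math.radians(yaw_deg)), 1e-3)

    return lip_height / max(lip_width, 1e-6)
```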
Considering that speaking is a dynamic process, the lip aspect ratio at the current moment alone may not be enough to accurately determine whether the current participant is speaking. Therefore, in some embodiments, the change in the lip aspect ratio may be maintained over a period of time, and the variance of the aspect ratio over that period may be used to determine whether the target object is currently speaking.
As an alternative embodiment, the lip aspect ratio of the target object may be calculated and stored according to the corrected lip height and the corrected lip width.
Then, when determining from the key point detection result whether the participant corresponding to the target object in the target image is speaking, the change information of the lip aspect ratio of the target object is determined from the key point detection result, and whether the participant is speaking is determined according to this change information. For example, whether the variance of the lip aspect ratio over a preset period of time (e.g., within 1 s) is greater than a variance threshold may be checked; if so, the participant is determined to be speaking. In this way, the judgment of whether someone is speaking becomes more accurate.
In order to stabilize the speaking detection, in some embodiments a counter may be maintained to record how many times the target object has been determined to be speaking over the most recent period, and whether the object is speaking is then decided by comparing this count with a preset count threshold. Optionally, determining whether the target object is speaking based on the change information of the lip aspect ratio includes: setting a preset time period (e.g., within 2 s); counting the number of changes of the lip aspect ratio within the preset time period; and when the number of changes reaches a preset number, determining that the target object is speaking. In this way, the temporal stability of speaker detection is enhanced by using the time-series information of the key points, and fluctuation of the detection state is reduced.
As an optional embodiment, in response to determining from the change information that the participant corresponding to the target object in the target image is speaking (i.e., the target object is determined to be speaking in the current frame), the count value is incremented by 1; and in response to determining from the change information that the participant corresponding to the target object is not speaking (i.e., the target object is determined not to be speaking in the current frame), the count value is decremented by 1.
Then, whether the participant corresponding to the target object in the target image is speaking may be determined from the count value within a preset period of time (e.g., within 2 s). For example, when the value of the counter is greater than a preset count threshold (e.g., 2), the speaking effect for the object may be displayed (e.g., enlarging the sub-picture corresponding to the speaker and/or displaying a microphone icon); when the value of the counter is less than the preset count threshold, the speaking effect may be cancelled (e.g., restoring the speaker's sub-picture to the same size as the other sub-pictures and/or hiding the microphone icon).
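A sketch of the temporal logic described above, combining the variance test on the lip aspect ratio with the counter-based smoothing. The frame rate, window lengths, variance threshold, and count threshold below are example values consistent with the figures mentioned above (roughly 1 s for the variance window, 2 s and a count of 2 for the display decision); they are assumptions, not values fixed by the disclosure.

```python
from collections import deque
import numpy as np

class SpeakerState:
    def __init__(self, fps=25, var_window_s=1.0, count_window_s=2.0,
                 var_threshold=1e-3, count_threshold=2):
        self.ratios = deque(maxlen=int(fps * var_window_s))    # recent lip aspect ratios
        self.counts = deque(maxlen=int(fps * count_window_s))  # +1 / -1 per frame
        self.var_threshold = var_threshold
        self.count_threshold = count_threshold

    def update(self, lip_aspect_ratio: float) -> bool:
        """Feed the current frame's corrected lip aspect ratio; return whether
        the speaking effect should currently be displayed."""
        self.ratios.append(lip_aspect_ratio)

        # Per-frame decision: the variance of the ratio over the last ~1 s
        # must exceed the threshold for this frame to count as "speaking".
        speaking_now = (len(self.ratios) > 1 and
                        float(np.var(self.ratios)) > self.var_threshold)

        # Counter smoothing: +1 when speaking in the current frame, -1 otherwise.
        self.counts.append(1 if speaking_now else -1)

        # Display the speaking effect only when the accumulated count over the
        # last ~2 s exceeds the preset count threshold.
        return sum(self.counts) > self.count_threshold
```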
Thus, after screen splitting, the position of the lips is determined by face key point detection, and whether the current participant is speaking is judged from the relative positions of the lip key points. Meanwhile, in order to reduce misjudgments of speaking detection caused by actions such as face movement and rotation, the detected face key points are mapped onto an unrotated standard face before the speaking judgment is made, which reduces the influence of face movement. In addition, the time-series information of the key points is used to enhance the temporal stability of speaker detection and reduce fluctuation of the detection state.
Considering that determining the speaker purely by image processing may be error-prone, in some embodiments, after the number of changes is determined to have reached the preset number, the audio data of the video conference may additionally be acquired, and whether the target object in the target image is speaking is then determined from the key point detection result in combination with the audio data of the video conference. Combining the key point detection result with the audio data of the current video conference can further improve the accuracy of the speaker judgment. As an optional embodiment, when the pickup that collects the audio is a dual-microphone pickup, the speaker can also be localized from the two channels of audio data it collects, further improving the accuracy of the speaker judgment.
In the related art, some automatic video conference split-screen software determines the positions of the participants in the current picture through human body detection, and crops and lays out the picture according to those positions to implement automatic screen splitting. When a participant enters or leaves the conference room, it is difficult for such software to change the split screen in real time.
Therefore, in order to allow the number of split screens to increase or decrease in real time, after the split-screen layout is determined, step 210 may be entered, as shown in fig. 2, to associate the target objects and their detection frames with the sub-pictures according to the detection frame positions and the split-screen layout.
According to the foregoing embodiments, the target object in the target image may be detected using target detection or target tracking techniques, yielding a detection frame corresponding to the target object, e.g., the detection frames 304A-304C of fig. 3B. Each detected target object therefore has a corresponding detection frame, and the position of each target object in the split-screen layout can then be determined from the layout, so that the detection frame of the target object is associated with its position in the split-screen layout. In this way, by performing detection on the target image 300, the relative positional relationship of the different faces is determined from the positions of their detection frames, the split-screen positions are associated with the face positions, and the detection frames are then associated with the split-screen layout. The number of sub-pictures can thus be increased or decreased and the split-screen layout changed according to the detection result, so that when a person joins or leaves the conference room the relative positions of the original participants remain unchanged while the number of split screens changes in real time.
Optionally, the ROI (Region of Interest) corresponding to each participant in the split-screen layout may be determined from the coordinates of that participant's detection frame in the original image (the target image), so that the one-to-one correspondence among person, detection frame, and sub-picture is determined from the position of each person's detection frame and the split-screen layout.
Then, as shown in fig. 2, at step 212, the content of the detection frame may be matched with the sub-picture. Optionally, the coordinates of the detection frame corresponding to the target object and the coordinates of the sub-picture corresponding to the target object may be determined first; then, according to these two sets of coordinates, the image corresponding to the detection frame is translated and/or scaled into the sub-picture corresponding to the target object. By scaling the target object, participants can clearly see the others, which improves the interactivity of the conference.
As an optional embodiment, after each detection frame has been associated with its split-screen sub-picture, the detection frame is first expanded to both sides by certain proportions (for example, 20% in height and 40% in width); on this basis, its width or height is further expanded to both sides until its aspect ratio matches that of the corresponding sub-picture, and if the expanded ROI exceeds the picture boundary it is translated back into the picture range. The content of the detection frame can thus be matched to the sub-picture, as sketched below.
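A sketch of the detection-frame expansion just described, assuming the frame and the full picture are given as (x1, y1, x2, y2) rectangles in pixels; the helper name and the centering choices are assumptions for illustration:

```python
def fit_roi_to_sub_picture(box, sub_aspect, frame_w, frame_h):
    """Expand a face detection box (x1, y1, x2, y2) to match the aspect ratio
    of its sub-picture (sub_aspect = width / height), then translate it back
    inside the frame if it exceeds the picture boundary."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2

    # Step 1: expand by fixed proportions (e.g. 40% in width, 20% in height).
    w *= 1.4
    h *= 1.2

    # Step 2: expand width or height further until the aspect ratio matches
    # that of the corresponding sub-picture.
    if w / h < sub_aspect:
        w = h * sub_aspect
    else:
        h = w / sub_aspect

    # Step 3: clamp to the frame size and translate the ROI back into range.
    w, h = min(w, frame_w), min(h, frame_h)
    x1 = min(max(cx - w / 2, 0), frame_w - w)
    y1 = min(max(cy - h / 2, 0), frame_h - h)
    return (x1, y1, x1 + w, y1 + h)
```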
In some embodiments, when the image corresponding to the detection frame is translated and scaled, the ROI displayed in each sub-picture at the current moment can be calculated by linear interpolation, so that the view moves and zooms smoothly from the current picture to the target face, achieving a pan-tilt-zoom effect similar to that of a surveillance camera when the split-screen effect is switched.
Specifically, assume that the original coordinates of a certain sub-picture are represented by two opposite vertices of a rectangle (e.g., upper-left and lower-right) as (x1_0, y1_0, x2_0, y2_0), the pan-zoom duration is T, and the coordinates of the ROI of the target face's detection frame are likewise represented as (x1_t, y1_t, x2_t, y2_t). Then:
first, a time interval Δt can be determined, and then the number of linear interpolations is determined from the pan scaling duration T and the time interval Δt.
And then, the number of the linear interpolation and the updated coordinates of the sub-picture corresponding to each interpolation are determined according to the original coordinates of the sub-picture and the coordinates of the ROI of the detection frame. For each interpolation, the updated coordinates may be equally spaced relative to the coordinates corresponding to the previous interpolation.
Then, according to the updated coordinates of the sub-picture corresponding to each interpolation, the sub-picture is gradually subjected to horizontal transformation-vertical transformation-scaling processing according to the mode of the equal time interval until the duration reaches T.
Thus, from the moment the split-screen function is turned on until the translation and scaling of the sub-picture is completed, a transition of duration T is provided, giving a visual pan-tilt-zoom effect similar to that of a surveillance camera. This improves user experience and avoids the rather abrupt switching of the related art, which either cuts directly or simply scales the cropped frames. A sketch of the interpolation is given below.
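A minimal sketch of the linear interpolation described above; the generator form and the fixed time step are assumptions for illustration:

```python
def pan_zoom_rois(start_roi, target_roi, duration_s, dt_s):
    """Yield the sub-picture ROI at each time step, moving linearly from
    start_roi to target_roi over duration_s.

    start_roi:  (x1_0, y1_0, x2_0, y2_0) original coordinates of the sub-picture
    target_roi: (x1_t, y1_t, x2_t, y2_t) ROI of the target face's detection frame
    duration_s: total pan-zoom duration T
    dt_s:       time interval between interpolation steps
    """
    steps = max(int(duration_s / dt_s), 1)      # number of linear interpolations
    for k in range(1, steps + 1):
        t = k / steps                           # equal increment per step
        yield tuple(s + (e - s) * t for s, e in zip(start_roi, target_roi))

# Example: move from the full picture to a face ROI over T = 1 s in 40 ms steps.
for roi in pan_zoom_rois((0, 0, 1920, 1080), (600, 200, 1240, 560), 1.0, 0.04):
    pass  # crop and scale the frame to `roi` for this sub-picture
```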
In some embodiments, the processing above may reduce the resolution of the sub-pictures, so a super-resolution technique may be used to increase their resolution and thereby improve clarity.
In the related art, a virtual background function for sub-pictures is generally not supported. Embodiments of the present disclosure provide this virtual background functionality to fill the gap.
Thus, as shown in fig. 2, at step 214 it may be determined whether the virtual background function is turned on. If it is turned on, step 216 is entered and semantic segmentation of the target objects is performed on the current image (this may be processed in parallel with face detection to improve efficiency), so that the virtual background function of each split screen is implemented using the semantic segmentation capability. Optionally, a pre-trained semantic segmentation model may be used to segment the target object from the background. The semantic segmentation model may be a deep-learning-based real-time portrait segmentation model, with a structure including, but not limited to, various forms of convolutional neural networks and various forms of Transformer networks.
In some embodiments, a portrait segmentation function may be applied to the entire current input image. After the split-screen function is turned on, the segmentation result of each sub-picture is the portion of the full-image segmentation result that falls within the ROI corresponding to that sub-picture. Then, for each pixel of each sub-picture, the pixel value of the virtual background result at that pixel is calculated from the value of the segmentation result at that pixel (normalized to [0, 1]), the corresponding pixel value in the input image, and the pixel value of the new background to be substituted at that pixel.
Specifically, for each pixel, a first value (the value of the segmentation result at that pixel, normalized to [0, 1]) is multiplied by the corresponding pixel value in the input image, a second value (1 minus the first value) is multiplied by the pixel value of the new background at that pixel, and the two products are added to give the pixel value of the virtual background result at that pixel.
Thus, the process of replacing the real background with the virtual background is completed.
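A sketch of the per-pixel compositing described above, assuming the segmentation result is a single-channel mask normalized to [0, 1] with the same size as the input image, and that cropping the mask to a sub-picture's ROI is a simple array slice:

```python
import numpy as np

def composite_virtual_background(input_img, seg_mask, new_background, roi):
    """Replace the real background inside one sub-picture's ROI.

    input_img:      (H, W, 3) uint8 original frame
    seg_mask:       (H, W) float portrait segmentation mask, values in [0, 1]
    new_background: (H, W, 3) uint8 background to substitute
    roi:            (x1, y1, x2, y2) ROI of the sub-picture in the original frame
    """
    x1, y1, x2, y2 = roi
    person = input_img[y1:y2, x1:x2].astype(np.float32)
    alpha = seg_mask[y1:y2, x1:x2, None]           # per-pixel "first value"
    background = new_background[y1:y2, x1:x2].astype(np.float32)

    # result = alpha * input + (1 - alpha) * new background, per pixel
    out = alpha * person + (1.0 - alpha) * background
    return out.astype(np.uint8)
```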
Next, step 218 may be entered to render the video conference screen.
Optionally, each sub-picture is rendered using the content of its corresponding ROI of the original image; if the virtual background function is turned on, the background is replaced by combining the portrait segmentation result with the background to be substituted; and if any special treatment is required for the detected speaker, it is also completed in this step. For example, the target object corresponding to the speaking participant is displayed in the first sub-picture 3302, and a virtual background is displayed in the second sub-picture 3304.
Processing of the current frame then ends, and the flow proceeds to the next frame.
According to the above embodiments, the automatic split-screen system for video conferences uses the automatic video split-screen function to realize "face-to-face" communication between people sitting in the same conference room and remotely participating colleagues. In some embodiments, speaker detection makes it easy to identify who in the current conference room is speaking and to present that speaker's video in a distinct position, improving the video conference experience. In some scenes, for example outdoor live streaming, the subject may occupy only a very small portion of the picture because of the location and shooting distance; the software-implemented PTZ function can then automatically track and frame the subject without any manual operation.
In the foregoing embodiments, the server 106 is described as the execution subject; in practice, however, the processing steps are not limited to that execution subject. For example, the terminal devices 102 and 104 may also implement the processing steps, in which case the terminal devices 102 and 104 are the execution subject of the foregoing embodiments.
The embodiment of the disclosure also provides a split screen method of the video conference picture. Fig. 4 shows a flow diagram of an exemplary method 400 provided by an embodiment of the present disclosure. The method 400 may be applied to the server 106 of fig. 1A, as well as to the terminal devices 102, 104 of fig. 1A. As shown in fig. 4, the method 400 may further include the following steps.
In step 402, a target image acquired by an acquisition unit is acquired.
Taking fig. 1A as an example, the acquisition unit may be a camera provided in the terminal device 102, 104, and the target image may be images 1022 and 1042 acquired by the camera.
At step 404, a target object (e.g., target objects 302A-302C of FIG. 3A) in the target image (e.g., image 300 of FIG. 3A) is detected.
As an alternative embodiment, object Detection (Object Detection) techniques may be employed to detect the target Object in the target image 300. Alternatively, the target objects 302A to 302C in the target image 300 may be detected using a pre-trained target detection model, and detection frames 304A to 304C corresponding to the target objects 302A to 302C may be obtained, as shown in fig. 3B.
In step 406, the video conference picture is divided into at least two sub-pictures according to the target object.
In step 408, the target object in the target image is correspondingly displayed in the at least two sub-frames, as shown in fig. 3C to 3G.
According to the screen splitting method for a video conference picture provided by the embodiments of the disclosure, a target object is detected in the target image acquired by the acquisition unit, the video conference picture is divided into at least two sub-pictures according to the target object, and the corresponding target objects are displayed in the sub-pictures, so that the video conference picture can be automatically split when the acquired target image contains a plurality of participants. This helps improve the sense of interaction between conference room participants and other online participants during the video conference and thereby improves user experience.
In some embodiments, dividing the video conference picture into at least two sub-pictures according to the number of target objects comprises: determining whether a target object in the target image is speaking; and in response to determining that a target object in the target image is speaking, dividing the video conference picture into at least two sub-pictures according to a first split-screen mode, as shown in fig. 3G. Thus, when someone is speaking, a speaker mode (the first split-screen mode) is adopted, in which the person currently speaking is placed in the largest sub-picture and the remaining participants are arranged side by side on at least one side of it (they may be placed on two or more sides when their number is large). When no one is speaking, the ordinary split-screen mode (the second split-screen mode) is adopted, and every sub-picture has the same size. By selecting different split-screen modes according to the speaker detection result, video conference interactivity can be increased and user experience improved.
In some embodiments, dividing the video conference picture into at least two sub-pictures according to a first split screen mode includes: a first sub-picture (e.g., sub-picture 3302 of fig. 3G) of the at least two sub-pictures is enlarged and displayed, and other sub-pictures of the at least two sub-pictures are displayed in parallel on at least one side of the first sub-picture, the first sub-picture being for displaying a target object being speaking.
In this way, according to the speaker detection result, the sub-picture of the speaking participant is placed in the middle of the screen and occupies a larger area, while the sub-pictures of the non-speaking participants are placed at the side and occupy smaller areas, which better improves interactivity.
In some embodiments, determining whether a target object in the target image is speaking comprises: detecting key points of the target object; determining a lip height based on the keypoints of the target object; determining a lip width based on the keypoints of the target object; obtaining the lip height-width ratio of the target object according to the lip height and the lip width; based on the change information of the lip aspect ratio, whether the target object is speaking or not is determined, so that the speaker detection can be performed in an image processing mode, and the problem that the current specific speaking person cannot be distinguished through an audio stream can be avoided.
In some embodiments, determining whether a target object in the target image is speaking comprises: detecting key points of the target object; and determining whether the target object in the target image is speaking according to the key point detection result, so that the speaker detection can be performed in an image processing mode, and the problem that the current specific speaking person cannot be distinguished through the audio stream can be avoided.
In some embodiments, performing keypoint detection on the target object includes:
detecting a plurality of key points according to the target object;
according to the corresponding relation between the key points and the standard key points, carrying out rolling angle correction on the key points to obtain corrected key points;
selecting a plurality of first key points corresponding to the lip height from the corrected plurality of key points, and correcting pitch angles of the plurality of first key points to obtain corrected lip height;
selecting a plurality of second key points corresponding to the lip width from the corrected plurality of key points, and correcting yaw angles of the plurality of second key points to obtain corrected lip width;
and obtaining the key point detection result according to the corrected lip height and the corrected lip width.
In order to reduce misjudgments of speaking detection caused by actions such as face movement and rotation, the detected face key points are mapped onto an unrotated standard face before the speaking judgment is made, thereby reducing the influence of face movement.
In some embodiments, the method further comprises: calculating the lip height-width ratio of the target object according to the corrected lip height and the corrected lip width, and storing the lip height-width ratio;
determining whether a target object in the target image is speaking according to the key point detection result comprises the following steps: and determining the change information corresponding to the lip aspect ratio of the target object in the target image according to the key point detection result, and determining whether the target object in the target image is speaking or not according to the change information.
In this way, the stability of speaker detection in time sequence is enhanced by using the time sequence information of the key points, and the fluctuation of the detection state is reduced.
In some embodiments, determining whether a participant corresponding to a target object in the target image is speaking comprises: setting a preset time period; counting the change times of the lip height-width ratio in the preset time period; and when the number of changes reaches a preset number, determining that the target object is speaking. In this way, the stability of speaker detection in time sequence is enhanced by using the time sequence information of the key points, and the fluctuation of the detection state is reduced.
In some embodiments, determining that the target object is speaking when the number of changes reaches a preset number includes: determining that the number of changes reaches the preset number; acquiring audio data of the video conference; and determining, in combination with the audio data of the video conference, that the target object is speaking. Combining the key point detection result with the audio data of the current video conference further improves the accuracy of speaker judgment. As an optional embodiment, when the pickup collecting the audio is a dual-channel (binaural) pickup, the speaker can also be localized from the two sets of audio data it collects, further improving the accuracy of speaker judgment.
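One possible way to combine the key point result with the conference audio is a simple energy gate on the current audio frame; the mono float format and the threshold are assumptions, and speaker localization from two channels is outside this sketch.

```python
import numpy as np

def confirm_with_audio(lip_says_speaking, audio_frame, energy_threshold=0.01):
    """Confirm a lip-based decision only when the conference audio is active.

    audio_frame is assumed to be a mono float array in [-1, 1]; the threshold
    is an illustrative value, not one given in the disclosure.
    """
    if len(audio_frame) == 0:
        return False
    rms = float(np.sqrt(np.mean(np.square(audio_frame))))
    return lip_says_speaking and rms > energy_threshold
```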
In some embodiments, detecting a target object in the target image includes: detecting the target object in the target image using a target detection or target tracking technique, to obtain a detection frame of the target object;
and correspondingly displaying the target object in the target image in the at least two sub-pictures includes: determining the coordinates of the detection frame corresponding to the target object; determining the coordinates of the sub-picture corresponding to the target object; and translating and/or scaling the image within the detection frame into the sub-picture corresponding to the target object according to the coordinates of the detection frame and the coordinates of the sub-picture.
By scaling the target object in this way, the participants can clearly see the other attendees, which improves the interactivity of the conference.
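A sketch of the translate-and/or-scale step, assuming (x, y, w, h) pixel boxes and an aspect-preserving, centered placement (the letterboxing choice is an assumption rather than a requirement of the disclosure):

```python
import cv2

def place_in_subpicture(frame, det_box, canvas, sub_box):
    """Crop the detection frame from the source image and fit it into a sub-picture.

    det_box and sub_box are (x, y, w, h) in pixels; frame and canvas are BGR images.
    """
    dx, dy, dw, dh = det_box
    sx, sy, sw, sh = sub_box
    crop = frame[dy:dy + dh, dx:dx + dw]
    # Scale the crop to fit the sub-picture while preserving its aspect ratio.
    scale = min(sw / dw, sh / dh)
    new_w, new_h = int(dw * scale), int(dh * scale)
    resized = cv2.resize(crop, (new_w, new_h))
    # Translate: center the scaled crop inside the sub-picture region.
    off_x = sx + (sw - new_w) // 2
    off_y = sy + (sh - new_h) // 2
    canvas[off_y:off_y + new_h, off_x:off_x + new_w] = resized
    return canvas
```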
In some embodiments, correspondingly displaying the target object in the target image in the at least two sub-pictures further includes: in response to determining that the virtual background function of a sub-picture is turned on, segmenting the target object from the background using a segmentation technique to obtain a segmentation result; and displaying a virtual background in that sub-picture according to the segmentation result, which fills a gap in the related art where no virtual background is displayed in sub-pictures.
In some embodiments, correspondingly displaying the target object in the target image in the at least two sub-pictures includes: in response to determining that the virtual background function of a second sub-picture of the at least two sub-pictures is turned on, displaying a virtual background in the second sub-picture (e.g., sub-picture 3304 of fig. 3G). This likewise fills a gap in the related art where no virtual background is displayed in sub-pictures.
In some embodiments, the method further comprises: segmenting the target object in the target image from the background using a semantic segmentation technique to obtain a segmentation result.
Displaying a virtual background in the second sub-picture then includes: displaying the virtual background in the second sub-picture according to the segmentation result.
In this way, the target object is separated from the real background by semantic segmentation, so that the virtual background replacement is well realized.
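Assuming a person-segmentation model has already produced a soft mask in [0, 1] (the model itself is outside this sketch), the virtual background replacement can be a per-pixel blend:

```python
import cv2
import numpy as np

def apply_virtual_background(frame, mask, background):
    """Blend a virtual background behind the segmented target object.

    mask is assumed to be a single-channel float map in [0, 1] from a
    person-segmentation model; frame and background are BGR images.
    """
    background = cv2.resize(background, (frame.shape[1], frame.shape[0]))
    alpha = mask.astype(np.float32)[..., None]  # HxWx1 for broadcasting
    composited = (alpha * frame.astype(np.float32)
                  + (1.0 - alpha) * background.astype(np.float32))
    return composited.astype(np.uint8)
```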
In some embodiments, correspondingly displaying the target object in the target image in the at least two sub-pictures further includes: in response to determining that the target object is speaking, displaying an indication mark in the sub-picture corresponding to the target object, thereby reminding the other participants that the person in that sub-picture is speaking and improving interactivity.
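The form of the indication mark is not fixed by the disclosure; a minimal sketch might draw a colored border and a corner dot on the speaker's sub-picture:

```python
import cv2

def draw_speaking_indicator(canvas, sub_box, color=(0, 200, 0)):
    """Draw a border plus a corner dot on the speaking participant's sub-picture.

    The visual style and color are illustrative assumptions.
    """
    x, y, w, h = sub_box
    cv2.rectangle(canvas, (x, y), (x + w, y + h), color, thickness=3)
    cv2.circle(canvas, (x + w - 20, y + 20), 8, color, thickness=-1)
    return canvas
```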
It should be noted that the method of the embodiments of the present disclosure may be performed by a single device, such as a computer or a server. The method of the embodiments may also be applied in a distributed scenario and completed by multiple devices cooperating with one another. In such a distributed scenario, one of the devices may perform only one or more steps of the method of the embodiments of the present disclosure, and the devices interact with one another to complete the method.
It should be noted that the foregoing describes some embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The disclosed embodiments also provide a computer device for implementing the above-described method 200 or 400. Fig. 5 shows a hardware architecture diagram of an exemplary computer device 500 provided by an embodiment of the present disclosure. The computer device 500 may be used to implement the server 106 of fig. 1A, as well as the terminal devices 102, 104 of fig. 1A. In some scenarios, the computer device 500 may also be used to implement the database server 108 of fig. 1A.
As shown in fig. 5, the computer device 500 may include: a processor 502, a memory 504, a network interface 506, a peripheral interface 508, and a bus 510. The processor 502, the memory 504, the network interface 506 and the peripheral interface 508 are communicatively connected to one another within the computer device 500 via the bus 510.
The processor 502 may be a central processing unit (CPU), an image processor, a neural network processor (NPU), a microcontroller (MCU), a programmable logic device, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or one or more integrated circuits. The processor 502 may be used to perform functions related to the techniques described in this disclosure. In some embodiments, the processor 502 may also include multiple processors integrated as a single logical component. For example, as shown in fig. 5, the processor 502 may include a plurality of processors 502a, 502b, and 502c.
The memory 504 may be configured to store data (e.g., instructions, computer code, etc.). As shown in fig. 5, the data stored in the memory 504 may include program instructions (e.g., program instructions for implementing the method 200 or 400 of the embodiments of the present disclosure) as well as data to be processed (e.g., the memory may store configuration files of other modules, etc.). The processor 502 may access the program instructions and data stored in the memory 504 and execute the program instructions to operate on the data to be processed. The memory 504 may include volatile storage or nonvolatile storage. In some embodiments, the memory 504 may include random access memory (RAM), read-only memory (ROM), optical disks, magnetic disks, hard disks, solid-state drives (SSD), flash memory, memory sticks, and the like.
The network interface 506 may be configured to provide the computer device 500 with communication with other external devices via a network. The network may be any wired or wireless network capable of transmitting and receiving data. For example, the network may be a wired network, a local wireless network (e.g., Bluetooth, Wi-Fi, Near Field Communication (NFC), etc.), a cellular network, the Internet, or a combination of the foregoing. It will be appreciated that the type of network is not limited to the specific examples described above.
The peripheral interface 508 may be configured to connect the computer device 500 with one or more peripheral devices to enable information input and output. For example, the peripheral devices may include input devices such as keyboards, mice, touchpads, touch screens, microphones, and various types of sensors, and output devices such as displays, speakers, vibrators, and indicators.
The bus 510 may be configured to transfer information between the various components of the computer device 500 (e.g., the processor 502, the memory 504, the network interface 506, and the peripheral interface 508), and may be, for example, an internal bus (e.g., a processor-memory bus) or an external bus (e.g., a USB port or a PCI-E bus).
It should be noted that, although the architecture of the computer device 500 described above illustrates only the processor 502, the memory 504, the network interface 506, the peripheral interface 508, and the bus 510, in a specific implementation, the architecture of the computer device 500 may also include other components necessary to achieve proper operation. Moreover, those skilled in the art will appreciate that the architecture of the computer device 500 described above may include only the components necessary to implement the disclosed embodiments, and not all of the components shown in the figures.
The embodiments of the present disclosure also provide an apparatus for splitting the screen of a video conference picture. Fig. 6 shows a schematic diagram of an exemplary apparatus 600 provided by an embodiment of the present disclosure. As shown in fig. 6, the apparatus 600 may be used to implement the method 200 or 400 and may include the following modules.
An acquisition module 602 configured to: and acquiring a target image acquired by the acquisition unit.
Taking fig. 1A as an example, the acquisition unit may be a camera provided in the terminal device 102, 104, and the target image may be images 1022 and 1042 acquired by the camera.
A detection module 604 configured to: target objects (e.g., target objects 302A-302C of fig. 3A) in the target image (e.g., image 300 of fig. 3A) are detected.
As an alternative embodiment, object Detection (Object Detection) techniques may be employed to detect the target Object in the target image 300. Alternatively, the target objects 302A to 302C in the target image 300 may be detected using a pre-trained target detection model, and detection frames 304A to 304C corresponding to the target objects 302A to 302C may be obtained, as shown in fig. 3B.
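The disclosure does not name a particular detector; as one possible stand-in for a pre-trained target detection model, a stock OpenCV Haar cascade face detector returns detection frames in (x, y, w, h) form:

```python
import cv2

def detect_target_objects(image_bgr):
    """Detect faces as target objects and return their detection frames.

    A bundled OpenCV Haar cascade is used here purely as a placeholder for the
    unspecified pre-trained target detection model; any person or face
    detector with box outputs could be substituted.
    """
    cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    detector = cv2.CascadeClassifier(cascade_path)
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [tuple(int(v) for v in box) for box in boxes]  # (x, y, w, h) per object
```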
A partitioning module 606 configured to: dividing the video conference picture into at least two sub-pictures according to the target object.
A display module 608 configured to: and correspondingly displaying the target object in the target image in the at least two sub-pictures.
With the split-screen method for video conference pictures provided by the embodiments of the present disclosure, the video conference picture is divided into at least two sub-pictures according to the target objects detected in the target image captured by the acquisition unit, and each target object is displayed in its corresponding sub-picture. The video conference picture can therefore be split automatically when the captured target image contains multiple participants, which improves the sense of interaction between the participants in the conference room and the other online participants during the video conference and thus improves the user experience.
In some embodiments, the partitioning module 606 is configured to: determine whether a target object in the target image is speaking; and in response to determining that a target object in the target image is speaking, divide the video conference picture into at least two sub-pictures in a first split-screen mode, as shown in fig. 3G. In this way, when someone is speaking, a speaker mode (the first split-screen mode) is used: the person currently speaking is placed in the largest sub-picture, and the remaining participants are arranged side by side on at least one side of it (on two or more sides when there are many of them). When no one is speaking, an ordinary split-screen mode (a second split-screen mode) is used, in which all sub-pictures have the same size. Selecting the split-screen mode according to the speaker detection result increases the interactivity of the video conference and improves the user experience.
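A simplified sketch of choosing between the first (speaker) split-screen mode and the second (equal-size) mode follows; the single-row arrangement, the 75/25 width split, and the function signature are illustrative assumptions, not part of the disclosure.

```python
def layout_subpictures(num_objects, speaking_index, width, height):
    """Compute sub-picture rectangles (x, y, w, h) for speaker vs. grid mode."""
    if num_objects == 0:
        return []
    if speaking_index is not None and num_objects > 1:
        # First split-screen (speaker) mode: largest sub-picture for the speaker,
        # remaining participants stacked on one side.
        main_w = int(width * 0.75)
        boxes = [(0, 0, main_w, height)]
        side_h = height // (num_objects - 1)
        for i in range(num_objects - 1):
            boxes.append((main_w, i * side_h, width - main_w, side_h))
        # Assign the large box to the speaker, the side boxes to everyone else.
        ordered = [None] * num_objects
        ordered[speaking_index] = boxes[0]
        rest = iter(boxes[1:])
        for i in range(num_objects):
            if ordered[i] is None:
                ordered[i] = next(rest)
        return ordered
    # Second split-screen mode: equal-size sub-pictures in one row.
    cell_w = width // num_objects
    return [(i * cell_w, 0, cell_w, height) for i in range(num_objects)]
```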
In some embodiments, the partitioning module 606 is configured to: display a first sub-picture (e.g., sub-picture 3302 of fig. 3G) of the at least two sub-pictures in an enlarged manner, and display the other sub-pictures of the at least two sub-pictures side by side on at least one side of the first sub-picture, wherein the first sub-picture is used for displaying the speaking target object;
the display module 608 is configured to: display the target object corresponding to the speaking participant in the first sub-picture.
In this way, according to the speaker detection result, the sub-picture of the speaking participant is placed in the middle of the picture and occupies a larger area, while the sub-pictures of the non-speaking participants are placed at the sides and occupy smaller areas, which further improves interactivity.
In some embodiments, the detection module 604 is configured to: detect key points of the target object; determine a lip height based on the key points of the target object; determine a lip width based on the key points of the target object; obtain the lip aspect ratio of the target object from the lip height and the lip width; and determine, based on change information of the lip aspect ratio, whether the target object is speaking. In this way, speaker detection is performed by image processing, avoiding the problem that the audio stream alone cannot identify the specific person currently speaking.
In some embodiments, the detection module 604 is configured to: perform key point detection on the target object; and determine, according to the key point detection result, whether the target object in the target image is speaking. Again, speaker detection is performed by image processing, avoiding the problem that the audio stream alone cannot identify the specific person currently speaking.
In some embodiments, the detection module 604 is configured to:
detect a plurality of key points of the target object;
perform roll angle correction on the plurality of key points according to the correspondence between the key points and standard key points, to obtain corrected key points;
select, from the corrected key points, a plurality of first key points corresponding to the lip height, and perform pitch angle correction on the plurality of first key points to obtain a corrected lip height;
select, from the corrected key points, a plurality of second key points corresponding to the lip width, and perform yaw angle correction on the plurality of second key points to obtain a corrected lip width;
and obtain the key point detection result from the corrected lip height and the corrected lip width.
In order to reduce misjudgment of speaking detection caused by face movement and rotation, the detected face key points are mapped onto a non-rotated standard face before the speaking judgment is made, which reduces the influence of head motion.
In some embodiments, the detection module 604 is configured to: calculate the lip aspect ratio of the target object from the corrected lip height and the corrected lip width, and store the lip aspect ratio;
and determine, from the key point detection result, change information of the lip aspect ratio of the target object in the target image, and determine from the change information whether the participant corresponding to the target object in the target image is speaking.
In this way, the temporal information of the key points enhances the stability of speaker detection over time and reduces fluctuation of the detection state.
In some embodiments, the detection module 604 is configured to: set a preset time period; count the number of changes of the lip aspect ratio within the preset time period; and determine that the target object is speaking when the number of changes reaches a preset number.
In this way, the temporal information of the key points enhances the stability of speaker detection over time and reduces fluctuation of the detection state.
In some embodiments, the detection module 604 is configured to: determine that the number of changes reaches the preset number; acquire audio data of the video conference; and determine, in combination with the audio data of the video conference, that the target object is speaking. Combining the key point detection result with the audio data of the current video conference further improves the accuracy of speaker judgment. As an optional embodiment, when the pickup collecting the audio is a dual-channel (binaural) pickup, the speaker can also be localized from the two sets of audio data it collects, further improving the accuracy of speaker judgment.
In some embodiments, the detection module 604 is configured to: detect a target object in the target image using a target detection or target tracking technique, to obtain a detection frame corresponding to the target object;
and the display module 608 is configured to: determine the coordinates of the detection frame corresponding to the target object; determine the coordinates of the sub-picture corresponding to the target object; and translate and/or scale the image within the detection frame into the sub-picture corresponding to the target object according to the coordinates of the detection frame and the coordinates of the sub-picture.
By scaling the target object in this way, the participants can clearly see the other attendees, which improves the interactivity of the conference.
In some embodiments, the display module 608 is configured to: in response to determining that the virtual background function of a sub-picture is turned on, segment the target object from the background using a segmentation technique to obtain a segmentation result; and display a virtual background in that sub-picture according to the segmentation result, which fills a gap in the related art where no virtual background is displayed in sub-pictures.
In some embodiments, the display module 608 is configured to: in response to determining that the virtual background function of a second sub-picture of the at least two sub-pictures is turned on, display a virtual background in the second sub-picture (e.g., sub-picture 3304 of fig. 3G). This likewise fills a gap in the related art where no virtual background is displayed in sub-pictures.
In some embodiments, the display module 608 is configured to: segment the target object in the target image from the background using a semantic segmentation technique to obtain a segmentation result; and display a virtual background in the second sub-picture according to the segmentation result. In this way, the target object is separated from the real background by semantic segmentation, so that the virtual background replacement is well realized.
In some embodiments, the display module 608 is configured to: in response to determining that the target object is speaking, display an indication mark in the sub-picture corresponding to the target object, thereby reminding the other participants that the person in that sub-picture is speaking and improving interactivity.
For convenience of description, the apparatus above is described as being divided into modules by function. Of course, when implementing the present disclosure, the functions of the modules may be implemented in one or more of the same pieces of software and/or hardware.
The apparatus of the foregoing embodiments is configured to implement the corresponding method 200 or 400 of any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
Based on the same inventive concept, and corresponding to any of the method embodiments described above, the present disclosure also provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method 200 or 400 described in any of the above embodiments.
The computer-readable medium of the present embodiments includes both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The storage medium of the foregoing embodiments stores computer instructions for causing the computer to perform the method 200 or 400 described in any of the foregoing embodiments, and has the advantages of the corresponding method embodiments, which are not described herein.
Based on the same inventive concept, and corresponding to the method 200 or 400 of any of the embodiments described above, the present disclosure also provides a computer program product comprising a computer program. In some embodiments, the computer program is executable by one or more processors to cause the processors to perform the method 200 or 400. For each step of the method 200 or 400, the processor executing that step may belong to the execution body corresponding to that step in the method embodiments.
The computer program product of the above embodiments is configured to cause a processor to perform the method 200 or 400 of any of the above embodiments, and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
Those of ordinary skill in the art will appreciate that the discussion of any of the embodiments above is merely exemplary and is not intended to imply that the scope of the present disclosure (including the claims) is limited to these examples. Under the idea of the present disclosure, the technical features of the above embodiments, or of different embodiments, may also be combined, the steps may be implemented in any order, and many other variations of the different aspects of the embodiments of the present disclosure exist as described above, which are not provided in detail for the sake of brevity.
Additionally, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures, in order to simplify the illustration and discussion, and so as not to obscure the embodiments of the present disclosure. Furthermore, the devices may be shown in block diagram form in order to avoid obscuring the embodiments of the present disclosure, and this also accounts for the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform on which the embodiments of the present disclosure are to be implemented (i.e., such specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the disclosure, it should be apparent to one skilled in the art that embodiments of the disclosure can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.
While the present disclosure has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the embodiments discussed.
The disclosed embodiments are intended to embrace all such alternatives, modifications and variations as fall within the broad scope of the appended claims. Accordingly, any omissions, modifications, equivalent replacements, improvements, and the like made within the spirit and principles of the embodiments of the present disclosure shall be included within the scope of the present disclosure.

Claims (13)

1. A split screen method for video conference pictures, comprising:
acquiring a target image acquired by an acquisition unit;
detecting a target object in the target image;
dividing a video conference picture into at least two sub-pictures according to the target object;
and correspondingly displaying the target object in the target image in the at least two sub-pictures.
2. The method of claim 1, wherein dividing the video conference picture into at least two sub-pictures according to the target object further comprises:
determining whether a target object in the target image is speaking;
in response to determining that the target object is speaking, dividing the video conference picture into at least two sub-pictures in a first split-screen mode.
3. The method of claim 2, wherein dividing the video conference picture into at least two sub-pictures in a first split-screen mode comprises:
magnifying and displaying a first sub-picture in the at least two sub-pictures, wherein the first sub-picture is used for displaying a speaking target object;
and displaying other sub-pictures in the at least two sub-pictures in parallel on at least one side of the first sub-picture.
4. The method of claim 2, wherein determining whether a target object in the target image is speaking comprises:
detecting key points of the target object;
determining a lip height based on the keypoints of the target object;
determining a lip width based on the keypoints of the target object;
obtaining the lip height-width ratio of the target object according to the lip height and the lip width;
and determining whether the target object is speaking based on change information of the lip height-width ratio.
5. The method of claim 4, wherein determining whether the target object is speaking based on the change information of the lip height-width ratio comprises:
setting a preset time period;
counting the number of changes of the lip height-width ratio within the preset time period;
and when the number of changes reaches a preset number, determining that the target object is speaking.
6. The method of claim 5, wherein determining that the target object is speaking when the number of changes reaches a preset number comprises:
determining that the number of changes reaches the preset number;
acquiring audio data of a video conference;
and determining, in combination with the audio data of the video conference, that the target object is speaking.
7. The method of claim 1, wherein detecting a target object in the target image comprises:
detecting a target object in the target image by utilizing a target detection or target tracking technology to obtain a detection frame of the target object;
and correspondingly displaying the target object in the target image in the at least two sub-pictures comprises:
determining coordinates of a detection frame corresponding to the target object;
determining the coordinates of the sub-picture corresponding to the target object;
and translating and/or scaling the image corresponding to the detection frame into the sub-picture corresponding to the target object according to the coordinates of the detection frame and the coordinates of the sub-picture.
8. The method of claim 1, wherein correspondingly displaying the target object in the target image in the at least two sub-pictures further comprises:
in response to determining that the virtual background function of the sub-picture is turned on, segmenting the target object and the background by using a segmentation technique to obtain a segmentation result;
and displaying a virtual background in the sub-picture according to the segmentation result.
9. The method of claim 1, wherein correspondingly displaying the target object in the target image in the at least two sub-pictures further comprises:
and in response to determining that the target object is speaking, displaying an indication identifier in a sub-picture corresponding to the target object.
10. A split screen device for video conference pictures, comprising:
an acquisition module configured to: acquiring a target image acquired by an acquisition unit;
a detection module configured to: detecting a target object in the target image;
a partitioning module configured to: dividing a video conference picture into at least two sub-pictures according to the target object;
a display module configured to: and correspondingly displaying the target object in the target image in the at least two sub-pictures.
11. A computer device, comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and executed by the one or more processors, and the one or more programs comprise instructions for performing the method of any one of claims 1-9.
12. A non-transitory computer readable storage medium containing a computer program which, when executed by one or more processors, causes the processors to perform the method of any of claims 1-9.
13. A computer program product comprising computer program instructions which, when run on a computer, cause the computer to perform the method of any of claims 1-9.
CN202310611376.5A 2023-05-26 2023-05-26 Screen splitting method of video conference picture and related equipment Pending CN116582637A (en)

Priority Applications (1)

CN202310611376.5A (priority date: 2023-05-26; filing date: 2023-05-26): Screen splitting method of video conference picture and related equipment

Applications Claiming Priority (1)

CN202310611376.5A (priority date: 2023-05-26; filing date: 2023-05-26): Screen splitting method of video conference picture and related equipment

Publications (1)

Publication number: CN116582637A; publication date: 2023-08-11

Family ID: 87537562

Family Applications (1)

CN202310611376.5A: Screen splitting method of video conference picture and related equipment

Country Status (1)

CN: CN116582637A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination