CN114419694A - Processing method and processing device for head portrait of multi-person video conference


Info

Publication number: CN114419694A
Authority: CN (China)
Prior art keywords: participant, head portrait, face, hand gesture
Legal status: Pending
Application number: CN202111571415.0A
Other languages: Chinese (zh)
Inventors: 肖兵, 王文熹, 李春
Current assignee: Zhuhai Shixi Technology Co Ltd
Original assignee: Zhuhai Shixi Technology Co Ltd
Application filed by Zhuhai Shixi Technology Co Ltd
Priority application: CN202111571415.0A

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 65/00: Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L 65/40: Support for services or applications
    • H04L 65/403: Arrangements for multi-party communication, e.g. for conferences
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00: Television systems
    • H04N 7/14: Systems for two-way working
    • H04N 7/15: Conference systems

Abstract

The embodiments of the application disclose a processing method and a processing device for avatars in a multi-person video conference, which automatically replace a participant's virtual avatar according to the participant's behavior state in a multi-person video conference scene, improving the user's experience of participating in a video conference. The method in the embodiments of the application comprises the following steps: acquiring a first video of a multi-person video conference scene, and obtaining a first video frame from the first video; determining hand gesture information of a participant in the first video frame through a human gesture detection algorithm; determining the participant's face position according to the hand gesture information; determining the type of control request initiated by the participant according to the hand gesture information, and performing a corresponding operation on the face position according to the control request type; when it is determined that the control request type initiated by the participant is not removing the virtual avatar, detecting the participant's behavior state in real time; and when the behavior state is detected to meet a first preset condition, replacing the participant's virtual avatar.

Description

Processing method and processing device for head portrait of multi-person video conference
Technical Field
The embodiments of the application relate to the technical field of video processing, and in particular to a processing method and a processing device for avatars in a multi-person video conference.
Background
In recent years, video conference systems have become an indispensable part of the informatization of many enterprises, in areas such as routine remote meetings, collaborative office work and remote training. When a problem at work cannot be solved by telephone, mail or instant message and people need to communicate more directly, the video conference becomes an important way of solving it. A video conference is a way of holding a meeting through multimedia devices and a communication network. When a conference is held, participants in several different places can not only hear each other's voices but also see each other's images, as well as the scene of the other side's conference room and the objects, pictures and documents displayed in it, which shortens the distance between the participants and helps accomplish the purpose of the conference.
The technical principle of the video conference is that images, sounds, text and the like are converted into digital signals at the sending end, compressed and encoded, transmitted to the receiving end through a communication network, and restored at the receiving end into audio and video signals that can be perceived.
However, in a multi-person video conference scene, some participants do not want to appear on camera for reasons such as not wearing makeup, and existing video conference software that can set a virtual avatar cannot meet this need. First, such video conference systems are usually designed for single-person scenes: the avatar can be replaced for only one person in the picture, and multiple people cannot be handled. Second, replacing the avatar requires the user to operate the system manually, an interaction that is very inconvenient in a multi-person conference scene where the participants sit far from the screen.
Disclosure of Invention
The embodiments of the application provide a processing method and a processing device for avatars in a multi-person video conference, which automatically replace a participant's virtual avatar according to the participant's behavior state in a multi-person video conference scene, without requiring the user to operate the video conference system manually, thereby improving the user's experience of participating in a video conference.
In a first aspect, the present application provides a processing method for avatars in a multi-person video conference, including:
acquiring a first video of a multi-person video conference scene, and obtaining a first video frame from the first video, wherein the first video frame contains the faces and hands of all participants in the conference scene;
determining hand gesture information of a participant in the first video frame through a human gesture detection algorithm, wherein the hand gesture information is gesture information matched with a predefined gesture;
determining the face position of the participant according to the hand gesture information;
determining the type of control request initiated by the participant according to the hand gesture information, and performing a corresponding operation on the face position according to the control request type, wherein the control request types include adding a virtual avatar, replacing the virtual avatar and removing the virtual avatar;
when it is determined that the control request type initiated by the participant is not removing the virtual avatar, detecting the behavior state of the participant in real time;
and when the behavior state is detected to meet a first preset condition, replacing the virtual avatar for the participant.
Optionally, the behavior state includes the hand motion state of the participant, and the first preset condition is a preset frequency;
the replacing the virtual avatar for the participant when the behavior state is detected to meet the first preset condition includes:
judging whether the hand motion frequency of the participant meets the preset frequency;
and if so, replacing the virtual avatar for the participant.
Optionally, the behavior state includes the sound state of the participant, and the first preset condition is a preset decibel level;
the replacing the virtual avatar for the participant when the behavior state is detected to meet the first preset condition includes:
judging whether the decibel level of the sound emitted by the participant meets the preset decibel level;
and if so, replacing the virtual avatar for the participant.
Optionally, after replacing the virtual avatar for the participant, the processing method further includes:
controlling the virtual avatar to move as the face position moves.
Optionally, the hand gesture information includes the number of hand gestures;
the determining the face position of the participant according to the hand gesture information includes:
judging whether the number of hand gestures reaches a first preset value;
if it is determined that the number of hand gestures does not reach the first preset value, determining the face position of the participant in the first video frame through a face detection algorithm;
and if it is determined that the number of hand gestures reaches the first preset value, determining the bodies of the participants in the regions corresponding to the hand gestures through a human body detection algorithm, and performing face detection on the bodies of the participants in the corresponding regions to obtain the face positions of the participants.
Optionally, the hand gesture information includes the distance between hand gestures;
the determining the face position of the participant according to the hand gesture information includes:
judging whether the distance between hand gestures reaches a second preset value;
if it is determined that the distance between hand gestures reaches the second preset value, determining the face position of the participant in the first video frame through a face detection algorithm;
and if it is determined that the distance between hand gestures does not reach the second preset value, determining the bodies of the participants in the regions corresponding to the hand gestures through a human body detection algorithm, and performing face detection on the bodies of the participants in the corresponding regions to obtain the face positions of the participants.
Optionally, before the determining the type of control request initiated by the participant according to the hand gesture information, the processing method further includes:
determining the face keypoints of the participant according to the face position information;
and determining a face action region according to the face keypoints.
Optionally, the predefined gestures are stored in a database;
the performing a corresponding operation on the face position according to the control request type includes at least one of the following:
when the control request type is adding a virtual avatar, filling the face action region with any virtual avatar from the database;
when the control request type is replacing the virtual avatar, filling the face action region with any virtual avatar from the database other than the original one;
and when the control request type is removing the virtual avatar, hiding the virtual avatar that fills the face action region.
In a second aspect, the present application provides a processing apparatus for avatars in a multi-person video conference, comprising:
a first acquiring unit, configured to acquire a first video of a multi-person video conference scene and obtain a first video frame from the first video, wherein the first video frame contains the faces and hands of all participants in the conference scene;
a first determining unit, configured to determine hand gesture information of a participant in the first video frame through a human gesture detection algorithm, wherein the hand gesture information is gesture information matched with a predefined gesture;
a second determining unit, configured to determine the face position of the participant according to the hand gesture information;
a third determining unit, configured to determine the type of control request initiated by the participant according to the hand gesture information, and perform a corresponding operation on the face position according to the control request type, wherein the control request types include adding a virtual avatar, replacing the virtual avatar and removing the virtual avatar;
a behavior detection unit, configured to detect the behavior state of the participant in real time when the third determining unit determines that the type of control request initiated by the participant is not removing the virtual avatar;
and a first execution unit, configured to replace the virtual avatar for the participant when the behavior detection unit detects that the behavior state meets a first preset condition.
Optionally, the behavior state includes the hand motion state of the participant, and the first preset condition is a preset frequency;
the first execution unit includes:
a first judging module, configured to judge whether the hand motion frequency of the participant meets the preset frequency;
and a second execution module, configured to replace the virtual avatar for the participant when the first judging module determines that the hand motion frequency of the participant meets the preset frequency.
Optionally, the behavior state includes the sound state of the participant, and the first preset condition is a preset decibel level;
the first execution unit includes:
a second judging module, configured to judge whether the decibel level of the sound emitted by the participant meets the preset decibel level;
and a third execution module, configured to replace the virtual avatar for the participant when the second judging module determines that the decibel level of the sound emitted by the participant meets the preset decibel level.
Optionally, the processing apparatus further includes:
a movement control unit, configured to control the virtual avatar to move as the face position moves.
Optionally, the hand gesture information includes the number of hand gestures;
the second determining unit includes:
a third judging module, configured to judge whether the number of hand gestures reaches a first preset value;
a fourth execution module, configured to determine the face position of the participant in the first video frame through a face detection algorithm when the third judging module determines that the number of hand gestures does not reach the first preset value;
and a fifth execution module, configured to determine the bodies of the participants in the regions corresponding to the hand gestures through a human body detection algorithm, and perform face detection on the bodies of the participants in the corresponding regions to obtain the face positions of the participants, when the third judging module determines that the number of hand gestures reaches the first preset value.
Optionally, the hand gesture information includes the distance between hand gestures;
the second determining unit includes:
a fourth judging module, configured to judge whether the distance between hand gestures reaches a second preset value;
a sixth execution module, configured to determine the face position of the participant in the first video frame through a face detection algorithm when the fourth judging module determines that the distance between hand gestures reaches the second preset value;
and a seventh execution module, configured to determine the bodies of the participants in the regions corresponding to the hand gestures through a human body detection algorithm, and perform face detection on the bodies of the participants in the corresponding regions to obtain the face positions of the participants, when the fourth judging module determines that the distance between hand gestures does not reach the second preset value.
Optionally, the processing apparatus further includes:
a fourth determining unit, configured to determine the face keypoints of the participant according to the face position information;
and a fifth determining unit, configured to determine the face action region according to the face keypoints.
Optionally, the predefined gestures are stored in a database;
the third determining unit is configured to perform at least one of the following:
when the control request type is adding a virtual avatar, filling the face action region with any virtual avatar from the database;
when the control request type is replacing the virtual avatar, filling the face action region with any virtual avatar from the database other than the original one;
and when the control request type is removing the virtual avatar, hiding the virtual avatar that fills the face action region.
In a third aspect, the present application provides a processing apparatus for avatars in a multi-person video conference, comprising:
a processor, a memory, an input/output unit and a bus;
the processor is connected with the memory, the input/output unit and the bus;
the memory holds a program, and the processor calls the program to perform the processing method according to the first aspect or any optional implementation of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium having a program stored thereon, wherein the program, when executed on a computer, performs the processing method according to the first aspect or any optional implementation of the first aspect.
According to the above technical solutions, the embodiments of the application have the following advantages:
A first video frame is obtained from a first video of a multi-person video conference scene; the hand gesture information of a participant in the first video frame is determined through a human gesture detection algorithm; the participant's face position is determined according to the hand gesture information; the type of control request initiated by the participant is determined according to the hand gesture information, and the corresponding processing is performed on the participant's face according to the control request type; when the control request type initiated by the participant is not removing the virtual avatar, the participant's behavior state is detected in real time; and when the behavior state meets a first preset condition, the virtual avatar is replaced for the participant. By changing a participant's avatar in the video conference scene according to the participant's behavior state, the method makes participating in a video conference more engaging and improves the user's video conference experience.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings based on them without creative effort.
Fig. 1 is a schematic flowchart of an embodiment of a processing method for avatars in a multi-person video conference in an embodiment of the present application;
Fig. 2 is a schematic flowchart of another embodiment of a processing method for avatars in a multi-person video conference in an embodiment of the present application;
Fig. 3 is a schematic structural diagram of an embodiment of a processing apparatus for avatars in a multi-person video conference in an embodiment of the present application;
Fig. 4 is a schematic structural diagram of another embodiment of a processing apparatus for avatars in a multi-person video conference in an embodiment of the present application;
Fig. 5 is a schematic structural diagram of yet another embodiment of a processing apparatus for avatars in a multi-person video conference in an embodiment of the present application.
Detailed Description
A video conference system is a remote communication carrier that uses communication technologies such as networks to transmit media, allowing people's video, audio and other information to travel over the network. It can bring together people scattered across different regions and at different decision-making levels into one virtual space, shortening distances, accelerating the flow of information and knowledge, promoting team collaboration, speeding up decision-making, improving work efficiency and greatly reducing costs.
However, in a multi-person video conference scene, some participants do not want to appear on camera for reasons such as not wearing makeup, and existing video conference software that can set a virtual avatar cannot meet this need. First, such video conference systems are usually designed for single-person scenes: the avatar can be replaced for only one person in the picture, and multiple people cannot be handled. Second, replacing the avatar requires the user to operate the system manually, an interaction that is very inconvenient in a multi-person conference scene where the participants sit far from the screen.
On this basis, the application provides a processing method and a processing device for avatars in a multi-person video conference. Applied to a multi-person video conference scene, the method obtains a first video frame from a first video of the scene, determines the hand gesture information of the participants in the first video frame through a human gesture detection algorithm, determines the participants' face positions according to the hand gesture information, determines the type of control request initiated by a participant according to the hand gesture information, and performs the corresponding processing on the participant's face according to the control request type; when the control request type initiated by the participant is not removing the virtual avatar, the participant's behavior state is detected in real time, and when the behavior state meets a first preset condition, the virtual avatar is replaced for the participant. The participants' avatars can thus be changed according to their behavior states without anyone manually operating the video conference system in a multi-person video conference scene.
The technical solutions in the present application will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, in a first aspect, the present application provides a processing method for avatars in a multi-person video conference. The method may be implemented by a system, a server or a terminal, which is not specifically limited; for convenience of description, the embodiments of the application take the system as the execution subject by way of example. The method comprises the following steps:
101. The system acquires a first video of a multi-person video conference scene and obtains a first video frame from the first video, wherein the first video frame contains the faces and hands of all participants in the conference scene;
specifically, a camera device arranged on a conference display device, a conference table or a conference screen records a multi-person video conference scene in real time, and the system receives video data recorded by the camera device through each camera device coupled to the system to process the video data, wherein the processing can be compression or encoding and the like. The processed video data can be synchronously displayed, for example, when a meeting is opened, the processed video can be synchronously displayed on a screen at a specific position in a meeting room.
In a multi-person video conference scene, video recording in the conference room starts when the video conference starts. During the conference, when a participant does not want to appear on camera because of makeup or other special reasons, the participant's face needs to be hidden in some way, and before the face can be processed, its position must be recognized.
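By way of illustration only, a minimal sketch of this acquisition step in Python with OpenCV follows; the camera index and the generator interface are assumptions, not part of the application:

```python
import cv2

def grab_first_video_frames(device_index: int = 0):
    """Yield frames from a conference-room camera; each frame is a
    candidate "first video frame" containing the participants."""
    capture = cv2.VideoCapture(device_index)  # camera coupled to the system
    try:
        while True:
            ok, frame = capture.read()
            if not ok:
                break  # camera disconnected or stream ended
            yield frame  # BGR ndarray; later steps run detection on it
    finally:
        capture.release()
```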
102. The system determines hand gesture information of the participant in the first video frame through a human gesture detection algorithm;
in this embodiment of the application, in order to detect the hand gesture state of the participant, the system needs to further identify the hand gesture information of the participant by using a human gesture detection algorithm.
In the embodiments of the application, the system may first determine the participant's gesture and only process the participant further when the gesture meets the system's trigger condition for avatar processing, that is, when the gesture matches a predefined gesture; the further processing includes, for example, determining the participant's body position and face position. Many human gesture detection algorithms can be used to determine the hand gesture. For example, an initial gesture detection model can be built on a neural network and trained with a small labeled dataset containing hand keypoints: the hand is photographed by several high-definition cameras preset at different viewing angles; the gesture detection model makes an initial detection of the hand keypoints; the keypoints are triangulated using the camera positions to obtain their 3D locations; the computed 3D locations are re-projected onto the 2D images of the different viewing angles; and the 2D images with the re-projected keypoints are used as new labeled data to retrain the detection network. After several iterations this yields a reasonably accurate hand detection model, and the first video frame is input to it to determine the hand gesture information. Of course, recognition may also be performed directly with a TOF depth camera.
Furthermore, the hand gesture information may include various information related to the participants' gestures, for example the distances between recognized effective gestures, the number of recognized effective gestures, and the recognized effective gesture actions. The effective gestures referred to here are gestures that meet the trigger condition; gestures that do not meet it can be ignored and left uncounted, which simplifies the system's analysis of the participants and reduces the analysis time. The predefined gestures are the gestures that can trigger the corresponding conditions, and they may be stored in the database.
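The application does not mandate a particular detection model. As one hedged sketch, an off-the-shelf hand-keypoint detector such as MediaPipe Hands (not named in the application) can stand in for the trained hand detection model, with a crude finger-count descriptor standing in for the predefined-gesture matching:

```python
import cv2
import mediapipe as mp

FINGER_TIPS = (8, 12, 16, 20)   # index, middle, ring, little fingertips
FINGER_PIPS = (6, 10, 14, 18)   # the joints directly below those tips

def count_raised_fingers(hand_landmarks) -> int:
    """Crude pose descriptor: a finger counts as raised when its tip lies
    above its middle joint in image coordinates (y grows downward)."""
    lm = hand_landmarks.landmark
    return sum(lm[t].y < lm[p].y for t, p in zip(FINGER_TIPS, FINGER_PIPS))

def detect_effective_gestures(frame_bgr, predefined_gestures):
    """Return (landmarks, gesture_name) pairs for hands matching a
    predefined gesture; hands matching nothing are ignored, as above."""
    hits = []
    with mp.solutions.hands.Hands(static_image_mode=True,
                                  max_num_hands=10) as hands:
        result = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        for hand in result.multi_hand_landmarks or []:
            name = predefined_gestures.get(count_raised_fingers(hand))
            if name is not None:
                hits.append((hand, name))
    return hits

# e.g. predefined_gestures = {2: "add", 3: "replace", 5: "remove"}
```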
103. The system determines the face positions of the participants according to the hand gesture information;
In the embodiments of the application, after the system acquires a first video frame during the multi-person video conference, the face positions of the participants can be determined from the first video frame so as to lock onto each participant's processable face region. It should be noted that, to better match each hand gesture to the corresponding participant's face, a certain area near the hand gesture can be set as the face/human body detection area.
For example, after the system determines the hand gesture information, it further determines a certain area near the hand gesture, crops the partial image showing that area, and feeds it to a face detector from the OpenCV library, thereby recognizing the position coordinates of the participants' faces in the multi-person conference scene. OpenCV is an open-source library for image processing, image analysis and machine vision; it is optimized in C and contains hundreds of vision algorithms.
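A hedged sketch of this step, using OpenCV's stock Haar-cascade face detector (one of many detectors the library offers; the search margin around the gesture is an assumption):

```python
import cv2

_face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def faces_near_gesture(frame_bgr, gesture_box, margin=1.5):
    """Crop a region around a detected hand gesture and run face
    detection only inside it; face boxes come back in frame coordinates."""
    x, y, w, h = gesture_box
    cx, cy = x + w // 2, y + h // 2
    half = int(max(w, h) * margin)
    x0, y0 = max(cx - half, 0), max(cy - half, 0)
    x1 = min(cx + half, frame_bgr.shape[1])
    y1 = min(cy + half, frame_bgr.shape[0])
    roi = cv2.cvtColor(frame_bgr[y0:y1, x0:x1], cv2.COLOR_BGR2GRAY)
    faces = _face_cascade.detectMultiScale(roi, scaleFactor=1.1,
                                           minNeighbors=5)
    return [(fx + x0, fy + y0, fw, fh) for fx, fy, fw, fh in faces]
```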
Optionally, when the number of participants reaches a certain level, the system can instead use a human body detection algorithm to detect the body region connected to an effective hand gesture, crop the upper half of that body region, and input it into a pre-built neural network model that recognizes facial features to track the face position, which greatly reduces the computational cost. The specific method of determining the face position information of the relevant participants in the first video frame is not limited here.
Furthermore, after the system determines the face position, an association needs to be established between the hand gesture and the face position, so that the system can later operate on the corresponding face according to the participant's gestures.
Embodiments of associating the face position information with the hand gesture information specifically include, but are not limited to, the following: after the system acquires the face position information and the related hand gesture information, it generates a unique ID for the participant, binds the ID to the participant's gesture and face position information respectively, and creates a gesture accumulation record.
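A minimal sketch of such a binding, assuming a simple in-memory record; the field names are illustrative, not terms from the application:

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional, Tuple

_next_id = count(1)

@dataclass
class Participant:
    """Binds one participant's face position to their gesture record."""
    participant_id: int = field(default_factory=lambda: next(_next_id))
    face_box: Optional[Tuple[int, int, int, int]] = None  # (x, y, w, h)
    gesture_history: list = field(default_factory=list)   # accumulated gestures

    def record_gesture(self, gesture_name: str) -> None:
        self.gesture_history.append(gesture_name)
```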
104. The system determines the type of control request initiated by the participant according to the hand gesture information, and performs the corresponding operation on the face position according to the control request type;
in the embodiment of the present application, the control request type and the hand gesture information may be in one-to-one correspondence, or multiple types of hand gesture information may all correspond to one control request type. The corresponding relation between the hand gesture and the control request type is preset, and the control request type can be virtual avatar addition, virtual avatar removal or virtual avatar replacement. When the system identifies the hand gesture information of the participant, the hand gesture information can be matched with the corresponding gestures of each control request type, so that the control request type sent by the participant is determined, and corresponding operation is executed according to the face position corresponding to the determined control request type.
For example, hand gestures can be set so that showing a particular number of fingers corresponds to different control request types: if the gesture is detected as holding up two fingers, the control request type is adding a virtual avatar; if three fingers, replacing the virtual avatar; and if five fingers, removing the virtual avatar. Alternatively, the gestures can be set so that raising a hand and waving it in front of the face adds the virtual avatar, while waving it twice removes it. The specific arrangement is not limited here.
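Taking the finger-count example above literally, the correspondence could be sketched as a lookup table; the enum names are illustrative, not terms from the application:

```python
from enum import Enum
from typing import Optional

class ControlRequest(Enum):
    ADD_AVATAR = "add"
    REPLACE_AVATAR = "replace"
    REMOVE_AVATAR = "remove"

# Finger counts follow the example in the text; they are not fixed
# by the application and could be any predefined gesture set.
FINGERS_TO_REQUEST = {
    2: ControlRequest.ADD_AVATAR,
    3: ControlRequest.REPLACE_AVATAR,
    5: ControlRequest.REMOVE_AVATAR,
}

def request_for_gesture(raised_fingers: int) -> Optional[ControlRequest]:
    """Map a recognized gesture to a control request type, or None."""
    return FINGERS_TO_REQUEST.get(raised_fingers)
```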
Furthermore, adding or replacing the virtual avatar serves to mask the participant's real appearance, or to add interest, by changing how the participant's head is displayed in the picture, so the specific processing methods include but are not limited to: applying style transfer to the head for an abstract or cartoon effect, applying a 2D or 3D virtual avatar, applying a filter effect, applying a sticker effect, pasting a picture directly over the head, blurring the head, distorting the head, and the like.
105. When it is determined that the control request type initiated by the participant is not removing the virtual avatar, the system detects the participant's behavior state in real time;
In the embodiments of the application, when a participant already has a virtual avatar, besides passively replacing it in response to a replacement gesture initiated by the participant, the system can actively replace it with a virtual avatar matching the participant's current emotion by detecting emotional changes in the conference scene. The system therefore needs to detect the participant's behavior state in real time in order to judge from it whether the participant's current emotion is calm or excited.
For example, the behavior states the system is set to detect may be changes in the movement speed of the participant's limbs, changes in the loudness of the participant's speech, and the like.
106. When the behavior state is detected to meet the first preset condition, the system replaces the participant's virtual avatar.
In the embodiments of the application, when the system detects that a participant's behavior state meets the first preset condition, it determines that the participant's emotion is excited and automatically replaces the participant's virtual avatar with one better matching the current emotion, which adds interest to the conference.
The first preset condition may be, for example, that the participant's detected limb movement reaches a certain frequency per minute, or that the speaking volume reaches a certain decibel level; it is not specifically limited. When the system detects that a participant's behavior state meets the first preset condition, it replaces the participant's virtual avatar. For example, suppose the participant's virtual avatar is an expressionless girl and the participant normally speaks at 40 decibels; when the system detects that the participant's speech has reached 70 decibels, it determines that the participant's current emotion is running high and automatically replaces the expressionless girl avatar with some cartoon-character avatar with flames above its head.
In the embodiments of the application, the system can associate the face position information of a user participating in the video conference with the gesture the user performs, determine the type of control request initiated by the user through gesture recognition, and then process the face associated with the gesture according to the control request type. When the control request initiated by the participant is not removing the virtual avatar, the system detects the participant's behavior state in real time; when the behavior state is detected to meet the first preset condition, it replaces the participant's virtual avatar. This makes the video conference more engaging and improves the user's experience of participating in it.
Referring to fig. 2, the present application provides another processing method for avatars in a multi-person video conference. The method may be implemented by a system, a server or a terminal, which is not specifically limited; for convenience of description, the embodiments of the application take the system as the execution subject by way of example. The method comprises the following steps:
201. acquiring a first video of a multi-person video conference scene, and acquiring a first video frame through the first video, wherein the first video frame comprises face and hand pictures of all participants in the conference scene;
202. the system determines hand gesture information of the participant in the first video frame through a human gesture detection algorithm;
steps 201 to 202 in this embodiment are similar to steps 101 to 102 in the previous embodiment, and are not described again here.
203. The system determines the face positions of the participants according to the hand gesture information;
In the embodiments of the application, after the system acquires a first video frame during the multi-person video conference, the face positions of the participants can be determined from the first video frame so as to lock onto each participant's processable face region.
Furthermore, the system can choose which face position determination method to use according to the recognized hand gesture situation, reducing the computational cost to a certain extent. Two specific ways of determining the participants' face positions according to the hand gesture information are described in detail below.
It should be noted that the human body detection algorithm mentioned here may be any human posture recognition algorithm, and the face detection algorithm may be any face recognition algorithm, such as the eigenface method or the local binary pattern algorithm; neither is limited here.
First, when the hand gesture information includes the number of hand gestures, the system judges whether the number reaches a first preset value. If it does not, the system determines the face positions of the participants in the first video frame through a face detection algorithm; if it does, the system determines the bodies of the participants in the regions corresponding to the hand gestures through a human body detection algorithm and performs face detection on those bodies to obtain the participants' face positions;
specifically, a first preset value is preset as a defined value of the face position determination method, wherein the first preset value refers to a preset hand gesture number value, when the hand gesture number reaches the first preset value, it is proved that the number of participants needing to process head images is large, in order to improve the accuracy of correspondence between the hand gestures of the participants and face position information, after the hand gestures of the participants are recognized, a human body area connected with the hand gestures is recognized and detected through a human body detection algorithm, then the upper half part of the human body area is intercepted, and the upper half part of the image is input into a pre-constructed neural network model capable of recognizing facial features to track the face position. When the hand gesture quantity does not reach the first preset value, the number of the participants who need to process the head images is proved to be less, face recognition can be directly carried out on corresponding areas near the hand gestures of the participants, and the human body does not need to be detected through a human body detection algorithm.
Second, when the hand gesture information includes the distance between hand gestures, the system judges whether the distance reaches a second preset value. If it does, the system determines the face positions of the participants in the first video frame through a face detection algorithm; if it does not, the system determines the bodies of the participants in the regions corresponding to the hand gestures through a human body detection algorithm and performs face detection on those bodies to obtain the participants' face positions.
Specifically, the second preset value is preset as the deciding value for the face position determination method; it is a preset distance between hand gestures. When the distance between every two hand gestures reaches the second preset value, the gestures are far enough apart that each can be matched to a face directly; when the distance does not reach the second preset value, the participants are packed closely together and the hand gestures cannot reliably be matched to faces directly. The face position determination methods applied in these two cases are the same as those described in the first way.
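The claims present the gesture count and the gesture spacing as alternative criteria; the sketch below combines them into a single selector purely for illustration, with both preset values chosen arbitrarily:

```python
def choose_face_strategy(gesture_boxes, max_direct_gestures=3,
                         min_direct_spacing=200.0):
    """Pick between direct face detection and body-first detection.

    The thresholds (the "first"/"second preset values") are assumptions.
    gesture_boxes: list of (x, y, w, h) boxes around effective gestures.
    """
    if len(gesture_boxes) >= max_direct_gestures:
        return "body_then_face"   # many gestures: disambiguate via bodies
    centers = [(x + w / 2, y + h / 2) for x, y, w, h in gesture_boxes]
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            dx = centers[i][0] - centers[j][0]
            dy = centers[i][1] - centers[j][1]
            if (dx * dx + dy * dy) ** 0.5 < min_direct_spacing:
                return "body_then_face"  # gestures too close together
    return "direct_face"  # few, well-separated gestures
```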
204. The system determines the face keypoints of the participants according to the face position information;
205. The system determines the face action region according to the face keypoints;
206. The system determines the type of control request initiated by the participant according to the hand gesture information;
207. When the control request type is adding a virtual avatar, the system fills the face action region with any virtual avatar from the database; when the control request type is replacing the virtual avatar, the system fills the face action region with any virtual avatar from the database other than the original one; and when the control request type is removing the virtual avatar, the system hides the virtual avatar that fills the face action region;
in the embodiment of the present application, specifically, after the system determines the face position information of the participant, the system needs to determine the action area for the participant according to the face position, so that the action area can be subsequently operated according to the control request type. Specifically, the manner of determining the face region to which the participant can be acted may be: firstly, determining face key points of the participants according to the position coordinates of the face, wherein the face key points can be coordinates of center points of eyes, nose, ears, lips and the like, and the determination method of the face key points includes but is not limited to the following three ways: ASM (active Shape model) and AAM (active appearance model) based methods; a cascade shape regression-based method and a deep learning-based method.
Furthermore, to make the interaction between users in a multi-person video conference and the system more engaging, the system can collect a certain amount of avatar material in advance and build a database. When a participant initiates a control request to add a virtual avatar, the system randomly draws a virtual avatar from the database to fill the corresponding face action region. To respect the participant's aesthetic preferences, when the participant initiates a control request to replace the virtual avatar, the system draws again from the database while excluding the original virtual avatar, which increases the variety available to users during the video conference.
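A sketch of this selection logic, assuming the database is a simple list of avatar images:

```python
import random

def pick_avatar(avatar_db: list, current=None):
    """Draw a random avatar from the database; when replacing, exclude
    the participant's current avatar so the new one always differs."""
    candidates = [a for a in avatar_db if a is not current]
    if not candidates:
        raise ValueError("database needs at least two avatars to replace")
    return random.choice(candidates)
```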
It should also be noted that, in the embodiments of the application, in a multi-person video conference scene, the faces and gestures of multiple people can be recognized simultaneously, and avatar processing can be performed simultaneously for all participants who request it.
208. When the system determines that the type of control request initiated by the participant is not removing the virtual avatar, it detects the participant's behavior state in real time;
step 208 in this embodiment is similar to step 105 in the previous embodiment, and is not repeated here.
209. When the system detects that the behavior state meets the first preset condition, it replaces the participant's virtual avatar;
in the embodiment of the application, when the system detects that the behavior state of the participant meets the first preset condition, the emotion of the participant is determined to be excited, and the virtual avatar more conforming to the current emotion is automatically replaced for the participant, so that the interestingness of the conference is increased.
Two embodiments of the system for performing the virtual avatar replacement operation for the participants will be described below.
First, when the behavior state includes the participant's hand motion state and the first preset condition is a preset frequency:
The system judges whether the participant's hand motion frequency meets the preset frequency; if so, it determines that the participant's emotion has changed and replaces the participant's virtual avatar. For example, suppose the first preset condition is 25 hand movements per minute, the participant's virtual avatar is a static cartoon boy, and the participant's hands normally move 10 times per minute; when the system detects that the participant's hand motion frequency has reached 30 times per minute, it determines that the participant's current emotion is running high and automatically replaces the static cartoon boy avatar with a dynamic one.
Second, when the behavior state includes the participant's sound state and the first preset condition is a preset decibel level:
The system judges whether the decibel level of the sound emitted by the participant meets the preset decibel level; if so, it determines that the participant's emotion has changed and replaces the participant's virtual avatar. For example, suppose the first preset condition is that the sound reaches 70 decibels, the participant's virtual avatar is an expressionless girl, and the participant normally speaks at 40 decibels; when the system detects that the participant's speech has reached 70 decibels, it determines that the participant's current emotion is running high and automatically replaces the expressionless girl avatar with some cartoon-character avatar with flames above its head.
210. The system controls the virtual avatar to move as the face position moves.
In the embodiments of the application, whether the virtual avatar is static or dynamic, it moves along with the movement of the corresponding face position.
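One way to realize this, sketched as a per-frame alpha blend of a BGRA avatar image over the tracked face action region (OpenCV, with clipping at the frame border; the function name and interface are illustrative):

```python
import cv2

def overlay_avatar(frame_bgr, avatar_bgra, region):
    """Alpha-blend the avatar over the face action region, so the avatar
    follows the face whenever the tracked region moves."""
    x, y, w, h = region
    frame_h, frame_w = frame_bgr.shape[:2]
    x, y = max(x, 0), max(y, 0)
    w, h = min(w, frame_w - x), min(h, frame_h - y)
    if w <= 0 or h <= 0:
        return frame_bgr  # region entirely outside the frame
    avatar = cv2.resize(avatar_bgra, (w, h))
    alpha = avatar[:, :, 3:4].astype(float) / 255.0
    patch = frame_bgr[y:y + h, x:x + w].astype(float)
    blended = alpha * avatar[:, :, :3] + (1.0 - alpha) * patch
    frame_bgr[y:y + h, x:x + w] = blended.astype("uint8")
    return frame_bgr
```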
In the embodiments of the application, the user's gesture triggers the avatar processing operation remotely, and a corresponding database is provided so that each control request type can correspond to one or more trigger gestures, giving variety to the interaction between the user and the processing apparatus. In addition, the added virtual avatar moves along with the corresponding face position.
Having described the processing method for avatars in a multi-person video conference in the embodiments of the application, the processing apparatus in the embodiments of the application is described next:
referring to fig. 3, an embodiment of a processing apparatus for a multi-person video conference avatar in an embodiment of the present application includes:
the first acquiring unit 301 is configured to acquire a first video of a multi-person video conference scene, and acquire a first video frame through the first video, where the first video frame includes face and hand pictures of all participants in the conference scene;
a first determining unit 302, configured to determine, through a human gesture detection algorithm, hand gesture information of a participant in a first video frame, where the hand gesture information is gesture information matched with a predefined gesture;
the second determining unit 303 is configured to determine the face positions of the participants according to the hand gesture information;
a third determining unit 304, configured to determine a type of a control request initiated by a participant according to the hand gesture information, and perform a corresponding operation on the face position according to the type of the control request, where the type of the control request includes adding a virtual avatar, replacing the virtual avatar, and removing the virtual avatar;
a behavior detection unit 305 configured to detect a behavior state of the participant in real time when the third determination unit 304 determines that the type of the control request initiated by the participant is not to remove the virtual avatar;
a first executing unit 306, configured to execute an operation of replacing the virtual avatar for the participant when the behavior detecting unit 305 detects that the behavior state satisfies the first preset condition.
In this embodiment, after the first determining unit 302 determines the hand gesture information of a participant in the first video frame from the first acquiring unit 301, the second determining unit 303 determines the participant's face position information according to the hand gesture information. The third determining unit 304 then determines the type of control request initiated by the participant and performs the corresponding operation on the face position according to the control request type. When the third determining unit 304 determines that the control request type is not removing the virtual avatar, the behavior detection unit 305 detects the participant's behavior state in real time, and when the behavior detection unit 305 detects that the behavior state meets the first preset condition, the first execution unit 306 replaces the participant's virtual avatar. In a multi-person video conference scene, participants' virtual avatars can thus be changed automatically according to their behavior states, improving the users' experience of participating in the video conference.
Referring to fig. 4, an embodiment of another processing apparatus for a multi-person video conference avatar in an embodiment of the present application includes:
a first obtaining unit 401, configured to obtain a first video of a multi-person video conference scene, and obtain a first video frame through the first video, where the first video frame includes face and hand pictures of all participants in the conference scene;
a first determining unit 402, configured to determine, through a human gesture detection algorithm, hand gesture information of a participant in a first video frame, where the hand gesture information is gesture information matched with a predefined gesture;
a second determining unit 403, configured to determine the face positions of the participants according to the hand gesture information;
a fourth determining unit 404, configured to determine face key points of the participant according to the face position information;
a fifth determining unit 405, configured to determine a face acting region according to the face key point;
a third determining unit 406, configured to determine a type of a control request initiated by a participant according to the hand gesture information, and perform a corresponding operation on the face position according to the type of the control request, where the type of the control request includes adding a virtual avatar, replacing the virtual avatar, and removing the virtual avatar;
a behavior detection unit 407, configured to detect a behavior state of the participant in real time when the third determination unit 406 determines that the type of the control request initiated by the participant is not to remove the virtual avatar;
a first executing unit 408, configured to execute an operation of replacing the virtual avatar for the participant when the behavior detecting unit 407 detects that the behavior state satisfies the first preset condition;
and a movement control unit 409 for controlling the virtual avatar to move along with the movement of the face position.
In the embodiment of the present application, the second determining unit 403 may include:
a third judging module 4031, configured to judge whether the number of hand gestures reaches a first preset value;
a fourth execution module 4032, configured to determine, when the third determination module 4031 determines that the number of hand gestures does not reach the first preset value, the face position of the participant in the first video frame through a face detection algorithm;
a fifth execution module 4033, configured to, when the third determination module 4031 determines that the number of the hand gestures reaches the first preset value, determine, through a human body detection algorithm, human bodies of the participants in the areas corresponding to the hand gestures, respectively, and perform face detection on the human bodies of the participants in the corresponding areas, so as to obtain face positions of the participants.
In this embodiment of the application, the second determining unit 403 may further include:
a fourth judging module 4034, configured to judge whether the hand gesture distance reaches a second preset value;
a sixth executing module 4035, configured to determine, when the fourth determining module 4034 determines that the hand gesture distance reaches the second preset value, the face position of the participant in the first video frame through a face detection algorithm;
a seventh executing module 4036, configured to, when the fourth determining module 4034 determines that the hand gesture distance does not reach the second preset value, respectively determine, through a human body detection algorithm, the human bodies of the participants in the areas corresponding to the hand gestures, and perform face detection on the human bodies of the participants in the corresponding areas, so as to obtain the face positions of the participants.
In this embodiment, the first execution unit 408 may include:
the first judging module 4081 is configured to judge whether the hand motion frequency of the participant meets a preset frequency;
the second executing module 4082 is configured to, when the first determining module 4081 determines that the hand motion frequency of the participant meets the preset frequency, execute an operation of replacing the virtual avatar for the participant.
In this embodiment of the application, the first execution unit 408 may further include:
the second judging module 4083 is configured to judge whether the decibel of the sound emitted by the participant meets a preset decibel;
a third executing module 4084, configured to execute an operation of replacing the virtual avatar for the participant when the second determining module 4083 determines that the decibel of the sound emitted by the participant satisfies the preset decibel.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an embodiment of a processing apparatus for a multi-person video conference head portrait, which includes:
a processor 501, a memory 502, an input/output unit 503, and a bus 504;
the processor 501 is connected with the memory 502, the input/output unit 503 and the bus 504;
the processor 501 specifically performs the following operations:
acquiring a first video of a multi-person video conference scene, and acquiring a first video frame through the first video, wherein the first video frame comprises face and hand pictures of all participants in the conference scene;
determining hand gesture information of the participant in the first video frame through a human gesture detection algorithm, wherein the hand gesture information is gesture information matched with a predefined gesture;
determining the face positions of the participants according to the hand gesture information;
determining the type of a control request initiated by a participant according to the hand gesture information, and executing corresponding operation on the face position according to the type of the control request, wherein the type of the control request comprises adding a virtual head portrait, replacing the virtual head portrait and removing the virtual head portrait;
when it is determined that the type of the control request initiated by the participant is not removing the virtual head portrait, detecting the behavior state of the participant in real time;
and when it is detected that the behavior state meets the first preset condition, performing the operation of replacing the virtual head portrait for the participant.
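Tying the processor's steps together, one iteration of the avatar-processing loop might look like the following sketch. Every callable on the ctx object is an assumed interface, not something the patent specifies, and faces are assumed to be returned aligned one-per-gesture:

```python
ADD, REPLACE, REMOVE = "add", "replace", "remove"  # control request types

def process_frame(frame, ctx):
    """One pass over a video frame, following the steps listed above."""
    gestures = ctx.detect_gestures(frame)        # gestures matching predefined ones
    faces = ctx.locate_faces(frame, gestures)    # face positions, one per gesture
    for gesture, face in zip(gestures, faces):
        request = ctx.classify_request(gesture)  # map gesture -> request type
        if request == ADD:
            ctx.render_avatar(frame, face, ctx.pick_avatar())
        elif request == REPLACE:
            ctx.render_avatar(frame, face, ctx.pick_other_avatar(face))
        elif request == REMOVE:
            ctx.hide_avatar(frame, face)
        if request != REMOVE and ctx.behavior_meets_preset(face):
            # Behavior state (motion frequency or sound level) met the
            # first preset condition: swap in a different avatar.
            ctx.render_avatar(frame, face, ctx.pick_other_avatar(face))
    return frame
```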
In this embodiment, the functions of the processor 501 correspond to the steps in the embodiments shown in fig. 1 to fig. 2, and are not described herein again.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a program which, when executed on a computer, performs the processing method shown in the foregoing fig. 1 to fig. 2.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical division, and other divisions are possible in practice; multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, devices, or units, and may be electrical, mechanical, or in another form.
The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Claims (10)

1. A method for processing a multi-person video conference head portrait is characterized by comprising the following steps:
acquiring a first video of a multi-person video conference scene, and acquiring a first video frame through the first video, wherein the first video frame comprises face and hand pictures of all participants in the conference scene;
determining hand gesture information of the participant in the first video frame through a human gesture detection algorithm, wherein the hand gesture information is gesture information matched with a predefined gesture;
determining the face position of the participant according to the hand gesture information;
determining a control request type initiated by the participant according to the hand gesture information, and executing corresponding operation on the face position according to the control request type, wherein the control request type comprises adding a virtual head portrait, replacing the virtual head portrait and removing the virtual head portrait;
when it is determined that the control request type initiated by the participant is not removing the virtual head portrait, detecting the behavior state of the participant in real time;
and when it is detected that the behavior state meets a first preset condition, performing the operation of replacing the virtual head portrait for the participant.
2. The method of claim 1, wherein the behavior state comprises a hand motion state of the participant, and the first preset condition is a preset frequency;
when it is detected that the behavior state meets the first preset condition, performing the operation of replacing the virtual head portrait for the participant comprises:
judging whether the hand motion frequency of the participant meets the preset frequency;
and if so, performing the operation of replacing the virtual head portrait for the participant.
3. The method of claim 1, wherein the behavior state comprises a sound state of the participant, and the first preset condition is a preset decibel value;
when it is detected that the behavior state meets the first preset condition, performing the operation of replacing the virtual head portrait for the participant comprises:
judging whether the decibel level of the sound emitted by the participant meets the preset decibel value;
and if so, performing the operation of replacing the virtual head portrait for the participant.
4. The method for processing the head portrait for the multi-person video conference as claimed in claim 2 or 3, wherein after the operation of replacing the virtual head portrait is performed for the participant, the method further comprises:
controlling the virtual head portrait to move as the face position moves.
5. The method of claim 4, wherein the hand gesture information comprises a number of hand gestures;
determining the face positions of the participants according to the hand gesture information comprises:
judging whether the number of hand gestures reaches a first preset value;
if it is determined that the number of hand gestures does not reach the first preset value, determining the face position of the participant in the first video frame through a face detection algorithm;
and if it is determined that the number of hand gestures reaches the first preset value, determining the human body of each participant in the area corresponding to each hand gesture through a human body detection algorithm, and performing face detection on the human body in each corresponding area to obtain the face positions of the participants.
6. The method of claim 4, wherein the hand gesture information comprises a hand gesture distance;
determining the face positions of the participants according to the hand gesture information comprises:
judging whether the hand gesture distance reaches a second preset value;
if it is determined that the hand gesture distance reaches the second preset value, determining the face position of the participant in the first video frame through a face detection algorithm;
and if it is determined that the hand gesture distance does not reach the second preset value, determining the human body of each participant in the area corresponding to each hand gesture through a human body detection algorithm, and performing face detection on the human body in each corresponding area to obtain the face positions of the participants.
7. The method for processing the avatar for the multi-person video conference as claimed in claim 1, wherein before determining the type of the control request initiated by the participant according to the hand gesture information, the method further comprises:
determining face key points of the participant according to the face position information;
and determining a face action region according to the face key points.
8. The method of claim 7, wherein the predefined gesture is stored in a database;
performing the corresponding operation on the face position according to the control request type comprises at least one of the following:
when the control request type is adding a virtual head portrait, setting any virtual head portrait in the database to fill the face action region;
when the control request type is replacing the virtual head portrait, setting any virtual head portrait in the database other than the original virtual head portrait to fill the face action region;
and when the control request type is removing the virtual head portrait, hiding the virtual head portrait that fills the face action region.
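For claims 7 and 8, "filling" the face action region can be read as compositing an avatar image over that region. A minimal alpha-blend sketch with NumPy follows; the blend itself is an assumption, since the claims do not fix a rendering method:

```python
import numpy as np

def fill_face_region(frame, region, avatar_rgba):
    """Alpha-blend an RGBA avatar over the face action region.

    frame: HxWx3 uint8 image; region: (x, y, w, h) face action region;
    avatar_rgba: h x w x 4 uint8 image, already resized to (h, w) by the caller.
    """
    x, y, w, h = region
    rgb = avatar_rgba[..., :3].astype(np.float32)
    alpha = avatar_rgba[..., 3:4].astype(np.float32) / 255.0
    roi = frame[y:y + h, x:x + w].astype(np.float32)
    frame[y:y + h, x:x + w] = (alpha * rgb + (1.0 - alpha) * roi).astype(np.uint8)
    return frame
```

Removing the virtual head portrait then reduces to re-rendering the region from the original, un-overlaid frame content.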
9. A device for processing a multi-person video conference avatar, comprising:
the first acquisition unit is used for acquiring a first video of a multi-person video conference scene and acquiring a first video frame through the first video, wherein the first video frame comprises face and hand pictures of all participants in the conference scene;
the first determining unit is used for determining hand gesture information of the participant in the first video frame through a human gesture detection algorithm, wherein the hand gesture information is gesture information matched with a predefined gesture;
the second determining unit is used for determining the face position of the participant according to the hand gesture information;
a third determining unit, configured to determine the control request type initiated by the participant according to the hand gesture information, and perform a corresponding operation on the face position according to the control request type, wherein the control request type comprises adding a virtual head portrait, replacing the virtual head portrait, and removing the virtual head portrait;
a behavior detection unit, configured to detect the behavior state of the participant in real time when the third determining unit determines that the control request type initiated by the participant is not removing the virtual head portrait;
and a first execution unit, configured to perform the operation of replacing the virtual head portrait for the participant when the behavior detection unit detects that the behavior state meets a first preset condition.
10. A processing device for a multi-person video conference avatar, the processing device comprising:
the device comprises a processor, a memory, an input and output unit and a bus;
the processor is connected with the memory, the input and output unit and the bus;
the memory holds a program that the processor calls to execute the processing method according to any one of claims 1 to 8.
CN202111571415.0A 2021-12-21 2021-12-21 Processing method and processing device for head portrait of multi-person video conference Pending CN114419694A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111571415.0A CN114419694A (en) 2021-12-21 2021-12-21 Processing method and processing device for head portrait of multi-person video conference

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111571415.0A CN114419694A (en) 2021-12-21 2021-12-21 Processing method and processing device for head portrait of multi-person video conference

Publications (1)

Publication Number Publication Date
CN114419694A true CN114419694A (en) 2022-04-29

Family

ID=81268285

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111571415.0A Pending CN114419694A (en) 2021-12-21 2021-12-21 Processing method and processing device for head portrait of multi-person video conference

Country Status (1)

Country Link
CN (1) CN114419694A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001160154A (en) * 1999-12-02 2001-06-12 Nippon Telegr & Teleph Corp <Ntt> Avatar display device in virtual space communication system, avatar displaying method and storage medium
TW201403379A (en) * 2012-04-01 2014-01-16 Intel Corp Analyzing human gestural commands
CN107636684A (en) * 2015-03-18 2018-01-26 阿凡达合并第二附属有限责任公司 Emotion identification in video conference
CN108377356A (en) * 2018-01-18 2018-08-07 上海掌门科技有限公司 Method and apparatus based on the video calling virtually drawn a portrait
CN110083243A (en) * 2019-04-29 2019-08-02 深圳前海微众银行股份有限公司 Exchange method, device, robot and readable storage medium storing program for executing based on camera
CN110399810A (en) * 2019-07-08 2019-11-01 湖北盟道信息科技有限公司 A kind of auxiliary magnet name method and device
CN112507829A (en) * 2020-11-30 2021-03-16 株洲手之声信息科技有限公司 Multi-person video sign language translation method and system
CN112597912A (en) * 2020-12-26 2021-04-02 中国农业银行股份有限公司 Conference content recording method, device, equipment and storage medium
CN113747112A (en) * 2021-11-04 2021-12-03 珠海视熙科技有限公司 Processing method and processing device for head portrait of multi-person video conference


Similar Documents

Publication Publication Date Title
US11595617B2 (en) Communication using interactive avatars
CN110390704B (en) Image processing method, image processing device, terminal equipment and storage medium
US11657557B2 (en) Method and system for generating data to provide an animated visual representation
US20080136895A1 (en) Mute Function for Video Applications
CN110555507B (en) Interaction method and device for virtual robot, electronic equipment and storage medium
CN111583355B (en) Face image generation method and device, electronic equipment and readable storage medium
WO2017072534A2 (en) Communication system and method
CN105451090B (en) Image processing method and image processing apparatus
CN110794964A (en) Interaction method and device for virtual robot, electronic equipment and storage medium
CN110545442A (en) live broadcast interaction method and device, electronic equipment and readable storage medium
KR102419906B1 (en) User image data matching method in metaverse based office environment, storage medium in which a program executing the same, and user image data matching system including the same
CN114615455A (en) Teleconference processing method, teleconference processing device, teleconference system, and storage medium
CN109039851B (en) Interactive data processing method and device, computer equipment and storage medium
CN113747112B (en) Processing method and processing device for head portrait of multi-person video conference
CN114419694A (en) Processing method and processing device for head portrait of multi-person video conference
KR102419932B1 (en) Display control method in metaverse based office environment, storage medium in which a program executing the same, and display control system including the same
CN113176827B (en) AR interaction method and system based on expressions, electronic device and storage medium
KR102345729B1 (en) Method and apparatus for generating video
CN111614926B (en) Network communication method, device, computer equipment and storage medium
CN113079383A (en) Video processing method and device, electronic equipment and storage medium
WO2023074898A1 (en) Terminal, information processing method, program, and recording medium
US20230419580A1 (en) Systems and Methods for Implementing a Virtual Avatar Model for a Video Conference Session
CN115999156B (en) Role control method, device, equipment and storage medium
CN114040145B (en) Video conference portrait display method, system, terminal and storage medium
CN113989463A (en) Conference display method, device and system and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination