CN113014857A - Control method and device for video conference display, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113014857A
CN113014857A
Authority
CN
China
Prior art keywords
speaker
identity
video conference
identity information
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110215105.9A
Other languages
Chinese (zh)
Inventor
梅书慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Youmi Technology Shenzhen Co ltd
Original Assignee
Youmi Technology Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Youmi Technology Shenzhen Co ltd filed Critical Youmi Technology Shenzhen Co ltd
Priority to CN202110215105.9A
Publication of CN113014857A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H04N 7/15 Conference systems
    • H04N 7/152 Multipoint control units therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/683 Retrieval characterised by using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata automatically derived from the content
    • G06F 16/7837 Retrieval characterised by using metadata automatically derived from the content, using objects detected or recognised in the video content
    • G06F 16/784 Retrieval characterised by using metadata automatically derived from the content, the detected or recognised objects being people
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation


Abstract

The application discloses a control method and device for video conference display, an electronic device, and a storage medium. The control method for video conference display comprises the following steps: acquiring media data of a video conference, wherein the media data comprises video data and audio data; acquiring first identity information according to the video data; acquiring second identity information according to the audio data; determining a speaker of the video conference according to the first identity information and the second identity information; and displaying an image picture containing the face of the speaker according to a received display instruction. By determining the speaker from identity information obtained independently from the video data and from the audio data, and displaying the image picture containing the speaker's face according to the received display instruction, the method improves the accuracy of speaker recognition during the video conference, and in turn the accuracy with which the speaker's image picture is displayed.

Description

Control method and device for video conference display, electronic equipment and storage medium
Technical Field
The present application relates to the field of multimedia communication technologies, and in particular, to a method and an apparatus for controlling video conference display, an electronic device, and a storage medium.
Background
In a video conference, the speaker is the party the conference focuses on: participants need to attend to the speaker's content, such as a presentation (PowerPoint, PPT), while the speaker's expressions, actions, and voice can all convey important information in real time. The image picture of the speaker is therefore also content the participants focus on.
A traditional video conference system captures the participants with a camera, locates the speaker of the video conference with a microphone array, and displays the located speaker's image within the video conference picture as the speaker's image picture. However, localization by microphone array is susceptible to environmental noise and therefore inaccurate, so the speaker of the video conference may be located incorrectly and the wrong image picture displayed, reducing the display accuracy of the speaker's image picture during the video conference.
Disclosure of Invention
In view of the above problems, the present application provides a control method and apparatus for video conference display, an electronic device, and a storage medium. The method determines the speaker of the video conference from first identity information and second identity information obtained respectively from the video data and the audio data of the conference, and displays an image picture containing the speaker's face according to a received display instruction. Because the speaker is determined from voice recognition and video recognition jointly, the adverse effect of environmental noise on the recognition process is eliminated and speaker recognition is highly accurate, which improves both the recognition accuracy of the speaker and the display accuracy of the speaker's image picture during the video conference.
In a first aspect, an embodiment of the present application provides a method for controlling video conference display, including: acquiring media data of a video conference, wherein the media data comprises video data and audio data; acquiring first identity information according to the video data; acquiring second identity information according to the audio data; determining a speaker of the video conference according to the first identity information and the second identity information; and displaying an image picture containing the face of the speaker according to the received display instruction.
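The five claimed steps can be sketched as a short control routine. This is a hypothetical illustration only, not the patent's implementation; the extractor and display callables are stand-ins for the recognition and display modules described later.

```python
def control_videoconference_display(video_data, audio_data,
                                    extract_face_identity,
                                    extract_voice_identity,
                                    display):
    """Sketch of the claimed method: acquire media data, derive an
    identity from each modality, and display the speaker only when
    both modalities agree (helper names are hypothetical)."""
    first_identity = extract_face_identity(video_data)    # from video data
    second_identity = extract_voice_identity(audio_data)  # from audio data
    if first_identity is not None and first_identity == second_identity:
        display(first_identity)  # show image picture containing the face
        return first_identity
    return None  # modalities disagree: no speaker determined
```

Requiring agreement between the two modalities is what gives the claimed robustness: a noise burst that fools the audio path alone does not change the displayed speaker.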
In a second aspect, an embodiment of the present application provides a control apparatus for video conference display, including: the first acquisition module is used for acquiring media data of the video conference, wherein the media data comprises video data and audio data; the second acquisition module is used for acquiring the first identity information according to the video data; the third acquisition module is used for acquiring second identity information according to the audio data; the determining module is used for determining a speaker of the video conference according to the first identity information and the second identity information; and the display module is used for displaying an image picture containing the face of the speaker according to the received display instruction.
In a third aspect, an embodiment of the present application provides an electronic device, including: a memory; one or more processors coupled with the memory; one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of controlling a video conference display as provided in the first aspect above.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a program code is stored in the computer-readable storage medium, and the program code may be called by a processor to execute the control method for video conference display as provided in the first aspect.
According to the scheme provided by the application, media data of the video conference is acquired, the media data comprising video data and audio data; first identity information is acquired according to the video data and second identity information according to the audio data; the speaker of the video conference is determined according to the first identity information and the second identity information; and an image picture containing the speaker's face is displayed according to the received display instruction. Because the speaker is determined from voice recognition and video recognition jointly, the adverse effect of environmental noise on the recognition process is eliminated and speaker recognition is highly accurate, so both the recognition accuracy of the speaker and the display accuracy of the speaker's image picture during the video conference are improved.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 shows a schematic operating environment diagram of a video conference system according to an embodiment of the present application.
Fig. 2 is a flowchart illustrating a method for controlling a video conference display according to an embodiment of the present application.
Fig. 3 shows another flowchart of a control method for video conference display according to an embodiment of the present application.
Fig. 4 shows another flowchart of a control method for video conference display provided in an embodiment of the present application.
Fig. 5 shows a schematic display manner of a video conference in an application scenario according to an embodiment of the present application.
Fig. 6 shows a schematic display manner of a video conference in another application scenario provided by the embodiment of the present application.
Fig. 7 shows a schematic diagram of program modules of a control device for a video conference display according to an embodiment of the present application.
Fig. 8 shows a functional module schematic diagram of an electronic device according to an embodiment of the present application.
Fig. 9 illustrates a computer-readable storage medium storing or carrying program codes for implementing the control method for video conference display according to the embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
A video conference connects two or more individuals or groups in different places through a video conference system so that they communicate as if in the same conference room: both parties of the conversation can see and hear each other, including each other's expressions and actions. For example, voice, image, and file data are transmitted over transmission lines and multimedia equipment, realizing real-time, interactive communication.
A traditional video conference system captures the participants with a camera, locates the speaker of the video conference with a microphone array, and displays the located speaker's image within the video conference picture as the speaker's image picture. However, localization by microphone array is susceptible to environmental noise and therefore inaccurate, so the speaker of the video conference may be located incorrectly and the wrong image picture displayed, reducing the display accuracy of the speaker's image picture during the video conference.
In view of the above problems, after long study the inventors propose the control method, apparatus, electronic device, and storage medium for video conference display provided by the embodiments of the present application. First identity information and second identity information are acquired respectively from the captured video data and audio data of the video conference, the speaker is determined according to the two pieces of identity information, and an image picture containing the speaker's face is displayed. Because the speaker is determined from voice recognition and video recognition jointly, the adverse effect of environmental noise on the recognition process is eliminated, speaker recognition is highly accurate, and both the recognition accuracy of the speaker and the display accuracy of the speaker's image picture during the video conference are improved.
Referring to fig. 1, a schematic diagram of an application scenario provided in an embodiment of the present application is shown. The application scenario includes a video conference system 100, which may include a control module 110, a video data acquisition module 120, an audio data acquisition module 130, and a display module 140. The control module 110 may communicate with the video data acquisition module 120, the audio data acquisition module 130, and the display module 140 through a network, realizing data interaction between the control module 110 and each of the other modules and thereby the video conference process. The video data acquisition module 120 may include various image sensors, cameras, video recorders, and the like; the audio data acquisition module 130 may include a microphone, a device with a built-in audio capture application, and the like; the display module 140 may include a Cathode Ray Tube (CRT) display, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, a three-dimensional (3D) display, and the like; the network may include a Controller Area Network (CAN), a Bluetooth network, an infrared network, a Digital Living Network Alliance (DLNA) network, a ZigBee network, a Wide Area Network (WAN), a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wireless Personal Area Network (WPAN), and the like, without limitation.
When the video conference system 100 is used for a video conference, the control module 110 controls the video data acquisition module 120 to collect video data and the audio data acquisition module 130 to collect audio data during the conference. The control module acquires first identity information from the collected video data and second identity information from the collected audio data, determines the speaker of the video conference from the two, and displays an image picture containing the speaker's face on the display module 140 according to a received display instruction, thereby realizing control of the video conference display.
Please refer to fig. 2, which shows a flowchart of a method for controlling video conference display according to an embodiment of the present application. The method determines the speaker of the video conference from its video data and audio data and displays an image picture containing the speaker's face according to a received display instruction. Because the speaker is determined from voice recognition and video recognition jointly, the adverse effect of environmental noise on the recognition process is eliminated, speaker recognition is highly accurate, and both the recognition accuracy of the speaker and the display accuracy of the speaker's image picture during the video conference are improved. In a specific embodiment, the method can be applied to the video conference system 100 shown in fig. 1. The flow shown in fig. 2 is described in detail below, taking the video conference system 100 as an example; the method may include the following steps S210 to S250.
Step S210: media data for the video conference is obtained.
In the embodiment of the present application, the media data may include video data as well as audio data. Video data is a continuous image sequence: essentially a set of consecutive images that carry no structural information beyond the order in which they appear. The image sequence is a frame sequence; video data consists of a large number of mutually related, consecutive frames.
The process of digitizing sound is one of performing analog-to-digital conversion (ADC) on the continuous analog audio signal from a microphone or other device at a certain frequency to obtain audio data; playing the digitized sound converts the audio data back into an analog audio signal through digital-to-analog conversion (DAC). There are two important metrics in digitizing sound, namely the sampling frequency (Sampling Rate) and the sampling size (Sampling Size). Common ways of collecting audio data include directly acquiring existing audio, capturing or intercepting sound with audio processing software, recording sound with a microphone, and the like.
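The two metrics above fix the data rate of uncompressed digitized sound; a minimal sketch of that arithmetic (the function name is illustrative, not from the patent):

```python
def pcm_byte_rate(sampling_rate_hz, sample_size_bits, channels=1):
    """Bytes of audio data produced per second of uncompressed PCM:
    sampling rate x bytes per sample x channel count."""
    return sampling_rate_hz * (sample_size_bits // 8) * channels

# e.g. CD-quality stereo: 44100 Hz, 16-bit samples, 2 channels
rate = pcm_byte_rate(44100, 16, 2)  # 176400 bytes per second
```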
In this embodiment of the application, when a video conference system is used for a video conference, the control module may send a video data acquisition instruction to the video data acquisition module and an audio data acquisition instruction to the audio data acquisition module. In response to the received instruction, the video data acquisition module collects video data of the video conference and returns the collected video data to the control module; likewise, the audio data acquisition module collects audio data of the video conference and returns the collected audio data to the control module.
Step S220: first identity information is obtained according to the video data.
In the embodiment of the application, after the control module acquires the video data of the video conference, a speaker of the video conference can be determined according to the video data, and the first identity information can be acquired according to a face image of the speaker. The first identity information may include facial feature information, which may serve as a feature identifier of the conference participants. A speaker is understood to be a speaking participant in the video conference, and there may be more than one; for example, a participant whose mouth is moving in the video data may be determined to be a speaker.
In some embodiments, the control module may determine whether a human body exists in the video conference according to the video data before determining the speaker of the video conference according to the video data, and may determine the speaker from the human body of the video conference according to the video data when determining that the human body exists in the video conference.
As an implementation manner, the control module may input the acquired video data into a pre-trained first feature extraction model. In response to the received video data, the model extracts video feature information of the video data and returns it to the control module. The control module compares the received video feature information with pre-stored human feature information to obtain a first association degree representing the association between them, and determines whether the video feature information is feature information of a human body according to the first association degree and a preset first threshold: when the first association degree is greater than or equal to the first threshold, the video feature information is determined to be feature information of a human body; when it is smaller than the first threshold, the video feature information is determined to be feature information of an object.
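The patent does not fix how the "first association degree" is computed; cosine similarity between feature vectors is one common choice, sketched below under that assumption (the threshold value is likewise illustrative):

```python
import math

def association_degree(video_features, human_features):
    """Cosine similarity between an extracted video feature vector and
    a pre-stored human feature vector: one plausible realisation of the
    'first association degree' (the patent leaves the metric open)."""
    dot = sum(a * b for a, b in zip(video_features, human_features))
    norm = (math.sqrt(sum(a * a for a in video_features))
            * math.sqrt(sum(b * b for b in human_features)))
    return dot / norm if norm else 0.0

def is_human(video_features, human_features, first_threshold=0.8):
    """Human body iff the association degree meets the first threshold."""
    return association_degree(video_features, human_features) >= first_threshold
```

The same compare-against-threshold pattern applies to the audio path in step S230, with voiceprint features and the second threshold in place of the visual ones.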
The video feature information may include color information, texture information, shape information, spatial relationship information, and the like. The first feature extraction model may be a neural network, a Long Short-Term Memory (LSTM) network, a gated recurrent unit, a simple recurrent unit, an auto-encoder, a Decision Tree (DT), a random forest, feature-mean classification, a classification and regression tree, a hidden Markov model, a K-Nearest Neighbor (KNN) algorithm, a logistic regression model, Naive Bayes (NB), a Support Vector Machine (SVM), a Gaussian model, Kullback-Leibler (KL) divergence, or the like. Of course, the specific feature extraction model is not limited here.
Step S230: and acquiring second identity information according to the audio data.
In this embodiment of the application, the second identity information may include voiceprint feature information, and after the control module acquires the audio data of the video conference, the control module may determine, according to the audio data, a participant corresponding to the audio data, and acquire the voiceprint feature information of the participant.
In some embodiments, before determining the participant corresponding to the audio data, the control module may first determine from the audio data whether the sound it carries is a human voice; only when the sound is determined to be a human voice is the source of the audio data treated as a participant.
As an implementation manner, the control module may input the acquired audio data into a pre-trained second feature extraction model. In response to the received audio data, the model extracts voiceprint feature information of the audio data and returns it to the control module. The control module compares the extracted voiceprint feature information with pre-stored human voiceprint feature information to obtain a second association degree representing the association between them, and determines whether the sound corresponding to the audio data is a human voice according to the second association degree and a preset second threshold. When the sound is determined to be a human voice, the voiceprint feature information extracted from the audio data is used as the second identity information.
The second feature extraction model may have substantially the same model structure as the first feature extraction model. When the second relevance is greater than or equal to a second threshold value, determining that the sound corresponding to the audio data is human voice; and when the second relevance is smaller than a second threshold value, determining that the sound corresponding to the audio data is ambient noise. The environmental noise may include a ring tone, an animal sound, a whistling sound, a thunder sound, a wind sound, a rain sound, and the like, which is not limited herein.
It should be noted that there is no sequence between step S220 and step S230, and the first identity information may be obtained according to the video data after the second identity information is obtained according to the audio data, which is not limited herein.
Step S240: and determining the speaker of the video conference according to the first identity information and the second identity information.
In the embodiment of the application, after acquiring the first identity information and the second identity information, the control module may perform face recognition on the facial feature information to acquire a first identity representing the first identity information, and perform voice recognition on the voiceprint feature information to acquire a second identity representing the second identity information. When the first identity is the same as the second identity, the participant corresponding to the first identity information and the participant corresponding to the second identity information are the same person, and that participant can be determined to be the speaker of the video conference. The speaker is thus determined from the first identity acquired by face recognition together with the second identity acquired by voice recognition; because voice recognition and video recognition are used jointly, the adverse effect of environmental noise on the identification process is eliminated, and the recognition accuracy of the speaker in the video conference is improved. The first identity and the second identity may each include at least one of a name, a conference serial number, a direction, and the like.
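The cross-check between the two identities can be sketched as follows; identity values (names, serial numbers) and the function name are illustrative, and the face path is assumed to yield a set of candidates since several faces may be in frame:

```python
def determine_speaker(face_identities, voice_identity):
    """Declare a speaker only when the identity recognised from the
    voiceprint (second identity) also appears among the identities
    recognised from faces in the video (first identities)."""
    if voice_identity is not None and voice_identity in face_identities:
        return voice_identity
    return None  # e.g. the sound was noise or came from off-camera
```

A sound classified as environmental noise in step S230 never produces a voice identity, so under this rule it can never change the displayed speaker.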
Step S250: an image screen including the face of the speaker is displayed in accordance with the received display instruction.
In the embodiment of the application, after the control module determines the speaker of the video conference according to the first identity information and the second identity information, the control module can display the image picture containing the face of the speaker on the display module according to the received display instruction, so that the display accuracy of the image picture of the speaker in the video conference process can be improved. The display instruction is used for indicating a video conference display mode for displaying an image picture including the face of the speaker, and the display instruction can be determined by a video conference system or can be sent by a user according to the requirement of the user.
According to the scheme provided by the application, media data of the video conference is acquired, the media data comprising video data and audio data; first identity information is acquired according to the video data and second identity information according to the audio data; the speaker of the video conference is determined according to the first identity information and the second identity information; and an image picture containing the speaker's face is displayed according to the received display instruction. Because the speaker is determined from voice recognition and video recognition jointly, the adverse effect of environmental noise on the recognition process is eliminated and speaker recognition is highly accurate, so both the recognition accuracy of the speaker and the display accuracy of the speaker's image picture during the video conference are improved.
Referring to fig. 3, which shows a flowchart of a method for controlling a video conference display according to another embodiment of the present application, the method is applied to the video conference system 100 shown in fig. 1. The flowchart shown in fig. 3 will be described in detail below, taking the video conference system 100 as an example. The method may include the following steps S310 to S370.
Step S310: media data for the video conference is obtained.
In this embodiment, the step S310 may refer to the content of the corresponding step in the foregoing embodiments, and is not described herein again.
Step S320: and determining the speaker in the image picture of the participants according to the video data.
In this embodiment, the control module may obtain a face image of a participant of the video conference according to the obtained video data, and determine a speaker of the participant according to a change value of a face feature (such as a lip feature) in the face image within a set time.
In some embodiments, the video feature information may include face images. The control module may input the video data to a first feature extraction model; in response to the received video data, the first feature extraction model extracts at least one face image from the video data, each face image corresponding to the face of one participant, obtains lip feature information of each face image, and returns each face image together with its lip feature information to the control module. The control module may then determine the speaker in the image picture of the participants according to the change value of the lip feature information between image pictures of adjacent frames and a preset change threshold.
As an embodiment, the control module may select, from the image pictures of the participants, the participant whose face image shows a change value of the lip feature information greater than or equal to the preset change threshold as the speaker. For example, the lip feature information may include lip pixel positions. The control module may obtain a change value of the lip pixel positions for each participant from the lip pixel positions in that participant's image pictures of adjacent frames, select those change values that are greater than or equal to the preset change threshold as target lip pixel position change values, and take the participants whose face images correspond to the target change values as speakers.
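The threshold test described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names and the use of mean Euclidean displacement of lip landmark pixels as the "change value" are assumptions.

```python
import math

def lip_change(prev_lips, curr_lips):
    # Mean Euclidean displacement of lip landmark pixels between adjacent frames;
    # this serves as the "change value of the lip pixel positions".
    return sum(math.dist(p, c) for p, c in zip(prev_lips, curr_lips)) / len(prev_lips)

def select_speakers(prev_frame, curr_frame, change_threshold=3.0):
    # prev_frame / curr_frame map participant id -> list of (x, y) lip pixel positions.
    # A participant whose change value >= the preset change threshold is taken as a speaker.
    return [pid for pid, lips in curr_frame.items()
            if pid in prev_frame
            and lip_change(prev_frame[pid], lips) >= change_threshold]
```

For instance, a participant whose lip pixels move 5 pixels between frames would exceed a threshold of 3 and be selected, while a participant whose lips are still would not.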
Step S330: first identity information is determined according to a face image of a speaker.
In this embodiment, the first identity information includes facial feature information corresponding to the speaker. After determining the speaker from the image pictures of the participants according to the video data, the control module may perform feature extraction on the face image corresponding to the speaker and obtain the facial feature information corresponding to the speaker.
Step S340: and acquiring second identity information according to the audio data.
Step S350: and determining the identity of the speaker of the video conference according to the first identity represented by the first identity information and the second identity represented by the second identity information.
In this embodiment, step S340 and step S350 may refer to the steps corresponding to the foregoing embodiments, and are not described herein again.
Step S360: and acquiring the data of the speaker according to the identity of the speaker.
In this embodiment, the data of the speaker may include text information (such as name, title, educational background, work history, etc.), picture information (photos, and pictures related to the video conference), and the like. After the control module determines the identity of the speaker of the video conference according to the first identity represented by the first identity information and the second identity represented by the second identity information, it can obtain the pre-stored data of the speaker according to the identity of the speaker.
In some embodiments, the video conference system may further include a storage module, the data of the speaker is stored in the storage module in advance, and the control module may obtain the data of the speaker from the storage module according to the identity of the speaker.
In some embodiments, the data of the speaker may be stored in advance on a World Wide Web (WWW) server or a cloud server. The control module may be connected to the server through the Internet, and may obtain the data of the speaker from it according to the identity of the speaker.
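The lookup in step S360 — local storage module first, remote server as a fallback — can be sketched as follows. The function and parameter names are assumptions for illustration; the actual storage layout and network protocol are not specified in the original.

```python
def get_speaker_profile(identity, local_store, fetch_remote=None):
    # Look up pre-stored speaker data (name, title, background, photos) by identity.
    # local_store: a mapping kept in the storage module.
    # fetch_remote: optional callable standing in for a request to a web/cloud server.
    profile = local_store.get(identity)
    if profile is None and fetch_remote is not None:
        profile = fetch_remote(identity)  # e.g. an HTTP call over the Internet
    return profile
```

The design keeps the control module indifferent to where the data lives: the same call works whether the profile comes from the storage module or a cloud server.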
Step S370: and displaying an image picture containing the face and the data of the speaker according to the received display instruction.
In this embodiment, after the control module obtains the data of the speaker according to the identity of the speaker, it can display an image picture containing the face of the speaker, together with the data, on the display module according to the received display instruction. The participants of the video conference can thus watch the face image of the speaker while learning about the speaker's background, which can improve their conference experience.
In the scheme provided by this application, media data of the video conference is acquired, the media data including video data and audio data; first identity information is obtained according to the video data, and second identity information is obtained according to the audio data; the speaker of the video conference is determined according to the first identity information and the second identity information; and an image picture containing the face of the speaker is displayed according to the received display instruction. Because the speaker is determined by sound recognition and video recognition at the same time, the adverse effect of environmental noise on the recognition process is reduced, and the speaker is identified with high accuracy. The recognition accuracy of the speaker during the video conference, and in turn the display accuracy of the speaker's image picture, can therefore be improved.
Furthermore, the identity of the speaker of the video conference can be determined according to the first identity information and the second identity information, the data of the speaker can be obtained according to that identity, and an image picture containing the face of the speaker together with the data can be displayed according to the received display instruction. The participants of the video conference can thus watch the face image of the speaker while learning about the speaker's background, improving their conference experience.
Referring to fig. 4, a flowchart of a method for controlling a video conference display according to another embodiment of the present application is shown; the method is applied to the video conference system 100 shown in fig. 1. The flowchart shown in fig. 4 will be described in detail below, taking the video conference system 100 as an example. The method may include the following steps S410 to S470.
Step S410: media data for the video conference is obtained.
Step S420: first identity information is obtained according to the video data.
Step S430: and acquiring second identity information according to the audio data.
In this embodiment, step S410, step S420 and step S430 may refer to the steps corresponding to the foregoing embodiments, and are not described herein again.
Step S440: and acquiring a first confidence corresponding to the first identity according to the first identity information.
The confidence may also be referred to as reliability, confidence level or confidence coefficient. When an overall parameter is estimated by sampling, the result of the estimation is always uncertain because of the randomness of the sample. A probabilistic statement method is therefore adopted, namely the interval estimation method in mathematical statistics: the probability that the estimated value lies within a certain allowable error range of the overall parameter is computed, and this probability is referred to as the confidence.
In this embodiment, the first identity information may include facial feature information corresponding to the speaker, and may be used to represent the first identity. The control module may obtain a first confidence of the first identity represented by the first identity information according to the facial feature information corresponding to the speaker, so as to determine the speaker from the participants according to the first confidence, which can improve the recognition accuracy of the speaker. For example, according to the facial feature information corresponding to the speaker, the control module may obtain a confidence of 85% that the first identity is participant A, 30% that it is participant B, 10% that it is participant C, and so on.
In some embodiments, the control module may match the facial feature information corresponding to the speaker with a preset information base, where the information base may include the identities of a plurality of users and their facial features. A plurality of first matching degrees may be obtained according to the comparison results between the facial feature information and the facial features of the users, the first matching degrees corresponding one to one to the identities of the users. The first matching degree with the largest value may then be selected as the first target matching degree, the identity of the user corresponding to the first target matching degree may be determined as the first identity, and the first target matching degree may be used as the first confidence corresponding to the first identity.
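The matching procedure above can be sketched as follows. The function names and the cosine-similarity metric are assumptions for illustration; the original does not specify how a matching degree is computed, only that the largest one is taken as the identity and its confidence.

```python
def cosine(a, b):
    # A simple matching-degree metric between two feature vectors (an assumption;
    # any similarity in [0, 1] would fit the scheme).
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def match_identity(feature, info_base, similarity=cosine):
    # info_base maps user identity -> stored feature (facial features here;
    # the same routine fits the voiceprint library of step S450).
    # Returns (best identity, best matching degree used as the confidence).
    matches = {uid: similarity(feature, stored) for uid, stored in info_base.items()}
    best_id = max(matches, key=matches.get)
    return best_id, matches[best_id]
```

Because the routine only ranks matching degrees, the second obtaining unit of step S450 could reuse it unchanged with a voiceprint library in place of the information base.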
Step S450: and acquiring a second confidence corresponding to the second identity according to the second identity information.
In this embodiment, the second identity information may include voiceprint feature information, and may be used to represent the second identity. The control module may obtain, according to the voiceprint feature information, a second confidence of the second identity represented by the second identity information, so as to determine the speaker from the participants according to the second confidence, which can improve the recognition accuracy of the speaker. For example, according to the voiceprint feature information corresponding to the speaker, the control module may obtain a confidence of 40% that the second identity is participant A, 90% that it is participant B, 15% that it is participant C, and so on.
In some embodiments, the control module may match the voiceprint feature information with a preset voiceprint library, where the voiceprint library includes the identities of a plurality of users and their voiceprint features. A plurality of second matching degrees may be obtained according to the comparison results between the voiceprint feature information and the voiceprint features of the users, the second matching degrees corresponding one to one to the identities of the users. The second matching degree with the largest value may then be selected as the second target matching degree, the identity of the user corresponding to the second target matching degree may be determined as the second identity, and the second target matching degree may be used as the second confidence corresponding to the second identity.
Step S460: and when the first identity is the same as the second identity and the first confidence coefficient and the second confidence coefficient are both greater than or equal to the preset confidence coefficient, determining the identity of the speaker as the first identity.
In this embodiment, when the acquired first identity is the same as the acquired second identity, the first confidence is greater than or equal to the preset confidence, and the second confidence is greater than or equal to the preset confidence, the control module may determine the identity of the speaker to be the first identity (equivalently, the second identity). The identity of the speaker of the video conference is thus recognized according to both the identity and the confidence, which can improve the recognition accuracy of the speaker's identity.
For example, suppose the preset confidence is 80%, and the control module obtains a first confidence of 40% that the first identity is participant A, 85% that it is participant B, and 15% that it is participant C, and a second confidence of 50% that the second identity is participant A, 95% that it is participant B, and 20% that it is participant C. Then the first identity and the second identity are both participant B; the first confidence corresponding to participant B is 85%, which is greater than the preset confidence of 80%, and the second confidence corresponding to participant B is 95%, which is also greater than 80%. The identity of the speaker is therefore determined to be the first identity, corresponding to participant B.
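The decision rule of step S460 can be sketched as follows; the function name and the `None` return for the undecided case are assumptions (the original does not say what happens when the two recognitions disagree or a confidence falls short).

```python
def resolve_speaker(first_identity, first_conf,
                    second_identity, second_conf, preset_conf=0.80):
    # Accept an identity only when face recognition and voiceprint recognition
    # agree AND both confidences reach the preset confidence; otherwise undecided.
    if (first_identity == second_identity
            and first_conf >= preset_conf
            and second_conf >= preset_conf):
        return first_identity
    return None
```

With the numbers from the example, participant B (0.85 and 0.95, both above 0.80) is accepted, while a disagreement between the two recognitions yields no decision.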
Step S470: an image screen including the face of the speaker is displayed in accordance with the received display instruction.
In this embodiment, step S470 may refer to the steps corresponding to the foregoing embodiments, and will not be described herein again.
In some embodiments, there may be a plurality of speakers. The control module may adjust the display sizes of the image pictures of the speakers' faces according to those image pictures and preset display sizes, so that the participants can view the image pictures of all the speakers, which can improve their conference experience in the video conference.
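One simple way to realize the size adjustment is to scale each speaker's frame to fit within the preset display size while preserving its aspect ratio. This is a sketch under that assumption; the original does not specify the scaling policy.

```python
def fit_frames(frame_sizes, max_w, max_h):
    # frame_sizes: list of (width, height) of each speaker's image picture.
    # Each frame is scaled down (never up) to fit the preset display size.
    out = []
    for w, h in frame_sizes:
        scale = min(max_w / w, max_h / h, 1.0)
        out.append((round(w * scale), round(h * scale)))
    return out
```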
In some embodiments, the display instruction carries designated display area information and display personnel information. The display area information may include shape information and/or position information of a predetermined display area, and so on. When receiving the display instruction, the control module may determine a predetermined display area on the display module according to the display area information; the predetermined display area is used to display the image pictures of a plurality of speakers. For example, the predetermined display area may be at a fixed location on the display module (e.g., display screen), such as the upper left corner, the upper right corner, the upper edge, or the lower edge. The predetermined display area may be represented by the coordinates of pixel positions on the display screen; for example, when the predetermined display area is a rectangle, the display area information may include the pixel coordinates of the four corners of the rectangle, by which the display screen can define the area. The shape, size and position of the predetermined display area can be specified by a user; when the user sends a control instruction, the corresponding display instruction can carry the shape, size and position specified by the user.
In some embodiments, the predetermined display area may be an area fixed at a certain position on the display screen, e.g., defined by relative coordinates. In other embodiments, the predetermined display area may vary depending on the current display content of the display screen. For example, when there is a large blank area in the currently displayed content, the control module may determine the shape, size, position, etc. of the predetermined display area according to that blank area. On this basis, when the control module receives the display instruction, determining the predetermined display area on the display module according to the display area information and the current display content may include: obtaining the current display content; determining an idle area in the current display content according to the color parameter differences between adjacent pixels, where an area containing a specified number of adjacent pixels whose color parameter differences are all smaller than a preset value is determined to be an idle area; and taking the idle area as the predetermined display area. In this way the display position of the speaker's image information can be set flexibly, and the image information is prevented from blocking the displayed content.
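The idle-area test — a run of adjacent pixels whose color parameter differences all stay below the preset value — can be sketched in one dimension as follows. Scanning a single row of pixel values and the specific function name are simplifying assumptions; the disclosed method applies the same criterion to the display content generally.

```python
def find_idle_run(row, diff_threshold=5, min_pixels=4):
    # row: color parameter values of one row of pixels (e.g. grayscale levels).
    # A run of at least `min_pixels` adjacent pixels whose pairwise color
    # difference stays below `diff_threshold` counts as idle (blank).
    # Returns (start, end) indices of the first such run, or None.
    start = 0
    for i in range(1, len(row) + 1):
        if i == len(row) or abs(row[i] - row[i - 1]) >= diff_threshold:
            if i - start >= min_pixels:
                return (start, i - 1)
            start = i
    return None
```

A region of near-uniform color (e.g. an empty margin of a shared slide) yields a long run and is usable as the predetermined display area, whereas busy content with large pixel-to-pixel differences yields none.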
The display personnel information may include identity information and/or data information of the speakers to be displayed, and so on. When the control module receives the display instruction, it further determines the speakers to be displayed among the plurality of speakers according to the display personnel information, for example one or more of them. The display personnel information can be specified by a user; when the user sends a control instruction, the corresponding display instruction can carry the display personnel specified by the user.
Referring to fig. 5, in some specific application scenarios, the display module may include a display screen. During the video conference, the user may click the "speaker view" icon on the display screen; in response to the user's click operation, the display screen may send a preset display instruction to the control module, and in response to the received display instruction, the control module may display an image picture containing the face of the speaker on the display screen. This completes the operation of calling up the speaker's face image, so that the user can clearly view the speaker's facial expressions and movements in real time, which can improve the user's conference experience.
In other specific application scenarios, during the video conference, the control module may display an image picture including the face of the speaker on the display screen in a default display manner according to the received display instruction, where the default display manner indicates a display mode with the visual effect of a face close-up. The control module can also adaptively adjust the currently displayed face image according to the user's operations on it: for example, it can display the face image at the corresponding position on the display screen as the user freely drags the image, and it can display the face image enlarged or reduced according to the user's zoom operations on it.
In some specific application scenarios, after the user calls up the face portrait of the speaker during the video conference, the user can click the "speaker view" icon on the display screen again. In response to this second click operation, the display screen may send a hiding instruction to the control module, and in response to the received hiding instruction, the control module may hide the image picture containing the face of the speaker on the display screen. This completes the operation of hiding the speaker's face image, so that the user can concentrate on watching the presentation, which can further improve the user's conference experience. The hiding instruction indicates a video conference display mode in which the image picture including the face of the speaker is hidden.
In some embodiments, the control module determines from the acquired video data and audio data that the video conference has a plurality of speakers. As shown in fig. 6, when there are two speakers, the control module may, according to a preset rule, either display the image pictures of the speakers on the display module simultaneously or display only the image picture of the speaker with the largest volume, which improves the flexibility of the video conference scene display. The preset rule indicates the display mode of the image pictures of the multiple speakers.
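The two display modes of the preset rule can be sketched as follows; the rule names "all" and "loudest" and the tuple layout are assumptions for illustration.

```python
def frames_to_display(speakers, rule="all"):
    # speakers: list of (speaker_id, volume, frame) tuples.
    # "all":    show every speaker's image picture simultaneously.
    # "loudest": show only the image picture of the speaker with the largest volume.
    if rule == "loudest":
        return [max(speakers, key=lambda s: s[1])[2]]
    return [s[2] for s in speakers]
```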
In some embodiments, the control module may further determine that the current speaker of the video conference has changed according to the acquired current video data and the acquired current audio data, and may display an image frame including a face of the current speaker on the display module to complete automatic updating of the image frame of the speaker.
In the scheme provided by this application, media data of the video conference is acquired, the media data including video data and audio data; first identity information is obtained according to the video data, and second identity information is obtained according to the audio data; the speaker of the video conference is determined according to the first identity information and the second identity information; and an image picture containing the face of the speaker is displayed according to the received display instruction. Because the speaker is determined by sound recognition and video recognition at the same time, the adverse effect of environmental noise on the recognition process is reduced, and the speaker is identified with high accuracy. The recognition accuracy of the speaker during the video conference, and in turn the display accuracy of the speaker's image picture, can therefore be improved.
Furthermore, a first confidence corresponding to the first identity can be obtained according to the first identity information, a second confidence corresponding to the second identity can be obtained according to the second identity information, and the identity of the speaker of the video conference is determined according to the first identity, the second identity, the first confidence, the second confidence and the preset confidence. The identity of the speaker is thus recognized according to both the identity and the confidence, which further improves the recognition accuracy of the speaker's identity.
Referring to fig. 7, which shows a schematic block diagram of a control apparatus for a video conference display according to another embodiment of the present application, the control apparatus 500 for a video conference display is applied to the video conference system 100 shown in fig. 1. The control apparatus 500 shown in fig. 7 will be described in detail below, taking the video conference system 100 as an example. The control apparatus 500 may include a first obtaining module 510, a second obtaining module 520, a third obtaining module 530, a determining module 540, and a display module 550.
The first obtaining module 510 may be configured to obtain media data of a video conference, where the media data includes video data and audio data; the second obtaining module 520 may be configured to obtain the first identity information according to the video data; the third obtaining module 530 may be configured to obtain the second identity information according to the audio data; the determining module 540 may be configured to determine a speaker of the video conference according to the first identity information and the second identity information; the display module 550 may be configured to display an image frame including the face of the speaker according to the received display instruction.
In some embodiments, the second obtaining module 520 may include a first determining unit and a second determining unit.
The first determining unit may be configured to determine a speaker in an image picture of a participant according to the video data; the second determination unit may be configured to determine first identity information from a face image of the speaker, the first identity information including facial feature information corresponding to the speaker.
In some embodiments, the first determination unit may include an extraction subunit, an acquisition subunit, and a first determination subunit.
The extracting subunit may be configured to extract at least one face image in the video data, where each face image corresponds to a face of a participant; the acquiring subunit may be configured to acquire lip feature information of each face image respectively; the first determining subunit may be configured to determine the speaker in the image pictures of the conference participants according to the change value of the lip feature information in the image pictures of the adjacent frames and a preset change threshold.
In some embodiments, the first determining subunit may be further configured to select, from the image pictures of the conference participants, the conference participant corresponding to the face image with the change value of the lip feature information being greater than or equal to the preset change threshold as the speaker.
In some embodiments, the determining module 540 may include a first obtaining unit, a second obtaining unit, and a third determining unit.
The first obtaining unit may be configured to obtain a first confidence corresponding to the first identity according to the first identity information; the second obtaining unit may be configured to obtain, according to the second identity information, a second confidence corresponding to the second identity; the third determining unit may be configured to determine the identity of the speaker as the first identity when the first identity is the same as the second identity and the first confidence and the second confidence are both greater than or equal to a preset confidence.
In some embodiments, the first identity information may include facial feature information, and the first obtaining unit may include a first matching subunit, a first obtaining subunit, a first selecting subunit, a second determining subunit, and a first characterizing subunit.
The first matching subunit may be configured to match the facial feature information with a preset information base, where the information base includes identities of multiple users and facial features of the multiple users; the first obtaining subunit is configured to obtain, according to a comparison result between the facial feature information and facial features of the multiple users, multiple first matching degrees, where the multiple first matching degrees correspond to identities of the multiple users one to one; the first selecting subunit may be configured to select, as the first target matching degree, the first matching degree with the largest matching degree from the plurality of first matching degrees; the second determining subunit may be configured to determine that the identity of the user corresponding to the first target matching degree is the first identity; the first characterization subunit may be configured to use the first target matching degree as a first confidence corresponding to the first identity.
In some embodiments, the second identity information may include voiceprint feature information, and the second obtaining unit may include a second matching subunit, a second obtaining subunit, a second selecting subunit, a third determining subunit, and a second characterizing subunit.
The second matching subunit may be configured to match the voiceprint feature information with a preset voiceprint library, where the voiceprint library includes identities of multiple users and voiceprint features of the multiple users; the second obtaining subunit is configured to obtain a plurality of second matching degrees according to a comparison result between the voiceprint feature information and voiceprint features of the plurality of users, where the plurality of second matching degrees correspond to identities of the plurality of users one to one; the second selecting subunit may be configured to select, as the second target matching degree, the second matching degree with the largest matching degree from the plurality of second matching degrees; the third determining subunit may be configured to determine that the identity of the user corresponding to the second target matching degree is the second identity; the second characterization subunit may be configured to use the second target matching degree as a second confidence corresponding to the second identity.
In some embodiments, the determining module 540 may further include a fourth determining unit.
The fourth determining unit may be configured to determine the identity of the speaker of the video conference according to the first identity represented by the first identity information and the second identity represented by the second identity information.
In some embodiments, the display module 550 may include a third acquisition unit and a display unit.
The third obtaining unit may be configured to obtain the data of the speaker according to the identity of the speaker; the display unit may be configured to display an image picture including the face and the data of the speaker according to the received display instruction.
In some embodiments, there may be a plurality of speakers, and the control device 500 for video conference display may further include an adjustment module.
The adjusting module may be configured to adjust the display size of each face image picture according to the face image pictures of the plurality of speakers and a preset display size.
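The resizing step can be sketched as follows, under the assumption that the preset display size is a single region tiled equally among the speakers and that aspect ratios are preserved (the embodiment does not specify a particular layout):

```python
def adjust_display_sizes(face_frames, preset_size):
    """Scale each speaker's face picture to fit the preset display size.

    face_frames: list of (width, height) of the speakers' face pictures.
    preset_size: (width, height) of the whole speaker display region.
    Equal horizontal tiling is an illustrative assumption.
    """
    n = len(face_frames)
    if n == 0:
        return []
    slot_w = preset_size[0] / n   # each speaker gets an equal slot
    slot_h = preset_size[1]
    sizes = []
    for w, h in face_frames:
        scale = min(slot_w / w, slot_h / h)  # fit slot, keep aspect ratio
        sizes.append((round(w * scale), round(h * scale)))
    return sizes
```

For example, two faces of 200×100 and 100×100 pixels in a 400×200 region each receive a 200×200 slot and are scaled to fit it.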
According to the solution provided by the present application, media data of a video conference are acquired, the media data including video data and audio data; first identity information is acquired from the video data and second identity information from the audio data; the speaker of the video conference is determined according to the first identity information and the second identity information; and an image picture containing the face of the speaker is displayed according to a received display instruction. Because the speaker is identified from sound and video simultaneously, the adverse effect of environmental noise on either recognition channel alone is suppressed. The speaker is therefore identified with high accuracy, which in turn improves the accuracy with which the speaker's image picture is displayed during the video conference.
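The application's decision rule — accept a speaker only when the face-based and voiceprint-based identities agree and both confidences reach the preset confidence — can be sketched as below. The threshold value 0.8 is purely illustrative; the application leaves the preset confidence unspecified:

```python
def determine_speaker(first_identity, first_conf,
                      second_identity, second_conf,
                      threshold=0.8):
    """Fuse the video-based and audio-based identifications.

    The speaker is confirmed only when both channels name the same
    identity and both confidences are at least the preset confidence,
    so that noise in either single channel cannot force a wrong result.
    """
    if (first_identity == second_identity
            and first_conf >= threshold
            and second_conf >= threshold):
        return first_identity
    return None  # no reliable speaker determined this round
```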
Referring to fig. 8, which shows a functional block diagram of an electronic device 600 provided by an embodiment of the present application, the electronic device 600 may include one or more of the following components: a memory 610, a processor 620, and one or more application programs, where the one or more application programs may be stored in the memory 610 and configured to be executed by the one or more processors 620 to perform the methods described in the foregoing method embodiments.
The memory 610 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 610 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 610 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (e.g., acquiring media data, acquiring first identity information, acquiring second identity information, determining a speaker, receiving a display instruction, displaying an image picture, etc.), instructions for implementing the method embodiments described above, and the like. The data storage area may store data created by the electronic device 600 during use (e.g., video data, audio data, first identity information, second identity information), and so on.
The processor 620 may include one or more processing cores. The processor 620 connects various parts of the electronic device 600 using various interfaces and circuits, and performs the various functions of the electronic device 600 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 610 and by invoking the data stored in the memory 610. Optionally, the processor 620 may be implemented in hardware using at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 620 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like, where the CPU mainly handles the operating system, the user interface, the application programs, and the like; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It can be understood that the modem may also not be integrated into the processor 620 and may instead be implemented by a separate communication chip.
Referring to fig. 9, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer-readable storage medium 700 stores program code 710 that can be invoked by a processor to perform the methods described in the method embodiments above.
The computer-readable storage medium 700 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM (erasable programmable read-only memory), a hard disk, or a ROM. Optionally, the computer-readable storage medium 700 includes a non-volatile computer-readable storage medium. The computer-readable storage medium 700 has storage space for the program code 710 that performs any of the method steps described above. The program code may be embodied in one or more computer program products, and the program code 710 may, for example, be compressed in a suitable form.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications and replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (12)

1. A method for controlling a video conference display, comprising:
acquiring media data of a video conference, wherein the media data comprises video data and audio data;
acquiring first identity information according to the video data;
acquiring second identity information according to the audio data;
determining a speaker of the video conference according to the first identity information and the second identity information; and
displaying an image picture containing the face of the speaker according to the received display instruction.
2. The method of claim 1, wherein the obtaining the first identity information according to the video data comprises:
determining a speaker in the image picture of the participants according to the video data; and
determining the first identity information according to the face image of the speaker, wherein the first identity information comprises facial feature information corresponding to the speaker.
3. The control method according to claim 2, wherein the determining a speaker in the image picture of the participants according to the video data comprises:
extracting at least one face image in the video data, wherein each face image corresponds to the face of one participant;
respectively acquiring lip feature information of each face image; and
determining the speaker in the image pictures of the participants according to the change value of the lip feature information in the image pictures of adjacent frames and a preset change threshold value.
4. The control method according to claim 3, wherein the determining the speaker in the image pictures of the conference participants according to the change value of the lip feature information in the image pictures of the adjacent frames and a preset change threshold value comprises:
selecting, from the image pictures of the conference participants, the conference participant corresponding to the face image whose lip feature information change value is greater than or equal to the preset change threshold value as the speaker.
5. The method of claim 1, wherein the determining the speaker of the video conference based on the first identity information and the second identity information comprises:
acquiring a first confidence corresponding to the first identity according to the first identity information;
acquiring a second confidence corresponding to the second identity according to the second identity information; and
determining the identity of the speaker as the first identity when the first identity is the same as the second identity and the first confidence and the second confidence are both greater than or equal to a preset confidence.
6. The control method of claim 5, wherein the first identity information includes facial feature information, and the obtaining a first confidence level corresponding to the first identity based on the first identity information comprises:
matching the facial feature information with a preset information base, wherein the information base comprises identities of a plurality of users and facial features of the users;
obtaining a plurality of first matching degrees according to the comparison result of the facial feature information and the facial features of the users, wherein the first matching degrees correspond to the identities of the users one by one;
selecting the first matching degree with the maximum matching degree from the plurality of first matching degrees as a first target matching degree;
determining the identity of the user corresponding to the first target matching degree as a first identity; and
taking the first target matching degree as the first confidence corresponding to the first identity.
7. The control method according to claim 5, wherein the second identity information includes voiceprint feature information, and the obtaining a second confidence corresponding to the second identity according to the second identity information includes:
matching the voiceprint characteristic information with a preset voiceprint library, wherein the voiceprint library comprises identities of a plurality of users and voiceprint characteristics of the plurality of users;
obtaining a plurality of second matching degrees according to the comparison result of the voiceprint feature information and the voiceprint features of the users, wherein the second matching degrees correspond to the identities of the users one by one;
selecting the second matching degree with the maximum matching degree from the plurality of second matching degrees as a second target matching degree;
determining the identity of the user corresponding to the second target matching degree as a second identity; and
taking the second target matching degree as the second confidence corresponding to the second identity.
8. The method of claim 1, wherein the determining the speaker of the video conference based on the first identity information and the second identity information comprises:
determining the identity of a speaker of the video conference according to a first identity represented by the first identity information and a second identity represented by the second identity information;
the displaying an image including the face of the speaker according to the received display instruction includes:
obtaining the speaker's data according to the speaker's identity, and
displaying an image picture containing the face of the speaker and the data according to the received display instruction.
9. The method according to any one of claims 1 to 8, wherein there are a plurality of speakers, and the method further comprises:
adjusting the display size of each face image picture according to the face image pictures of the plurality of speakers and a preset display size.
10. A control apparatus for a video conference display, comprising:
the first acquisition module is used for acquiring media data of a video conference, wherein the media data comprises video data and audio data;
the second acquisition module is used for acquiring first identity information according to the video data;
the third acquisition module is used for acquiring second identity information according to the audio data;
the determining module is used for determining a speaker of the video conference according to the first identity information and the second identity information; and
the display module is used for displaying an image picture containing the face of the speaker according to the received display instruction.
11. An electronic device, comprising:
a memory;
one or more processors coupled with the memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more application programs being configured to perform the control method according to any one of claims 1 to 9.
12. A computer-readable storage medium, in which a program code is stored, the program code being called by a processor to execute the control method according to any one of claims 1 to 9.
CN202110215105.9A 2021-02-25 2021-02-25 Control method and device for video conference display, electronic equipment and storage medium Pending CN113014857A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110215105.9A CN113014857A (en) 2021-02-25 2021-02-25 Control method and device for video conference display, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN113014857A true CN113014857A (en) 2021-06-22

Family

ID=76386201

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110215105.9A Pending CN113014857A (en) 2021-02-25 2021-02-25 Control method and device for video conference display, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113014857A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003230049A (en) * 2002-02-06 2003-08-15 Sharp Corp Camera control method, camera controller and video conference system
CN101540873A (en) * 2009-05-07 2009-09-23 深圳华为通信技术有限公司 Method, device and system for prompting spokesman information in video conference
JP2014165565A (en) * 2013-02-22 2014-09-08 Hitachi Ltd Television conference device, system and method
CN105915798A (en) * 2016-06-02 2016-08-31 北京小米移动软件有限公司 Camera control method in video conference and control device thereof
CN107333090A (en) * 2016-04-29 2017-11-07 中国电信股份有限公司 Videoconference data processing method and platform


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022228529A1 (en) * 2021-04-30 2022-11-03 厦门盈趣科技股份有限公司 Video conference control method and system, and mobile terminal and storage medium
CN113741765A (en) * 2021-09-22 2021-12-03 北京字跳网络技术有限公司 Page jump method, device, equipment, storage medium and program product
CN113741765B (en) * 2021-09-22 2023-03-10 北京字跳网络技术有限公司 Page jump method, device, equipment, storage medium and program product
CN113794814A (en) * 2021-11-16 2021-12-14 珠海视熙科技有限公司 Method, device and storage medium for controlling video image output
CN113794814B (en) * 2021-11-16 2022-02-08 珠海视熙科技有限公司 Method, device and storage medium for controlling video image output
CN114676282A (en) * 2022-04-11 2022-06-28 北京女娲补天科技信息技术有限公司 Event entry method and device based on audio and video data and computer equipment
CN114676282B (en) * 2022-04-11 2023-02-03 北京女娲补天科技信息技术有限公司 Event entry method and device based on audio and video data and computer equipment

Similar Documents

Publication Publication Date Title
US11727577B2 (en) Video background subtraction using depth
CN113014857A (en) Control method and device for video conference display, electronic equipment and storage medium
US20210235040A1 (en) Context based target framing in a teleconferencing environment
US20190222806A1 (en) Communication system and method
WO2018128996A1 (en) System and method for facilitating dynamic avatar based on real-time facial expression detection
CN110418095B (en) Virtual scene processing method and device, electronic equipment and storage medium
CN112148922A (en) Conference recording method, conference recording device, data processing device and readable storage medium
US11076091B1 (en) Image capturing assistant
CN111050023A (en) Video detection method and device, terminal equipment and storage medium
WO2016165614A1 (en) Method for expression recognition in instant video and electronic equipment
US11876842B2 (en) System and method for identifying active communicator
US20230368461A1 (en) Method and apparatus for processing action of virtual object, and storage medium
CN113840158B (en) Virtual image generation method, device, server and storage medium
CN107623830B (en) A kind of video call method and electronic equipment
CN110135349A (en) Recognition methods, device, equipment and storage medium
CN111144266A (en) Facial expression recognition method and device
CN111522524B (en) Presentation control method and device based on conference robot, storage medium and terminal
CN112669846A (en) Interactive system, method, device, electronic equipment and storage medium
CN112637692B (en) Interaction method, device and equipment
CN112287909A (en) Double-random in-vivo detection method for randomly generating detection points and interactive elements
CN111010526A (en) Interaction method and device in video communication
JP7266688B2 (en) USER INTERACTION METHODS, APPARATUS, APPARATUS AND MEDIUM
CN114220175A (en) Motion pattern recognition method, motion pattern recognition device, motion pattern recognition apparatus, motion pattern recognition medium, and motion pattern recognition product
CN113286082A (en) Target object tracking method and device, electronic equipment and storage medium
CN112804245A (en) Data transmission optimization method, device and system suitable for video transmission

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination