CN114125365A - Video conference method, device and readable storage medium - Google Patents

Video conference method, device and readable storage medium Download PDF

Info

Publication number
CN114125365A
Authority
CN
China
Prior art keywords
speaker
video image
identity information
speakers
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111413149.9A
Other languages
Chinese (zh)
Inventor
宿绍勋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BOE Technology Group Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd filed Critical BOE Technology Group Co Ltd
Priority to CN202111413149.9A priority Critical patent/CN114125365A/en
Publication of CN114125365A publication Critical patent/CN114125365A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H04N 7/15 Conference systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/06 Decision making techniques; Pattern matching strategies
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N 23/60 Control of cameras or camera modules
    • H04N 23/61 Control of cameras or camera modules based on recognised objects
    • H04N 23/611 Control of cameras or camera modules based on recognised objects where the recognised objects include parts of the human body

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Embodiments of the present application provide a video conference method, a video conference apparatus, and a readable storage medium, relating to the technical field of data processing. The method includes: acquiring a video image and a voice signal synchronized with the video image; determining the speaker from the acquired video image and/or voice signal; and annotating the speaker's identity in the video image. Because the speaker is identified based on both the speaker's voice signal and the video image containing the speaker, multiple speakers can be quickly determined in a multi-person video conference, and the recognition result is more accurate than that of methods that identify the speaker from the voice signal alone or from the video image alone.

Description

Video conference method, device and readable storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a video conference method, an apparatus, and a readable storage medium.
Background
With the rapid development of intelligent recognition technology, more and more scenarios require biometric recognition to identify a speaker. Voiceprint features are acoustic features extracted from the spectral characteristics of a speaker's voice signal; they can reflect the speaker's identity, so speakers are currently identified based on their voiceprint features.
However, a speaker's voice signal is affected by the environment, the speaker's physiological condition, and other factors, introducing various uncertainties into the signal and therefore into the voiceprint features, which in turn degrades the accuracy of speaker recognition. This problem is particularly acute in multi-person video conferences.
Disclosure of Invention
The present application provides a video conference method, a video conference apparatus, and a readable storage medium to solve the technical problem that a speaker cannot be quickly determined in a multi-person video conference.
In a first aspect, a video conference method is provided, and the method includes:
acquiring a video image and a voice signal synchronous with the video image;
determining a speaker according to the collected video image and/or voice signal;
and carrying out identity annotation on the speaker in the video image.
In one possible implementation, the determining a speaker according to the collected voice signal includes:
determining the voiceprint characteristics of the speaker according to the voice signal;
determining first identity information of the speaker according to the voiceprint characteristics;
searching the whole conference venue with a camera to determine the position of the speaker;
the identity labeling of the speaker in the video image comprises the following steps:
and determining related information of the speaker according to the first identity information and the position information corresponding to the position of the speaker, and carrying out identity marking on the speaker in the video image.
In another possible implementation, the method further includes:
when the conference starts, searching the whole conference venue with the camera to obtain the initial position of each participant;
wherein searching the whole conference venue with the camera to determine the position of the speaker includes:
searching for the speaker within a preset range near the initial position, the initial position being looked up according to the first identity information determined from the voiceprint characteristics;
and performing face recognition at the found initial position to determine the position of the speaker.
In another possible implementation manner, the determining a speaker according to the captured video image and the voice signal includes:
determining the voiceprint characteristics of the speaker according to the voice signal;
and when the first identity information of at least two speakers is determined according to the voiceprint characteristics, lip recognition is carried out on the video image to determine the speakers.
In another possible implementation manner, the lip recognition of the video image to determine the speaker includes:
performing lip recognition on the video image to obtain second identity information of at least two speakers;
based on the first identity information and the second identity information of at least two speakers, a corresponding speaker is determined.
In another possible implementation manner, the lip-recognizing the video image to obtain second identity information of at least two speakers includes:
cutting out the face in the video image by using a face model to obtain at least two face vectors;
inputting the at least two face vectors into a lip analysis model to obtain lip sequence vectors;
and performing behavior analysis on the lip-shaped sequence vector to acquire a face with speaking behavior and obtain second identity information of at least two speakers.
In another possible implementation manner, the determining the corresponding speaker based on the first identity information and the second identity information of the at least two speakers includes:
when the confidence degrees of the first identity information and the second identity information of one speaker accord with a preset confidence degree rule, obtaining the corresponding speaker, wherein the preset confidence degree rule comprises any one of the following items:
the confidence of the first identity information is greater than a first threshold, and the confidence of the second identity information is greater than a second threshold;
an average of the confidence level of the first identity information and the confidence level of the second identity information is greater than a third threshold.
In a second aspect, there is provided a video conferencing apparatus, the apparatus comprising:
the acquisition module is used for acquiring a video image and a voice signal synchronous with the video image;
the processing module is used for determining a speaker according to the video image and/or the voice signal; and is also used for carrying out identity marking on the speaker in the video image.
In a third aspect, an electronic device is provided, which includes:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors to perform the video conference method of the first aspect of the present application.
Optionally, the electronic device comprises a processor, a memory, and a communication interface that communicate with one another through a communication bus;
the memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the video conference method of the first aspect of the present application.
In a fourth aspect, a computer-readable storage medium is provided, which stores a computer program that, when executed, causes a processor to implement the video conference method of the first aspect of the present application.
In a fifth aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the method provided in the various alternative implementations of the first aspect described above.
The beneficial effects brought by the technical solution provided by the present application are as follows:
In the video conference method provided by the embodiments of the application, the speaker is identified based on both the speaker's voice signal and the video image containing the speaker, so that multiple speakers can be quickly determined in a multi-person video conference, and the recognition result is more accurate than that of methods that identify the speaker from the voice signal alone or from the video image alone.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of a video conference method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a video conference method according to another embodiment of the present application;
fig. 3 is a schematic flowchart of a video conference method according to another embodiment of the present application;
fig. 4 is a schematic flowchart of a video conference method according to another embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present application, and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
At present, the speaker is mostly identified based on the voiceprint characteristics of the speaker. Due to various uncertainties of the voice signal of the speaker, the voiceprint characteristics of the speaker also have various uncertainties, and therefore the accuracy of speaker recognition is affected.
Therefore, the present application is applied to a conference system to solve the problem that a speaker cannot be quickly determined in a multi-person video conference. Specifically, the video conference method provided by the embodiments of the application identifies the speaker based on both the speaker's voice signal and the video image containing the speaker, and its recognition result is more accurate than that of methods that identify the speaker from the voice signal alone or from the video image alone.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
An embodiment of the present application provides a video conference method, and as shown in fig. 1, the method includes:
s101, acquiring a video image and a voice signal synchronous with the video image;
s102, determining a speaker according to the collected video image and/or voice signal;
and S103, carrying out identity annotation on the speaker in the video image.
In this embodiment, after the multi-person video conference starts, a video image of the conference room and a voice signal synchronized with the video image may be collected in real time; the speaker (i.e., the participant who is currently speaking) is then determined from the collected video image and/or voice signal, and the determined speaker is identity-annotated in the video image.
It should be noted that, in this embodiment, when annotating the identity of the determined speaker in the video image, the related information of the corresponding speaker may be annotated at the speaker's position in the video image, where the related information of the speaker includes the information on the conference nameplate, for example, at least one of the participant's name, gender, title, organization, rank, and contact information.
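For illustration only, the annotation step can be as simple as drawing the nameplate information next to the speaker in the frame. The following is a minimal sketch assuming OpenCV is used for drawing; the box coordinates and the fields of `info` are placeholders, not data structures from this application:

```python
import cv2  # assumes OpenCV is available for drawing

def annotate_speaker(frame, box, info):
    """Draw a box around the speaker and label it with nameplate info.

    frame: BGR image (numpy array); box: (x, y, w, h) of the speaker;
    info: dict of nameplate fields, e.g. {"name": "...", "title": "..."}.
    """
    x, y, w, h = box
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    label = " | ".join(str(v) for v in info.values())
    cv2.putText(frame, label, (x, max(12, y - 8)),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return frame
```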
It should be further noted that, after the determined speaker has been identity-annotated in the video image, the video image may be sent to the conference devices of the participants for their use, for example, to generate a meeting summary; alternatively, the video image may be transmitted to a third party.
In the video conference method provided by the embodiments of the application, the speaker is identified based on both the speaker's voice signal and the video image containing the speaker, so that multiple speakers can be quickly determined in a multi-person video conference, and the recognition result is more accurate than that of methods that identify the speaker from the voice signal alone or from the video image alone.
In one possible implementation manner, the determining the speaker according to the collected voice signal in step S102 includes:
s1021, determining the voiceprint characteristics of the speaker according to the voice signal;
s1022, determining first identity information of the speaker according to the voiceprint characteristics;
s1023, searching in the whole conference by using a camera, and determining the position of the speaker;
step S103 may specifically include:
and determining related information of the speaker according to the first identity information and the position information corresponding to the position of the speaker, and carrying out identity marking on the speaker in the video image.
Specifically, in this embodiment, the voice signal can be acquired by real-time sound pickup through a microphone sound pickup array used by the participants, so that the speaker identity information can be recognized in real time by using the conference voice. Specifically, the voiceprint features can be extracted from the obtained voice signals through the voice model to obtain voiceprint vectors, and then the voiceprint vectors are input into the classification model to obtain the first identity information of the speaker.
It should be noted that, in this embodiment, the speech model may adopt a deep neural network (DNN) model, a convolutional neural network (CNN) model, or the like. The specific process of extracting the voiceprint features from the acquired voice signal to obtain the voiceprint vector, and of inputting the voiceprint vector into the classification model to obtain the first identity information of the speaker, can be implemented with existing related techniques and, for brevity, is not detailed here.
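As a rough, non-authoritative illustration of that pipeline shape, the sketch below uses PyTorch; `voiceprint_net`, `classifier`, and `id_table` are hypothetical stand-ins for the pretrained speech model, the classification model, and the identity database:

```python
import torch

def identify_speaker(waveform, voiceprint_net, classifier, id_table):
    """Voice signal -> voiceprint vector -> first identity information.

    waveform: 1-D tensor of audio samples from the microphone array;
    voiceprint_net: pretrained DNN/CNN embedding model (assumed);
    classifier: maps the voiceprint vector to per-speaker scores;
    id_table: maps class index -> speaker identity record.
    """
    with torch.no_grad():
        embedding = voiceprint_net(waveform.unsqueeze(0))    # voiceprint vector
        scores = torch.softmax(classifier(embedding), dim=-1)
    conf, idx = scores.max(dim=-1)
    # Return the identity together with its confidence, which the
    # preset confidence rules described later can consume.
    return id_table[int(idx)], float(conf)
```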
In another possible implementation, the method further includes:
s100, searching in the whole conference field by using a camera when the conference starts, and acquiring the initial position of each participant;
wherein, step S1023 may specifically include:
searching for the speaker within a preset range near the initial position, the initial position being looked up according to the first identity information determined from the voiceprint characteristics;
and performing face recognition at the found initial position to determine the position of the speaker.
Specifically, in this embodiment, before the conference starts, the camera may be controlled to scan the whole venue to obtain the initial positions of the participants, and the initial position of each participant, the first identity information generated from the voiceprint features, the second identity information generated from the face features, and the related information are stored in a database. When the voiceprint feature of the speaker is determined from the collected voice signal and the first identity information of the speaker is determined based on that feature, the corresponding initial position can be looked up in the database based on the first identity information, and face recognition is performed at the found initial position to determine the exact position of the speaker.
In one example, voiceprint recognition is performed on the conference voice signal acquired in real time to obtain the first identity information of the speaker, and it is determined that only one speaker is currently speaking. For this single-speaker scenario, the corresponding first position information (i.e., the initial position) can be looked up in the pre-stored database based on the first identity information, and a camera in the conference venue is then adjusted to capture, in focus mode, a first video image of the position corresponding to the first position information. If second identity information (identity information generated from a face recognition result) corresponding to the first position information in the pre-stored database is detected in the captured video image, the corresponding speaker is determined to be recognized.
In one possible implementation, if the second identity information corresponding to the first position information in the pre-stored database is not detected in the captured video image, the camera is controlled to scan the whole conference venue and capture a second video image. When the second identity information is detected in the second video image, the corresponding speaker is determined to be recognized and the second position information of the speaker is obtained; the first position information of the corresponding speaker in the pre-stored database may then be updated with the second position information.
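A minimal sketch of this focus-then-scan fallback logic follows; `camera` and `db` and all of their methods are hypothetical helpers standing in for the camera control and the pre-stored database:

```python
def locate_speaker(camera, db, first_identity):
    """Find the speaker's position: try the stored initial position,
    then fall back to scanning the whole conference venue."""
    initial_pos = db.lookup_position(first_identity)    # first position info
    frame = camera.capture_focused(initial_pos)         # focus mode
    if db.face_matches(frame, first_identity):          # second identity check
        return initial_pos
    # Speaker not at the stored position: scan the whole venue.
    for pos, frame in camera.scan_venue():              # wide sweep
        if db.face_matches(frame, first_identity):
            db.update_position(first_identity, pos)     # refresh stored position
            return pos
    return None  # speaker not visible anywhere in the venue
```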
In another possible implementation manner, the determining the speaker according to the captured video image and the voice signal in step S102 includes:
s1021, determining the voiceprint characteristics of the speaker according to the voice signal;
and S1024, when the first identity information of at least two speakers is determined according to the voiceprint characteristics, lip recognition is carried out on the video image, and the speakers are determined.
In this embodiment, after determining the voiceprint feature of the speaker according to the collected voice signal and determining the first identity information of more than one speaker based on the voiceprint feature, lip recognition may be performed on the collected video image to determine the corresponding speaker.
Specifically, in this embodiment, lip recognition is performed on the video image in step S1024 to determine the speaker, which includes:
s1024a, carrying out lip recognition on the video image to obtain second identity information of at least two speakers;
and S1024b, determining the corresponding speaker based on the first identity information and the second identity information of the at least two speakers.
That is to say, in this embodiment, voiceprint recognition is performed on the conference voice signal acquired in real time to obtain the first identity information of at least two speakers, and it is determined that a multi-person conversation is currently taking place. For this scenario, a camera in the conference venue may be adjusted to capture a video image of the whole venue in wide-angle mode, lip recognition is performed on the captured video image to obtain the second identity information of the at least two speakers, and the corresponding speakers are then determined based on the obtained first identity information and second identity information.
In some embodiments, step S1024a may specifically include:
cutting out the face in the video image by using a face model to obtain at least two face vectors;
inputting the at least two face vectors into a lip analysis model to obtain lip sequence vectors;
and performing behavior analysis on the lip-shaped sequence vector to acquire a face with speaking behavior and obtain second identity information of at least two speakers.
That is to say, in this embodiment, when lip recognition is performed on the captured video image, the face vector sequence obtained by cropping the faces out of the video image with the face model is used as the input of the lip analysis model, and behavior analysis is performed on the lip sequence vectors output by the lip analysis model to obtain the faces with speaking behavior and thus the second identity information of the speakers.
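The sketch below illustrates that pipeline shape only; `face_model`, `lip_model`, and `identity_of` are hypothetical pretrained components, and the thresholded mean is merely a stand-in for the behavior analysis step:

```python
import torch

def lip_based_speakers(frames, face_model, lip_model, identity_of, threshold=0.5):
    """Faces cropped per frame -> lip sequence vectors -> speaking faces.

    frames: video clip tensor; face_model: returns one cropped face track
    per detected face; lip_model: CNN-LSTM or 3D-ConvNet (assumed pretrained).
    """
    speakers = []
    with torch.no_grad():
        face_tracks = face_model(frames)          # {face_id: (T, C, H, W) track}
        for face_id, track in face_tracks.items():
            lip_seq = lip_model(track.unsqueeze(0))        # lip sequence vector
            speaking_prob = torch.sigmoid(lip_seq.mean())  # crude behavior analysis
            if float(speaking_prob) > threshold:
                # Second identity information for this speaking face.
                speakers.append((identity_of(face_id), float(speaking_prob)))
    return speakers
```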
Specifically, in this embodiment, the lip analysis model may employ a CNN-LSTM model or a 3D-ConvNet model. The CNN-LSTM model is an integrated model of a convolutional neural network (CNN) and a long short-term memory network (LSTM): the CNN part processes the data, and its one-dimensional results are fed into the LSTM. The 3D-ConvNet model is a convolutional neural network with 3D convolution kernels; its 3D convolution modules can be used to extract the temporal and spatial features of the video frames.
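For concreteness, a minimal CNN-LSTM of the kind described might look as follows; the layer sizes are illustrative choices, not values taken from this application:

```python
import torch.nn as nn

class LipCNNLSTM(nn.Module):
    """Per-frame CNN features fed to an LSTM over time (sizes illustrative)."""
    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(                 # processes each frame
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, feat_dim), nn.ReLU(),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)

    def forward(self, x):                         # x: (B, T, 3, H, W) face crops
        b, t = x.shape[:2]
        f = self.cnn(x.flatten(0, 1)).view(b, t, -1)  # one-dimensional per-frame features
        out, _ = self.lstm(f)                     # lip sequence vectors: (B, T, hidden)
        return out
```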
It should be noted that, in this embodiment, the specific process of cropping the faces out of the video image with the face model to obtain at least two face vectors, inputting the at least two face vectors into the lip analysis model to obtain lip sequence vectors, and analyzing the lip sequence vectors to obtain the faces with speaking behavior can be implemented with existing related techniques; beyond the illustrative sketches above, it is not described further here.
In some embodiments, step S1024b may specifically include:
when the confidence degrees of the first identity information and the second identity information of one speaker accord with a preset confidence degree rule, obtaining the corresponding speaker, wherein the preset confidence degree rule comprises any one of the following items:
the confidence of the first identity information is greater than a first threshold, and the confidence of the second identity information is greater than a second threshold;
an average of the confidence level of the first identity information and the confidence level of the second identity information is greater than a third threshold.
That is, in this embodiment, when a multi-person conversation is currently taking place, the corresponding speaker is determined according to whether the first identity information and the second identity information of the multiple speakers satisfy the preset confidence rule. A speaker can be confirmed if the confidence of the speaker's first identity information is greater than the first threshold and the confidence of the second identity information is greater than the second threshold; alternatively, a speaker can be confirmed if the average of the confidence of the first identity information and the confidence of the second identity information is greater than the third threshold. Identifying the speaker in this way makes the recognition result more accurate.
For example, for a speaker A: face recognition is performed n times while A is speaking, yielding confidences P1, P2, … Pn that the recognized face is A, from which the average face-recognition confidence P is calculated; meanwhile, voiceprint recognition is performed n times, yielding confidences Q1, Q2, … Qn that the recognized speaker is A, from which the average voiceprint confidence Q is calculated. Finally, if P is greater than 95% and Q is greater than 92%, the corresponding speaker is determined to be A; alternatively, if the mean of P and Q is greater than 90%, the corresponding speaker is determined to be A.
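Expressed directly, the rule in this example reduces to a few comparisons. In the sketch below, the thresholds 0.95, 0.92, and 0.90 are the figures from the example above, not mandated values:

```python
def confirm_speaker(face_confs, voice_confs, t1=0.95, t2=0.92, t3=0.90):
    """Preset confidence rule over n face and n voiceprint results.

    True if either: both averages clear their own thresholds, or the
    overall mean clears the third threshold.
    """
    p = sum(face_confs) / len(face_confs)    # average face-recognition confidence P
    q = sum(voice_confs) / len(voice_confs)  # average voiceprint confidence Q
    return (p > t1 and q > t2) or ((p + q) / 2 > t3)

# e.g. confirm_speaker([0.97, 0.96], [0.93, 0.94]) -> True
```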
The method provided by this embodiment can be applied, in a conference-machine video conference, to scenarios in which multiple participants appear in the same frame, that is, conference scenarios in which more than one person speaks at a time: a multi-person same-frame scenario or a multi-person conversation scenario. This is described in detail below with reference to figures 2 and 3.
The method shown in fig. 2 comprises:
s10, picking up conference voice signals in real time by using a microphone sound receiving array, and carrying out speaker voice print recognition;
s11, when the voiceprint identifies that more than one speaker exists at present, adjusting the camera, shooting a full-field video by adopting a wide-angle model, and performing face identification;
s12, recording the identity information of the participants and the position information of the positions in the database;
s13, detecting multiple speakers through lip changes in the video;
s14, recognizing the identity of the speaker through face recognition;
and S15, framing the speaker.
Specifically, in this embodiment, the speaker is recognized by using two modalities, i.e. video and speech, as shown in fig. 3:
s10 is a speaker identity recognition process, which may specifically include:
s10a, extracting the vocal print characteristics through a voice model (such as DNN, CNN and the like) to obtain vocal print vectors.
And S10b, inputting the voiceprint vector into the classification model to obtain the identity of the speaker.
S13-S15 are speaker position detection processes, wherein S13 may specifically include:
and S13a, cutting out the human face in the image by using the human face model, and obtaining N human face vectors.
S13b, inputting the N face vectors into a lip analysis model (for example, CNN-LSTM, 3D-ConvNet) to obtain lip sequence vectors.
And S13c, performing behavior analysis according to the lip sequence vector to obtain the face with speaking behavior.
S15 specifically includes: and combining the results of voiceprint and lip recognition to obtain the speaker through a confidence rule model.
Thus, the identity and position of the speaker are obtained, and the information corresponding to the speaker's identity in the database is annotated at the speaker's position. At least one item of the following information from each participant's conference nameplate is recorded in the database: name, gender, title, organization, rank, contact information, etc.
In summary, the video conference method provided by the embodiments of the present application identifies the speaker based on both the speaker's voice signal and the video image containing the speaker, so that multiple speakers can be quickly determined in a multi-person video conference, and the recognition result is more accurate than that of methods that identify the speaker from the voice signal alone or from the video image alone.
The method in this embodiment can also be applied, in a conference-machine video conference, to a scenario in which only one person speaks at a time, that is, a single-speaker scenario, which is described in detail below with reference to fig. 4.
The method shown in fig. 4 includes:
s20, controlling the camera to scan the whole field and carrying out face recognition;
s21, recording the identity information of the participants and the position information of the positions in the database;
s22, a microphone receiving array is used for picking up conference voice signals in real time to carry out speaker voice print recognition;
s23, when a speaker is identified at present by the voiceprint, adjusting the camera, shooting a video at the position of the speaker by adopting a focusing model, and performing face identification;
s24, whether a human face is found in the video of the position of the human face, if not, executing S25, and if so, executing S26.
S25, controlling the camera to scan the whole field to perform face recognition;
and S26, framing the speaker.
In summary, the video conference method provided by the embodiments of the present application identifies the speaker based on both the speaker's voice signal and the video image containing the speaker, so that the speaker can be quickly determined in a multi-person video conference, and the recognition result is more accurate than that of methods that identify the speaker from the voice signal alone or from the video image alone.
It should be understood that the method in each of the above embodiments may be executed by a conference machine, a conference whiteboard, an intelligent conference all-in-one machine, a terminal of an intelligent conference system, and the like; this is not limited in the embodiments of the present application.
The technical solution of the video conference method provided by the embodiment of the present application is described above with reference to fig. 1 to 4, and a video conference device and equipment provided by the embodiment of the present application will be described below.
Based on the same inventive concept, an embodiment of the present application further provides a video conference apparatus, including: an acquisition module and a processing module, wherein,
the acquisition module is used for acquiring a video image and a voice signal synchronous with the video image;
the processing module is used for determining a speaker according to the video image and/or the voice signal; and is also used for carrying out identity marking on the speaker in the video image.
In some embodiments, the processing module, when determining the speaker from the collected speech signal, is specifically configured to:
determining the voiceprint characteristics of the speaker according to the voice signal;
determining first identity information of the speaker according to the voiceprint characteristics;
searching the whole conference venue with a camera to determine the position of the speaker;
when the processing module is used for carrying out identity marking on a speaker, the processing module is specifically used for:
and determining related information of the speaker according to the first identity information and the position information corresponding to the position of the speaker, and carrying out identity marking on the speaker in the video image.
In another embodiment, the obtaining module is further configured to:
when the conference starts, searching the whole conference venue with the camera to obtain the initial position of each participant;
when the processing module searches the whole conference venue with the camera to determine the position of the speaker, the processing module is specifically configured to:
search for the speaker within a preset range near the initial position, the initial position being looked up according to the first identity information determined from the voiceprint characteristics;
and perform face recognition at the found initial position to determine the position of the speaker.
In other embodiments, the processing module, when determining the speaker according to the captured video image and the speech signal, is specifically configured to:
determining the voiceprint characteristics of the speaker according to the voice signal;
and when the first identity information of at least two speakers is determined according to the voiceprint characteristics, lip recognition is carried out on the video image to determine the speakers.
In some embodiments, the processing module, when performing lip recognition on the video image and determining the speaker, is specifically configured to:
performing lip recognition on the video image to obtain second identity information of at least two speakers;
based on the first identity information and the second identity information of at least two speakers, a corresponding speaker is determined.
In some embodiments, the processing module is specifically configured to, when performing lip recognition on the video image to obtain the second identity information of the at least two speakers:
cutting out the face in the video image by using a face model to obtain at least two face vectors;
inputting the at least two face vectors into a lip analysis model to obtain lip sequence vectors;
and performing behavior analysis on the lip-shaped sequence vector to acquire a face with speaking behavior and obtain second identity information of at least two speakers.
In some embodiments, the processing module, when determining the corresponding speaker based on the first identity information and the second identity information of the at least two speakers, is specifically configured to:
when the confidence degrees of the first identity information and the second identity information of one speaker accord with a preset confidence degree rule, obtaining the corresponding speaker, wherein the preset confidence degree rule comprises any one of the following items:
the confidence of the first identity information is greater than a first threshold, and the confidence of the second identity information is greater than a second threshold;
an average of the confidence level of the first identity information and the confidence level of the second identity information is greater than a third threshold.
In the embodiments, when the identity of the speaker is labeled in the video image, the relevant information of the corresponding speaker is labeled at the position of the speaker determined in the video image, wherein the relevant information of the speaker includes the information on the conference nameplate.
For the content that is not described in detail in the apparatus provided in the embodiment of the present application, reference may be made to the method provided in the embodiment shown in fig. 1 to 4, and the beneficial effects that the apparatus provided in the embodiment of the present application can achieve are the same as the method provided in the embodiment shown in fig. 1 to 4, which are not described herein again.
Based on the same inventive concept, an embodiment of the present application further provides an electronic device, including a memory and a processor, with at least one program stored in the memory and executed by the processor. Compared with the prior art, in the video conference method provided by the embodiment of the application, the speaker is identified based on both the speaker's voice signal and the video image containing the speaker, so that multiple speakers can be quickly determined in a multi-person video conference, and the recognition result is more accurate than that of methods that identify the speaker from the voice signal alone or from the video image alone.
The electronic device in the embodiments of the present application may be a conference machine, a conference whiteboard, an intelligent conference all-in-one machine, a terminal of an intelligent conference system, or the like.
In an alternative embodiment, an electronic device is provided. As shown in fig. 5, the electronic device 4000 comprises a processor 4001 and a memory 4003, the processor 4001 being coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between this electronic device and other electronic devices, such as sending and/or receiving data. In practical applications the transceiver 4004 is not limited to one, and the structure of the electronic device 4000 does not limit the embodiments of the present application.
The Processor 4001 may be a CPU (Central Processing Unit), a general-purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other Programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 4001 may also be a combination that performs a computational function, including, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 5, but this is not intended to represent only one bus or type of bus.
The Memory 4003 may be a ROM (Read Only Memory) or other types of static storage devices that can store static information and instructions, a RAM (Random Access Memory) or other types of dynamic storage devices that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 4003 is used for storing application codes for executing the scheme of the present application, and the execution is controlled by the processor 4001. Processor 4001 is configured to execute application code stored in memory 4003 to implement what is shown in the foregoing method embodiments.
The present application provides a computer-readable storage medium storing a computer program which, when run on a computer, enables the computer to execute the corresponding content of the foregoing method embodiments. Compared with the prior art, identifying the speaker based on both the speaker's voice signal and the video image containing the speaker allows speakers to be quickly determined in a multi-person video conference, and the recognition result is more accurate than that of methods that identify the speaker from the voice signal alone or from the video image alone.
It should be noted that the division of the unit in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation. In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented as a software functional unit and sold or used as a stand-alone product, may be stored in a processor readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer-executable instructions. These computer-executable instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These processor-executable instructions may also be stored in a processor-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the processor-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These processor-executable instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not performed in the exact order shown and may be performed in other orders unless explicitly stated herein. Moreover, at least a portion of the steps in the flow chart of the figure may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, which are not necessarily performed in sequence, but may be performed alternately or alternately with other steps or at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also fall within the protection scope of the present invention.

Claims (10)

1. A video conferencing method, comprising:
acquiring a video image and a voice signal synchronous with the video image;
determining a speaker according to the collected video image and/or voice signal;
and carrying out identity annotation on the speaker in the video image.
2. The method of claim 1, wherein determining the speaker from the captured speech signal comprises:
determining the voiceprint characteristics of the speaker according to the voice signal;
determining first identity information of the speaker according to the voiceprint characteristics;
searching the whole conference venue with a camera to determine the position of the speaker;
the identity labeling of the speaker in the video image comprises the following steps:
and determining related information of the speaker according to the first identity information and the position information corresponding to the position of the speaker, and carrying out identity marking on the speaker in the video image.
3. The method of claim 2, further comprising:
when the conference starts, searching the whole conference venue with a camera to obtain the initial position of each participant;
wherein searching the whole conference venue with the camera to determine the position of the speaker comprises:
searching for the speaker within a preset range near the initial position, the initial position being looked up according to the first identity information determined from the voiceprint characteristics;
and performing face recognition at the found initial position to determine the position of the speaker.
4. The method of claim 1, wherein determining the speaker from the captured video image and the speech signal comprises:
determining the voiceprint characteristics of the speaker according to the voice signal;
and when the first identity information of at least two speakers is determined according to the voiceprint characteristics, lip recognition is carried out on the video image to determine the speakers.
5. The method of claim 4, wherein the lip recognizing the video image to determine the speaker comprises:
performing lip recognition on the video image to obtain second identity information of at least two speakers;
based on the first identity information and the second identity information of at least two speakers, a corresponding speaker is determined.
6. The method of claim 5, wherein lip recognizing the video image to obtain second identity information of at least two speakers comprises:
cutting out the face in the video image by using a face model to obtain at least two face vectors;
inputting the at least two face vectors into a lip analysis model to obtain lip sequence vectors;
and performing behavior analysis on the lip-shaped sequence vector to acquire a face with speaking behavior and obtain second identity information of at least two speakers.
7. The method of claim 5 or 6, wherein determining the corresponding speaker based on the first identity information and the second identity information of the at least two speakers comprises:
when the confidence degrees of the first identity information and the second identity information of one speaker accord with a preset confidence degree rule, obtaining the corresponding speaker, wherein the preset confidence degree rule comprises any one of the following items:
the confidence of the first identity information is greater than a first threshold, and the confidence of the second identity information is greater than a second threshold;
an average of the confidence level of the first identity information and the confidence level of the second identity information is greater than a third threshold.
8. A video conferencing apparatus, comprising:
the acquisition module is used for acquiring a video image and a voice signal synchronous with the video image;
the processing module is used for determining a speaker according to the video image and/or the voice signal; and is also used for carrying out identity marking on the speaker in the video image.
9. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to: performing the video conferencing method of any of claims 1 to 7.
10. A processor-readable storage medium, characterized in that the processor-readable storage medium stores a computer program for causing a processor to execute the video conferencing method of any one of claims 1 to 7.
CN202111413149.9A 2021-11-25 2021-11-25 Video conference method, device and readable storage medium Pending CN114125365A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111413149.9A CN114125365A (en) 2021-11-25 2021-11-25 Video conference method, device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111413149.9A CN114125365A (en) 2021-11-25 2021-11-25 Video conference method, device and readable storage medium

Publications (1)

Publication Number Publication Date
CN114125365A true CN114125365A (en) 2022-03-01

Family

ID=80373061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111413149.9A Pending CN114125365A (en) 2021-11-25 2021-11-25 Video conference method, device and readable storage medium

Country Status (1)

Country Link
CN (1) CN114125365A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116312552A (en) * 2023-05-19 2023-06-23 湖北微模式科技发展有限公司 Video speaker journaling method and system
CN116312552B (en) * 2023-05-19 2023-08-15 湖北微模式科技发展有限公司 Video speaker journaling method and system


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination