CN107333090B - Video conference data processing method and platform - Google Patents

Video conference data processing method and platform

Info

Publication number
CN107333090B
CN107333090B (application CN201610283899.1A)
Authority
CN
China
Prior art keywords
video
speaker
voiceprint
identity information
information
Prior art date
Legal status
Active
Application number
CN201610283899.1A
Other languages
Chinese (zh)
Other versions
CN107333090A (en)
Inventor
赵婧 (Zhao Jing)
曹宁 (Cao Ning)
徐晓微 (Xu Xiaowei)
Current Assignee
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN201610283899.1A
Publication of CN107333090A
Application granted
Publication of CN107333090B

Classifications

    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04N — PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 — Television systems
    • H04N 7/14 — Systems for two-way working
    • H04N 7/15 — Conference systems
    • H04N 7/157 — Conference systems defining a virtual conference space and using avatars or agents

Abstract

The invention provides a video conference data processing method and platform, relating to the technical field of video conferences. The method comprises: acquiring voiceprint information of a speaker; recognizing the voiceprint information to determine the speaker's identity information; and compositing the speaker's identity information with the video picture so that the composited picture can be displayed. With this method, the speaker's identity can be recognized from the participants' voices and shown to the participants in the video picture, making it easy for participants to identify who is speaking and improving the user experience of the video conference.

Description

Video conference data processing method and platform
Technical Field
The invention relates to the technical field of video conferences, in particular to a video conference data processing method and a video conference data processing platform.
Background
At present, video conferences generally distinguish participants by microphone activation: whichever sound source is active is assumed to belong to the current speaker.
However, in many cases, especially in large conferences, several people share one microphone, or several participants use the same sound source. Distinguishing sound sources alone then cannot identify the speaker, which greatly affects the conference: participants cannot match the speech content to the person speaking, a large gap opens between the video conference and an in-person meeting, and the user friendliness of the video conference drops sharply.
Disclosure of Invention
It is an object of the present invention to propose a solution that makes it easy for video conference users to identify the speaker.
According to one aspect of the present invention, a video conference data processing method is provided, comprising: acquiring voiceprint information of a speaker; recognizing the voiceprint information to determine the speaker's identity information; and compositing the speaker's identity information with the video picture so that the composited picture can be displayed.
Optionally, the video picture is a virtual reality video picture, and the video terminal is a virtual reality video display terminal.
Optionally, recognizing the voiceprint information and determining the identity information of the speaker comprises: performing feature matching according to the voiceprint information, and identifying the voiceprint features matched with the voiceprint information; and searching identity information of the speaker corresponding to the matched voiceprint characteristics.
Optionally, the video-synthesizing and displaying the identity information of the speaker with the video picture includes: carrying out video synthesis on the identity information of the speaker and the video picture; and sending the video picture after the video synthesis to a video terminal for display.
Optionally, the method further comprises: extracting voiceprint features of the participants' voices based on the recorded voices to generate a voiceprint library; and associating the voiceprint features of each participant's voice with that participant's identity information.
Optionally, the method further comprises: acquiring the facial features of the speaker according to the association between the voiceprint features and the facial features; locating the speaker in the video picture according to the speaker's facial features; and compositing the speaker's location marker with the video picture.
Optionally, the method further comprises: extracting facial features of the participants; facial features of the participant are associated with voiceprint features.
With this method, the speaker's identity can be recognized from the participants' voices and shown to the participants in the video picture, making it easy for participants to identify who is speaking and improving the user experience of the video conference.
According to another aspect of the present invention, there is provided a video conference platform comprising: the voiceprint information extraction module is used for acquiring the voiceprint information of the speaker; the identity information determining module is used for identifying the voiceprint information and determining the identity information of the speaker; and the video synthesis module is used for carrying out video synthesis on the identity information of the speaker and the video picture so as to display the video picture after video synthesis.
Optionally, the video picture is a virtual reality video picture, and the video terminal is a virtual reality video display terminal.
Optionally, the identity information determination module includes: the voiceprint matching unit is used for carrying out feature matching according to the voiceprint information and identifying matched voiceprint features; and the identity information acquisition unit is used for acquiring the identity information of the speaker corresponding to the matched voiceprint characteristics.
Optionally, the video composition module comprises: the video synthesis unit is used for carrying out video synthesis on the identity information of the speaker and the video picture; and the video sending unit is used for sending the video picture after the video synthesis to the video terminal for displaying.
Optionally, the method further comprises: the voiceprint feature extraction module is used for extracting voiceprint features of the voices of the participants based on the recorded voices of the participants to generate a voiceprint library; and the identity information correlation module is used for correlating the voiceprint characteristics of the sound of the participant with the identity information of the participant.
Optionally, the method further comprises: the facial feature acquisition module, used for acquiring the facial features of the speaker according to the association between the voiceprint features and the facial features; and the facial feature positioning module, used for locating the speaker in the video picture according to the speaker's facial features. The video synthesis module is also used for compositing the speaker's location marker with the video picture.
Optionally, the method further comprises: the facial feature extraction module is used for extracting facial features of the participants; and the facial feature association module is used for associating the facial features of the participants with the voiceprint features.
The platform can recognize the speaker's identity from the participants' voices and show it to the participants in the video picture, making it easy for participants to identify who is speaking and improving the user experience of the video conference.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a flowchart of an embodiment of a video conference data processing method according to the present invention.
Fig. 2 is a flowchart of another embodiment of a video conference data processing method according to the present invention.
Fig. 3 is a flowchart of a video conference data processing method according to another embodiment of the present invention.
Fig. 4 is a flowchart of a video conference data processing method according to still another embodiment of the present invention.
Fig. 5 is a schematic diagram of one embodiment of a video conferencing platform of the present invention.
Fig. 6 is a schematic diagram of another embodiment of a video conferencing platform of the present invention.
Fig. 7 is a schematic diagram of yet another embodiment of a video conferencing platform of the present invention.
Fig. 8 is a schematic diagram of yet another embodiment of a video conferencing platform of the present invention.
Detailed Description
The technical solution of the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
A flow diagram of one embodiment of a video conference data processing method of the present invention is shown in fig. 1.
In step 101, voiceprint information of a speaker is acquired. In one embodiment, the sound collected from the microphone may be subjected to audio data processing to obtain voiceprint information of the speaker.
In step 102, the voiceprint information is recognized and the identity information of the speaker is determined. In one embodiment, the speaker's voiceprint information can be feature-matched against the participants' voiceprint features to find the matching voiceprint feature, from which the identity information of the speaker can be determined.
In step 103, the identity information of the speaker is video-composited with the video frame. The synthesized video picture has identity information of the speaker. The video pictures after the synthesis processing can be sent to each terminal of the conference, so that the participants can know the identity information of the speaker while watching the video pictures.
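Steps 101 to 103 can be sketched as a minimal pipeline. Everything below is an illustrative assumption rather than the patent's actual implementation: the names (VOICEPRINT_DB, match_voiceprint, overlay_identity), the toy three-dimensional "voiceprint" vectors, and the participant names are all hypothetical, and a real system would use a proper speaker-recognition model.

```python
from typing import Dict, List, Optional

# Hypothetical enrolled "voiceprint library": participant name -> feature vector.
# Real voiceprints would be high-dimensional embeddings, not 3-vectors.
VOICEPRINT_DB: Dict[str, List[float]] = {
    "Zhang Wei": [0.9, 0.1, 0.3],
    "Li Na":     [0.2, 0.8, 0.5],
}

def match_voiceprint(features: List[float]) -> Optional[str]:
    """Step 102: nearest-neighbour match against enrolled voiceprints."""
    if not VOICEPRINT_DB:
        return None
    def dist(a: List[float], b: List[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(VOICEPRINT_DB, key=lambda name: dist(VOICEPRINT_DB[name], features))

def overlay_identity(frame: dict, identity: str) -> dict:
    """Step 103: composite the speaker's identity into the video picture
    (modelled here as attaching a caption to a frame dict)."""
    out = dict(frame)
    out["caption"] = identity
    return out

# Step 101 would extract `features` from the microphone audio; mocked here.
speaker = match_voiceprint([0.85, 0.15, 0.35])
print(overlay_identity({"pixels": "..."}, speaker)["caption"])  # -> Zhang Wei
```

The nearest-neighbour search stands in for whatever matching the platform actually performs; the point is only that recognition maps a live voiceprint to an enrolled identity before composition.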
With this method, the speaker's identity can be recognized from the participants' voices and shown to the participants in the video picture, making it easy for participants to identify who is speaking and improving the user experience of the video conference.
In one embodiment, the video pictures can be virtual reality video pictures: a virtual reality scene can be created at each meeting place through virtual reality video display terminals, or the atmosphere of an in-person meeting can be created by having participants wear virtual reality display glasses, improving the meeting experience. Such a method further improves the effect of the video conference.
A flow chart of another embodiment of the videoconference data processing method of the present invention is shown in fig. 2.
In step 201, voiceprint information of a speaker is acquired. In one embodiment, the sound collected from the microphone may be subjected to audio data processing to obtain voiceprint information of the speaker.
In step 202, feature matching is performed based on the voiceprint information, identifying voiceprint features that match the voiceprint information.
In step 203, the identity information corresponding to the matched voiceprint feature is obtained. The speaker's identity information may include the speaker's name, title, affiliation, relation to the conference, and so on; with this information, the speaker's identity and position can be placed more intuitively. It may also include contact details such as the speaker's telephone number, so that participants can contact each other directly after the meeting.
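The identity record obtained in step 203 could look like the following sketch; the field names and the caption format are assumptions for illustration, not the patent's data model.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpeakerIdentity:
    """Identity information looked up from the matched voiceprint (step 203)."""
    name: str
    title: str
    affiliation: str
    conference_role: str          # e.g. "keynote speaker"
    phone: Optional[str] = None   # enables direct contact after the meeting

    def caption(self) -> str:
        """Text to composite into the video picture (step 204)."""
        return f"{self.name} | {self.title} | {self.affiliation}"

s = SpeakerIdentity("Zhang Wei", "CTO", "Example Corp", "keynote speaker")
print(s.caption())  # -> Zhang Wei | CTO | Example Corp
```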
In step 204, the acquired identity information is video-synthesized with the video frame. In one embodiment, the identity information of the speaker may be displayed at a predetermined location of the virtual reality video frame.
In step 205, the video-combined screen is transmitted to the video terminal and displayed.
No participant can know every other participant, especially in a large conference, and knowing the sound source alone does not reveal the speaker's name, title, or conference-related information. With this method, the speaker can be recognized from the participants' voices, the speaker's identity information looked up, and that information shown to the participants in the video picture, so that participants better understand who is speaking and the speaker's background, improving the user experience of the video conference.
In one embodiment, a voiceprint library including voiceprint characteristics of the participant is established, and voiceprint information is identified based on the voiceprint library. A flow chart of yet another embodiment of the videoconference data processing method of the present invention is shown in fig. 3.
In step 301, voiceprint features of the participant's voice are extracted based on the recorded participant's voice, and a voiceprint library is generated. In one embodiment, each participant may be asked to enter a voice before the meeting begins. In another embodiment, only the voices of the participants who have not extracted the voiceprint feature may be entered.
In step 302, the voiceprint features of each participant are associated with that participant's identity information. In one embodiment, the identity information of each participant may be entered in advance, and the voiceprint features and identity information may be associated during voice entry or voiceprint feature extraction. In one embodiment, the identity information may include the participant's name, title, relation to the conference, contact details, and the like.
In step 303, during the conference, the voice of the speaker is collected and voiceprint information is extracted.
In step 304, the voiceprint information of the speaker is matched with the voiceprint features in the voiceprint library, the matched voiceprint features are determined, and then the identity information associated with the voiceprint features is obtained.
In step 305, the identity information of the speaker is video-composited with the video frame. The synthesized video picture has identity information of the speaker. The video pictures after the synthesis processing can be sent to each terminal of the conference, so that the participants can know the identity information of the speaker while watching the video pictures.
With this method, a voiceprint library containing the participants' voiceprint features can be generated and voiceprint information recognized against it, so that the speaker's voiceprint can be identified quickly and effectively and the speaker's identity information determined from the association between voiceprint features and identity information, improving operating efficiency and easing deployment.
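Steps 301 to 304 amount to an enrol-then-match loop, sketched below under strong simplifying assumptions: the "feature extractor" is a stand-in average rather than a real voiceprint algorithm, and all names and numbers are hypothetical.

```python
from typing import Dict, List, Tuple

def extract_feature(samples: List[float]) -> float:
    # Placeholder for real voiceprint feature extraction (e.g. MFCC + model).
    return sum(samples) / len(samples)

def build_voiceprint_library(
    recordings: Dict[str, List[float]]
) -> List[Tuple[float, str]]:
    """Steps 301-302: (feature, identity) pairs form the voiceprint library."""
    return [(extract_feature(v), name) for name, v in recordings.items()]

library = build_voiceprint_library({
    "Li Na":     [0.2, 0.4, 0.6],   # enrolled feature: 0.4
    "Wang Fang": [0.7, 0.9, 0.8],   # enrolled feature: 0.8
})

def identify(sample_feature: float) -> str:
    """Steps 303-304: match the live voiceprint against the library."""
    return min(library, key=lambda fv: abs(fv[0] - sample_feature))[1]

print(identify(0.41))  # -> Li Na
```

Enrolment happens once before the conference; only `identify` runs per utterance, which is why a precomputed library makes recognition fast.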
In one embodiment, when voiceprint features are extracted from the participants' voices to generate the voiceprint library, the recorded features can be stored in groups according to the meeting place of each participant. When recognizing a speaker's voiceprint, the meeting place of the speaker can first be determined from the sound source, and the voiceprint then matched only against the features in that meeting place's group. This greatly reduces the computation needed for voiceprint recognition and improves operating efficiency.
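The venue-grouped storage described above can be sketched as a two-level lookup; the venue labels, participant names, and scalar "features" are illustrative assumptions.

```python
from typing import Dict, List, Tuple

# meeting place -> list of (enrolled feature, participant name)
GROUPED_LIBRARY: Dict[str, List[Tuple[float, str]]] = {
    "Meeting place A": [(0.2, "Li Na"), (0.5, "Zhao Lei")],
    "Meeting place B": [(0.3, "Wang Fang"), (0.8, "Chen Jie")],
}

def identify_in_venue(venue: str, feature: float) -> str:
    """Match only within the speaker's meeting place, so the search scans
    one group instead of the whole voiceprint library."""
    group = GROUPED_LIBRARY[venue]
    return min(group, key=lambda fv: abs(fv[0] - feature))[1]

# The sound source tells us which meeting place the speaker is in:
print(identify_in_venue("Meeting place B", 0.78))  # -> Chen Jie
```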
In one embodiment, the speaker can be positioned and labeled according to the video picture, so that the participants can see the speaker more intuitively, and the user experience is further improved.
A flow chart of yet another embodiment of the video conference data processing method of the present invention is shown in fig. 4.
In step 401, voiceprint features of the participant's voice are extracted based on the recorded participant's voice, and a voiceprint library is generated. In one embodiment, each participant may be asked to enter a voice before the meeting begins. In another embodiment, only the voices of the participants who have not extracted the voiceprint feature may be entered.
In step 402, the voiceprint features of each participant are associated with that participant's identity information. In one embodiment, the identity information of each participant can be entered in advance, and the voiceprint features and identity information associated during voice entry or voiceprint feature extraction. In one embodiment, the identity information may include the participant's name, title, relation to the conference, contact details, and the like.
In step 403, facial features of the participants are extracted and associated with their voiceprint features. The facial features can be collected from photos uploaded by the participants, or captured while the participants' voices are being recorded.
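The enrolment record produced by steps 401 to 403 could link the three kinds of data per participant as below; the record shape, the scalar voiceprint stand-in, and the face identifier are all hypothetical.

```python
from typing import Dict, NamedTuple

class Enrolment(NamedTuple):
    """One participant's enrolment: voiceprint, facial features, and identity
    are keyed together so either biometric can reach the other two."""
    voiceprint: float   # stand-in for a real voiceprint feature vector
    face_id: str        # stand-in for extracted facial features
    title: str

ENROLMENTS: Dict[str, Enrolment] = {}

def enrol(name: str, voiceprint: float, face_id: str, title: str) -> None:
    ENROLMENTS[name] = Enrolment(voiceprint, face_id, title)

enrol("Wang Fang", 0.8, "face_042", "Project Lead")
print(ENROLMENTS["Wang Fang"].face_id)  # -> face_042
```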
In step 404, during the conference, the voice of the speaker is collected and voiceprint information is extracted.
In step 405, the voiceprint information of the speaker is matched with the voiceprint features in the voiceprint library to determine the matched voiceprint features.
In step 406, identity information and facial features associated with the voiceprint feature are obtained.
In step 407, the speaker is located in the video picture and a location marker is added; the location marker and the speaker's identity information are composited with the video picture, and the composited picture is transmitted to each terminal.
With this method, the participants' facial features can be collected, the speaker determined using the voiceprint features as the key, and the speaker located and marked in the video picture. Participants thus both learn the speaker's identity information and see the speaker directly, making the video conference more user-friendly and further improving the user experience. Especially in a virtual reality video conference scene, the speaker can be located quickly, approaching the effect of face-to-face communication.
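Steps 405 to 407 can be sketched as a locate-and-composite step; detected faces are mocked as bounding boxes keyed by a face identifier, and all names here are assumptions (a real system would run a face detector and matcher).

```python
from typing import Dict, Optional, Tuple

Box = Tuple[int, int, int, int]  # x, y, width, height

def locate_speaker(detected_faces: Dict[str, Box], face_id: str) -> Optional[Box]:
    """Steps 405-406: the matched voiceprint yields `face_id`; look it up
    among the faces detected in the current video picture."""
    return detected_faces.get(face_id)

def composite(frame: dict, identity: str, box: Optional[Box]) -> dict:
    """Step 407: attach the identity caption and the location marker."""
    out = dict(frame)
    out["caption"] = identity
    out["marker"] = box  # a highlight drawn around the speaker, if located
    return out

faces = {"face_042": (120, 80, 64, 64), "face_051": (300, 90, 60, 60)}
frame = composite({"pixels": "..."}, "Wang Fang", locate_speaker(faces, "face_042"))
print(frame["marker"])  # -> (120, 80, 64, 64)
```

Returning `None` when the face is not found lets the terminal still show the identity caption even when the speaker cannot be localized in the picture.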
A schematic diagram of one embodiment of a video conferencing platform of the present invention is shown in fig. 5. The voiceprint information extraction module 501 can obtain the voiceprint information of the speaker. In one embodiment, the voiceprint information extraction module 501 can perform audio data processing on the sound collected from the microphone to obtain the voiceprint information of the speaker. The identity information determination module 502 can recognize the voiceprint information and determine the identity information of the speaker. In one embodiment, the identity information determination module 502 may feature-match the speaker's voiceprint information against the participants' voiceprint features to determine the matching voiceprint feature, thereby determining the identity information of the speaker. The video synthesis module 503 is configured to composite the speaker's identity information with the video picture, so that the composited picture carries the speaker's identity information. The composited picture can be sent to each terminal of the conference, so that participants learn the speaker's identity while watching the video.
The video conference platform can identify the identity of the speaker according to the voice of the participants and display the identity of the speaker to the participants through the video pictures, so that the participants can conveniently identify the identity of the speaker, and the user experience of the video conference is improved.
In one embodiment, the video pictures can be virtual reality video pictures: a virtual reality scene can be created at each meeting place through virtual reality video display terminals, or the atmosphere of an in-person meeting can be created by having participants wear virtual reality display glasses, improving the meeting experience.
A schematic diagram of another embodiment of the video conferencing platform of the present invention is shown in fig. 6. The voiceprint information extraction module 61 is configured to obtain the voiceprint information of the speaker. The identity information determining module 62 comprises a voiceprint matching unit 621 and an identity information obtaining unit 622: the voiceprint matching unit 621 performs feature matching according to the voiceprint information to identify the matching voiceprint feature, and the identity information obtaining unit 622 obtains the identity information corresponding to that feature. The speaker's identity information may include the speaker's name, title, and conference-related information, which helps place the speaker's identity and position more intuitively; it may also include contact details such as the speaker's telephone number, so that participants can contact each other directly after the meeting. The video synthesizing module 63 comprises a video synthesizing unit 631 and a video sending unit 632: the video synthesizing unit 631 composites the obtained identity information with the video picture, and the video sending unit 632 sends the composited picture to the video terminals for display to the participants.
The platform can identify the speaker according to the voice of the participants, inquire the relevant identity information of the speaker, and display the identity information to the participants through the video pictures, so that the participants can know the identity of the speaker and the background information of the speaker better, and the user experience of the video conference is improved.
A schematic diagram of yet another embodiment of a video conferencing platform of the present invention is shown in fig. 7. The structures and functions of the voiceprint information extraction module 701, the identity information determination module 702, and the video composition module 703 are similar to those in the embodiment of fig. 5. The video conferencing platform also includes a voiceprint feature extraction module 704 and an identity information association module 705. The voiceprint feature extraction module 704 extracts voiceprint features from the recorded voices of the participants to generate a voiceprint library; the identity information association module 705 associates each participant's voiceprint features with that participant's identity information. In one embodiment, the identity information of each participant can be entered in advance, and the voiceprint features and identity information associated during voice entry or voiceprint feature extraction. In one embodiment, the identity information may include the participant's name, title, relation to the conference, contact details, and the like.
The platform can generate a voiceprint library containing the participants' voiceprint features and recognize voiceprint information against it, so that the speaker's voiceprint can be identified quickly and effectively and the speaker's identity information determined from the association between voiceprint features and identity information, improving operating efficiency and easing deployment.
In one embodiment, when the voiceprint feature extraction module 704 extracts voiceprint features from the participants' voices to generate the voiceprint library, the recorded features can be stored in groups according to the meeting place of each participant. When recognizing a speaker's voiceprint, the platform can first determine the speaker's meeting place from the sound source and then match the voiceprint only against the features in that meeting place's group. This greatly reduces the computation needed for voiceprint recognition and improves operating efficiency.
In one embodiment, the video conference platform can also position and mark the speaker according to the video picture, so that the participants can see the speaker more intuitively, and the user experience is further improved.
A schematic diagram of yet another embodiment of a video conferencing platform of the present invention is shown in fig. 8. The voiceprint feature extraction module 804 extracts voiceprint features from the recorded voices of the participants and generates a voiceprint library. The identity information association module 805 associates each participant's voiceprint features with that participant's identity information. The facial feature extraction module 806 extracts the participants' facial features, either from photos uploaded by the participants or while the participants' voices are being recorded. The facial feature association module 807 associates the participants' facial features with their voiceprint features.
The voiceprint information extraction module 801 extracts voiceprint information from the speaker's voice collected during the conference. The identity information determination module 802 matches the speaker's voiceprint information against the voiceprint features in the voiceprint library, determines the matching voiceprint feature, and obtains the identity information associated with it. The facial feature acquisition module 808 obtains the facial feature information associated with the matched voiceprint feature. The facial feature positioning module 809 locates the speaker in the video picture according to the acquired facial feature information. The video composition module 803 composites the speaker's location marker and identity information with the video picture, and the composited picture is transmitted to each terminal.
The platform can analyze the participants' facial features, determine the speaker using the voiceprint features as the key, and locate and mark the speaker in the video picture. Participants thus both learn the speaker's identity information and see the speaker directly, making the video conference more user-friendly and further improving the user experience. Especially in a virtual reality video conference scene, the speaker can be located quickly, approaching the effect of face-to-face communication.
Finally, it should be noted that the above examples are only intended to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art will understand that modifications may be made to specific embodiments, or equivalent substitutions made for some technical features, without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A method for processing videoconference data, comprising:
extracting voiceprint characteristics of the voice of the participants based on the recorded voice of the participants to generate a voiceprint library;
associating the voiceprint characteristics of the participant's voice with the identity information of the participant;
acquiring voiceprint information of a speaker;
recognizing the voiceprint information and determining the identity information of the speaker;
acquiring the facial features of the speaker according to the association between the voiceprint features and the facial features;
locating the speaker in a video picture according to the facial features of the speaker;
and compositing the identity information of the speaker and the location marker of the speaker with the video picture to display the composited video picture, wherein a virtual reality scene is created at a meeting place through a virtual reality video display terminal, or the atmosphere of a live meeting is created by having participants wear virtual reality video display glasses; the video picture is a virtual reality video picture, and the video terminal is a virtual reality video display terminal.
2. The method of claim 1, wherein the identifying the voiceprint information and determining the identity information of the speaker comprises:
performing feature matching according to the voiceprint information, and identifying voiceprint features matched with the voiceprint information;
and acquiring the identity information of the speaker corresponding to the matched voiceprint characteristics.
3. The method of claim 2, wherein synthesizing the identity information of the speaker with the video picture and displaying it comprises:
synthesizing the identity information of the speaker with a video picture;
and sending the synthesized video picture to a video terminal for display.
4. The method of claim 1, further comprising:
extracting the facial features of the participants;
and associating the facial features of each participant with that participant's voiceprint features.
5. A video conference platform, comprising:
a voiceprint feature extraction module for extracting voiceprint features from the recorded voices of participants to generate a voiceprint library;
an identity information association module for associating the voiceprint features of each participant's voice with the identity information of that participant;
a voiceprint information extraction module for acquiring voiceprint information of a speaker;
an identity information determination module for recognizing the voiceprint information and determining the identity information of the speaker;
a facial feature acquisition module for acquiring the facial features of the speaker according to the association relationship between voiceprint features and facial features;
a facial feature positioning module for locating the speaker in a video picture according to the facial features of the speaker;
and a video synthesis module for synthesizing the identity information of the speaker and a location marker of the speaker with the video picture so as to display the synthesized video picture, wherein a virtual reality scene is built at the meeting site through a virtual reality video display terminal, or the meeting site is presented by the participants wearing virtual reality display glasses, the video picture being a virtual reality video picture and the video terminal being a virtual reality video display terminal.
6. The platform of claim 5, wherein the identity information determination module comprises:
a voiceprint matching unit for performing feature matching on the voiceprint information and identifying the matched voiceprint features;
and an identity information acquisition unit for acquiring the identity information of the speaker corresponding to the matched voiceprint features.
7. The platform of claim 6, wherein the video synthesis module comprises:
a video synthesis unit for synthesizing the identity information of the speaker with a video picture;
and a video sending unit for sending the synthesized video picture to the video terminal for display.
8. The platform of claim 5, further comprising:
a facial feature extraction module for extracting the facial features of the participants;
and a facial feature association module for associating the facial features of each participant with that participant's voiceprint features.
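The flow of claims 1–2 and 4 — enrolling each participant's voiceprint features together with identity information and facial features, then identifying a speaker by feature matching against the library — can be sketched in Python. This is a minimal illustration only: the class and method names, the use of cosine similarity as the matcher, and the acceptance threshold are all assumptions, since the patent does not specify a particular feature representation or matching algorithm.

```python
import math

def cosine_similarity(a, b):
    """Similarity between two feature vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class VoiceprintLibrary:
    """Associates each participant's voiceprint feature vector with
    identity information and, optionally, facial features (claims 1 and 4)."""

    def __init__(self):
        self.entries = []  # list of (voiceprint, identity_info, face_features)

    def enroll(self, voiceprint, identity_info, face_features=None):
        # Build the library from recorded participant voices.
        self.entries.append((voiceprint, identity_info, face_features))

    def identify(self, voiceprint, threshold=0.8):
        """Feature matching per claim 2: return the identity information and
        facial features of the best-matching enrolled voiceprint, or
        (None, None) if no match clears the threshold."""
        best = max(self.entries,
                   key=lambda e: cosine_similarity(e[0], voiceprint),
                   default=None)
        if best and cosine_similarity(best[0], voiceprint) >= threshold:
            return best[1], best[2]
        return None, None
```

A downstream video synthesis module would then use the returned facial features to locate the speaker in the video picture and overlay the identity information and a location marker, as in claims 1 and 3.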
CN201610283899.1A 2016-04-29 2016-04-29 Video conference data processing method and platform Active CN107333090B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610283899.1A CN107333090B (en) 2016-04-29 2016-04-29 Video conference data processing method and platform

Publications (2)

Publication Number Publication Date
CN107333090A CN107333090A (en) 2017-11-07
CN107333090B true CN107333090B (en) 2020-04-07

Family

ID=60192620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610283899.1A Active CN107333090B (en) 2016-04-29 2016-04-29 Video conference data processing method and platform

Country Status (1)

Country Link
CN (1) CN107333090B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108922538B (en) * 2018-05-29 2023-04-07 平安科技(深圳)有限公司 Conference information recording method, conference information recording device, computer equipment and storage medium
CN109561273A (en) * 2018-10-23 2019-04-02 视联动力信息技术股份有限公司 The method and apparatus for identifying video conference spokesman
CN109194906B (en) * 2018-11-06 2020-09-11 苏州科达科技股份有限公司 Video conference authentication system, method, device and storage medium
CN111182256A (en) * 2018-11-09 2020-05-19 中移(杭州)信息技术有限公司 Information processing method and server
CN112004046A (en) * 2019-05-27 2020-11-27 中兴通讯股份有限公司 Image processing method and device based on video conference
CN110996021A (en) * 2019-11-30 2020-04-10 咪咕文化科技有限公司 Director switching method, electronic device and computer readable storage medium
CN111651632A (en) * 2020-04-23 2020-09-11 深圳英飞拓智能技术有限公司 Method and device for outputting voice and video of speaker in video conference
CN111818294A (en) * 2020-08-03 2020-10-23 上海依图信息技术有限公司 Method, medium and electronic device for multi-person conference real-time display combined with audio and video
CN113014857A (en) * 2021-02-25 2021-06-22 游密科技(深圳)有限公司 Control method and device for video conference display, electronic equipment and storage medium
CN113473066A (en) * 2021-05-10 2021-10-01 上海明我信息技术有限公司 Video conference picture adjusting method
CN113542604A (en) * 2021-07-12 2021-10-22 口碑(上海)信息技术有限公司 Video focusing method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103680497A (en) * 2012-08-31 2014-03-26 百度在线网络技术(北京)有限公司 Voice recognition system and voice recognition method based on video
CN104427292A (en) * 2013-08-22 2015-03-18 中兴通讯股份有限公司 Method and device for extracting a conference summary
CN104639777A (en) * 2013-11-14 2015-05-20 中兴通讯股份有限公司 Conference control method, conference control device and conference system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7668304B2 (en) * 2006-01-25 2010-02-23 Avaya Inc. Display hierarchy of participants during phone call
CN102263772A (en) * 2010-05-28 2011-11-30 经典时空科技(北京)有限公司 Virtual conference system based on three-dimensional technology
CN103888714B (en) * 2014-03-21 2017-04-26 国家电网公司 3D scene network video conference system based on virtual reality
CN104580986A (en) * 2015-02-15 2015-04-29 王生安 Video communication system combining virtual reality glasses

Similar Documents

Publication Publication Date Title
CN107333090B (en) Video conference data processing method and platform
US9064160B2 (en) Meeting room participant recogniser
CN107911646B (en) Method and device for sharing conference and generating conference record
WO2018107605A1 (en) System and method for converting audio/video data into written records
US7920158B1 (en) Individual participant identification in shared video resources
US9282284B2 (en) Method and system for facial recognition for a videoconference
KR101636716B1 (en) Apparatus of video conference for distinguish speaker from participants and method of the same
CN110139062B (en) Video conference record creating method and device and terminal equipment
CN112653902B (en) Speaker recognition method and device and electronic equipment
CN107527623B (en) Screen transmission method and device, electronic equipment and computer readable storage medium
US20080235724A1 (en) Face Annotation In Streaming Video
CN106331293A (en) Incoming call information processing method and device
JP2007241130A (en) System and device using voiceprint recognition
CN112532931A (en) Video processing method and device and electronic equipment
CN114240342A (en) Conference control method and device
US20160260435A1 (en) Assigning voice characteristics to a contact information record of a person
CN112151041B (en) Recording method, device, equipment and storage medium based on recorder program
CN114257778A (en) Teleconference system and multi-microphone voice recognition playing method
CN113784058A (en) Image generation method and device, storage medium and electronic equipment
CN116472705A (en) Conference content display method, conference system and conference equipment
CN111160051B (en) Data processing method, device, electronic equipment and storage medium
CN112532912A (en) Video processing method and device and electronic equipment
CN114762039A (en) Conference data processing method and related equipment
CN114764690A (en) Method, device and system for intelligently conducting conference summary
CN112671632A (en) Intelligent earphone system based on face recognition and information interaction and/or social contact method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant