CN113920560A - Method, device and equipment for identifying identity of multi-modal speaker - Google Patents

Method, device and equipment for identifying identity of multi-modal speaker

Info

Publication number
CN113920560A
Authority
CN
China
Prior art keywords
speaker
audio
features
data
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111092312.6A
Other languages
Chinese (zh)
Inventor
程虎
殷保才
刘文超
李渊强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202111092312.6A
Publication of CN113920560A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/18: Artificial neural networks; Connectionist approaches
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis characterised by the analysis technique
    • G10L25/30: Speech or voice analysis characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method, a device and equipment for identifying the identity of a multi-modal speaker, wherein the method comprises the following steps: acquiring video data and audio data of a session scene; performing face detection and lip detection on the video data to obtain sub-video data of participants and face frame data and lip frame sequences in the sub-video data; determining speakers in all the participants and audio data corresponding to the speakers according to the lip-shaped frame sequences of the participants and the audio data; extracting visual characteristics of the speaker according to the face frame data of the speaker, and extracting audio characteristics of the speaker according to audio data corresponding to the speaker; and identifying the identity of the speaker according to the visual characteristic and the audio characteristic. The method and the device can improve the accuracy of speaker identity recognition under complex and various conversation scenes.

Description

Method, device and equipment for identifying identity of multi-modal speaker
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, and a device for recognizing the identity of a multi-modal speaker.
Background
Conventionally, producing a meeting record requires a dedicated recorder to transcribe and organize the meeting into a summary, which demands considerable professional knowledge and labor. Speaker diarization technology solves the problem of "who spoke when": it can automatically separate the content of different speakers and match each person's speech with pre-registered identity information to generate a meeting record. Improving speaker identity recognition capability is therefore particularly important in conversation scenes.
Because speech is easy to acquire, mainstream speaker identity recognition in current conversation scenes is mainly based on voiceprint features. To match a voiceprint with the identity of a target person, participants usually need to register a voiceprint library in advance. The overall process mainly comprises: separating each speaker's content using speaker diarization, then extracting voiceprint features of the same speaker and matching them against the pre-registered voiceprint library to determine the speaker's identity information.
Face recognition technology is rarely applied to conversation scenes, mainly because such scenes contain many side-face and back-of-head views, and in large conference rooms the target is too far from the camera to capture clear facial features. Face recognition mainly works by first obtaining the target region through face detection and then extracting features from the face region to match against a face library. In near-field scenes, clear facial features can be acquired, while voiceprints are easily affected by background noise, changes in the target person's speaking tone, and the like, so face recognition often achieves the better recognition result there.
The accuracy of voiceprint-based speaker recognition is closely related to the size of the voiceprint library, its gender distribution, environmental noise, and so on. When the voiceprint library is large, the accuracy of voiceprint matching drops noticeably; in addition, voiceprints distinguish speakers of the same gender less well than speakers of opposite genders, and strong environmental noise further degrades recognition. Face recognition, while highly accurate, is mainly applicable to near-field scenes where distinctive facial features can be extracted. For complex conference-room scenes with occlusion, walking, long distances, and the like, it is difficult to guarantee clear facial features of the target person at every moment, so identity recognition based on facial features alone is unsatisfactory.
In bank transaction scenes, multi-modal, multi-orientation authentication is adopted to improve identity recognition accuracy. However, a bank scene generally assumes that only one person authenticates at a time, i.e., a human-machine interaction mode in which the face region is manually placed within the device's sensing frame, and the multi-modal authentication is basically cascaded: after the first modality passes, the next modality is verified; if a modality fails, the device prompts the user and verification continues. A conversation scene is a much freer setting, and such a human-machine interaction mode would be difficult to popularize there. Moreover, a conversation scene usually involves multiple participants, so how to match the audio at the current moment with the facial features of the participants is an urgent problem for the task of multi-modal speaker identity recognition in conversation scenes.
At present, some solutions fuse multi-modal information to identify participants, which offers one approach to this problem, but many issues remain. For example, some solutions use participant expressions to determine the current speaker, but such localization is not very accurate, especially in real scenes where the speaker may not show rich expressive features and it is difficult to associate facial expressions with speech. Other solutions use microphone arrays and sound source localization to locate the speaker and bind the corresponding visual features to the localized position. Although sound source localization can locate the speaker, speakers separated by a small angle are difficult to distinguish. Such schemes also place greater demands on hardware: conference equipment must be fitted with a multi-channel microphone array, and different array types need to be customized, which greatly hinders product popularization.
Disclosure of Invention
The present application has been made to solve at least one of the above problems. According to an aspect of the present application, there is provided a method for identifying identities of multimodal speakers, the method including: acquiring video data and audio data of a session scene; performing face detection and lip detection on the video data to obtain sub-video data of participants and face frame data and lip frame sequences in the sub-video data; determining speakers in all the participants and audio data corresponding to the speakers according to the lip-shaped frame sequences of the participants and the audio data; extracting visual characteristics of the speaker according to the face frame data of the speaker, and extracting audio characteristics of the speaker according to audio data corresponding to the speaker; and identifying the identity of the speaker according to the visual characteristic and the audio characteristic. Wherein the session scenario may be a conference scenario.
In one embodiment of the present application, the determining the speaker of all participants and the audio data corresponding to the speaker according to the sequence of the lip-shaped boxes and the audio data of the participants comprises: inputting the audio data of the conversation scene into a trained multi-modal speaker detection model in a sliding window mode; and polling the lip-shaped frame sequences in the sub-video data of all the participants by the trained multi-modal speaker detection model aiming at the audio data in each sliding window so as to determine the speaker corresponding to the audio data in each sliding window.
In one embodiment of the present application, said polling the sequence of the lip-boxes in the sub-video data of all the participants to determine the speaker corresponding to the audio data in each sliding window comprises: performing the following operation on each frame data of the sub-video data of each participant: inputting M frame data before the frame data, the frame data and a lip-shaped frame sequence in the M frame data after the frame data into the trained multi-modal speaker detection model, wherein M is a natural number greater than 0; and extracting visual features from the lip-shaped frame sequence by the multi-modal speaker detection model, extracting audio features from the audio data in the sliding window, splicing and fusing the video features and the audio features, then extracting time sequence relation, and outputting the voice activation detection score of the frame data so as to determine whether the participant is the speaker corresponding to the audio data in the sliding window.
In one embodiment of the present application, the multimodal speaker detection model includes a visual feature extraction network, an audio feature extraction network, and a long short-term memory network.
In one embodiment of the present application, the identifying the speaker according to the visual characteristic and the audio characteristic includes: respectively matching the visual features and the audio features with features in a database to obtain matching results of the visual features and the audio features; determining a multi-mode fusion strategy according to the matching result of the visual characteristic and the matching result of the audio characteristic, and obtaining an identity recognition result of the speaker according to the determined multi-mode fusion strategy; wherein the multi-modal fusion strategy comprises: determining an identification result of the speaker according to both the matching result of the visual feature and the matching result of the audio feature; or determining the identification result of the speaker according to one of the matching result of the visual characteristic and the matching result of the audio characteristic.
In one embodiment of the present application, the identifying the speaker according to the visual characteristic and the audio characteristic includes: matching the visual features with features in a first database to obtain the first N identification marks matched with the visual features and the visual similarity corresponding to each identification mark, wherein N is a natural number and is greater than or equal to 1; matching the audio features with features in a second database to obtain the first N identity identifications matched with the audio features and audio similarity corresponding to each identity identification, wherein N is a natural number and is greater than or equal to 1; the determining a multi-modal fusion strategy according to the matching result of the visual features and the matching result of the audio features, and obtaining the identification result of the speaker according to the determined multi-modal fusion strategy, includes: when the same identification exists in the first N identifications matched with the visual features and the first N identifications matched with the audio features, calculating the weighted average value of the visual similarity and the audio similarity corresponding to each identification in the same identification, and determining the identification with the largest weighted average value in the same identifications as the identification result of the speaker; and when the same identification does not exist in the first N identifications matched with the visual features and the first N identifications matched with the audio features, determining the maximum value of the visual similarity and the audio similarity, and determining the identification corresponding to the maximum value as the identification result of the speaker.
In an embodiment of the present application, the extracting the visual characteristics of the speaker according to the face frame data of the speaker includes: extracting the characteristics of each face frame in the sub-video data of the speaker to obtain the visual characteristics of each face frame; and averaging the visual characteristics of all the face frames in the sub-video data of the speaker to obtain the visual characteristics of the speaker.
In an embodiment of the present application, the extracting audio features of the speaker according to the audio data corresponding to the speaker includes: performing sliding window processing on the audio data corresponding to the speaker; extracting audio features from the audio data in each sliding window; and averaging the audio features of the audio data in all the sliding windows of the audio data corresponding to the speaker to obtain the audio features of the speaker.
In one embodiment of the present application, the video data and the audio data are video data and audio data of the whole conference, or the video data and the audio data are data collected in real time during the conference.
According to another aspect of the present application, there is provided a multi-modal speaker identification apparatus, the apparatus comprising a memory and a processor, the memory having stored thereon a computer program executed by the processor, the computer program, when executed by the processor, causing the processor to execute the multi-modal speaker identification method.
According to still another aspect of the present application, there is provided a multi-modal speaker identification apparatus, the apparatus including an image acquisition device, a pickup device, and the multi-modal speaker identification device, wherein the image acquisition device is configured to acquire conference video data, the pickup device is configured to acquire conference audio data, and the multi-modal speaker identification device is configured to perform multi-modal speaker identification based on the conference video data and the conference audio data.
In one embodiment of the present application, the sound pickup apparatus is a single-channel microphone array.
According to the method, the device and the equipment for identifying the identity of the multi-modal speaker, the multi-modal VAD technology and the identity identification of the speaker are fused, the audio characteristics and the visual characteristics are matched, and the accuracy of the identity identification of the speaker can be improved in a complex and various conversation scene.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 shows a schematic flow diagram of a method for multi-modal speaker identity recognition according to an embodiment of the present application.
FIG. 2 is a schematic diagram illustrating a polling procedure of a multi-modal speaker detection model employed in a multi-modal speaker identification method according to an embodiment of the present application.
Fig. 3 is a schematic diagram illustrating a processing flow of sub-video data of each participant by using a multi-modal speaker detection model in the multi-modal speaker identification method according to an embodiment of the present application.
Fig. 4 illustrates an exemplary diagram of a multi-modal fusion strategy employed in a multi-modal speaker identity recognition method according to an embodiment of the present application.
Fig. 5 is a schematic diagram summarizing an overall flow of a multi-modal speaker identity recognition method according to an embodiment of the present application.
Fig. 6 is a block diagram illustrating a schematic structure of a multi-modal speaker identification apparatus according to an embodiment of the present application.
Fig. 7 is a block diagram illustrating a schematic structure of a multi-modal speaker identification apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, exemplary embodiments according to the present application will be described in detail below with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application described in the present application without inventive step, shall fall within the scope of protection of the present application.
FIG. 1 is a schematic flow chart diagram of a method 100 for identifying identities of multi-modal speakers in accordance with an embodiment of the present application. As shown in FIG. 1, the method 100 for multi-modal speaker identification may include the steps of:
in step S110, video data and audio data of a conversation scene are acquired.
In step S120, face detection and lip detection are performed on the video data to obtain sub-video data of the participant and face frame data and lip frame sequences in the sub-video data.
In step S130, the speakers of all participants and the audio data corresponding to the speakers are determined based on the lip box sequences and the audio data of the participants.
In step S140, the visual characteristics of the speaker are extracted according to the face frame data of the speaker, and the audio characteristics of the speaker are extracted according to the audio data corresponding to the speaker.
In step S150, the speaker is identified according to the visual characteristic and the audio characteristic.
In the embodiment of the application, by extracting the face frame data and lip box sequences of the participants from the video data of the conversation scene, the speakers among all participants and the audio data corresponding to each speaker can be determined; that is, speakers are selected from the participants according to the consistency between lip motion features and audio features, so that the speaker's audio data and visual features are bound based on a multi-modal speaker detection (multi-modal Voice Activity Detection, abbreviated VAD) technique and speakers are distinguished more accurately. The identity of the speaker is then recognized from the speaker's visual features and corresponding audio features. Because the speaker is determined accurately, the identity recognition result that combines the speaker's visual and audio features is also more accurate and can cope with complex and varied conversation scenes.
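For orientation only, the overall flow of steps S110 to S150 can be sketched in Python; every helper name here (detect_faces_and_lips, match_audio_to_speakers, and so on) is a hypothetical placeholder for the components described in the following paragraphs, not an API defined by this application.

```python
# Illustrative sketch of the overall pipeline (steps S110-S150).
# All helper names are hypothetical stand-ins for the components
# described in the following sections.

def identify_speakers(video, audio, vad_model, face_db, voice_db):
    # S120: face/lip detection -> per-participant sub-videos with
    # face boxes and lip-box sequences
    participants = detect_faces_and_lips(video)

    results = []
    # S130: multi-modal VAD binds each audio window to a speaker
    for window, speaker in match_audio_to_speakers(audio, participants, vad_model):
        if speaker is None:          # silence or an unmatched window
            continue
        # S140: modality-specific features of the detected speaker
        visual_feat = extract_visual_feature(speaker.face_boxes)
        audio_feat = extract_audio_feature(window)
        # S150: decision-level fusion against the registered databases
        identity = fuse_and_identify(visual_feat, audio_feat, face_db, voice_db)
        results.append((window.start, window.end, identity))
    return results
```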
In the embodiment of the present application, various open-source models may be used for the face detection in step S120. Few open-source schemes perform lip detection directly, but many lip key-point detection models are available to choose from. The present application does not limit which scheme is used for the detection model. For convenience, in the solution of the present application, face boxes and lip boxes may be manually labeled on self-collected street and campus data, and the face box and lip box are then detected simultaneously based on the YOLOv5 framework. The lip boxes together with the audio data can be used to train the multi-modal speaker detection model, and the face boxes can be used to extract facial features, as described later.
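As an illustration of the tracking-by-IoU idea later used to group per-frame detections into per-participant sub-videos, the following sketch links face boxes across frames greedily by intersection-over-union. The detector callable, the tuple layout, and the 0.5 threshold are assumptions for illustration, not specifics of the application.

```python
import numpy as np

def iou(box_a, box_b):
    # boxes are (x1, y1, x2, y2)
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xb - xa) * max(0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def track_participants(frames, detector, iou_thresh=0.5):
    """Greedy IoU tracking of face boxes to build per-participant tracks.

    `detector(frame)` is a placeholder for any face/lip detector (e.g. a
    YOLOv5-style model trained to output face and lip boxes); it is assumed
    to return a list of (face_box, lip_box) tuples per frame.
    """
    tracks = []  # each track: list of (frame_idx, face_box, lip_box)
    for t, frame in enumerate(frames):
        for face_box, lip_box in detector(frame):
            best, best_iou = None, iou_thresh
            for track in tracks:
                last_face = track[-1][1]
                score = iou(face_box, last_face)
                if score > best_iou:
                    best, best_iou = track, score
            if best is None:          # no overlapping track: new participant
                best = []
                tracks.append(best)
            best.append((t, face_box, lip_box))
    return tracks
```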
In an embodiment of the present application, the determining the speaker and the audio data corresponding to the speaker among all the participants according to the lip box sequences and the audio data of the participants in step S130 may include: inputting audio data of a conversation scene into a trained multi-modal speaker detection model (hereinafter abbreviated as multi-modal VAD) in a sliding window mode; and polling the lip-shaped frame sequences in the sub-video data of all the participants by the trained multi-modal speaker detection model aiming at the audio data in each sliding window so as to determine the speaker corresponding to the audio data in each sliding window. Described below in conjunction with fig. 2.
As shown in fig. 2, the current audio data (the audio data in one sliding window) and the lip box sequence of participant 1 (lip sequence for short) are input to the multi-modal VAD, which outputs a voice activity detection score (VAD score for short). When the visual information (the lip box sequence) of participant 1 is inconsistent with the speech information (the audio data), the multi-modal VAD score is low, indicating that participant 1 is not the speaker of the current audio data, i.e., the current audio data does not match participant 1; conversely, if the multi-modal VAD score is above a threshold, the current audio data matches participant 1. If the current audio data matches participant 1, the lip box sequences of the other participants (participant 2 through participant N) need not be input. If the current audio data does not match participant 1, the current audio data and the lip box sequence of participant 2 are input to the multi-modal VAD, which outputs a VAD score, exactly as described for participant 1. Likewise, if the current audio data matches participant 2, the remaining participants (participant 3 through participant N) need not be input; otherwise, the current audio data and the lip box sequence of participant 3 are input to the multi-modal VAD, and so on. In this way, the speaker corresponding to the audio data within each sliding window can finally be determined.
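A minimal sketch of this polling procedure is given below, assuming a vad_model callable that scores a (lip sequence, audio window) pair and a 0.5 decision threshold; the accessor names and the window parameters (2 s window, 1 s shift, taken from the example later in the text) are illustrative assumptions.

```python
def find_speaker_for_window(audio_window, participants, vad_model, threshold=0.5):
    """Poll participants' lip-box sequences against one audio window.

    `vad_model` is assumed to return a voice-activity score in [0, 1] for a
    (lip_sequence, audio_window) pair; names and threshold are illustrative.
    """
    for participant in participants:
        lip_seq = participant.lip_sequence_for(audio_window)  # hypothetical accessor
        score = vad_model(lip_seq, audio_window)
        if score > threshold:
            return participant          # lips and audio are consistent
    return None                         # no speaker matches (e.g. silence)

def match_audio_to_speakers(audio, participants, vad_model,
                            win_len=2.0, win_shift=1.0):
    # Slide a 2 s window with a 1 s shift over the audio.
    # `audio.slice` and `audio.duration` are hypothetical helpers.
    t = 0.0
    while t + win_len <= audio.duration:
        window = audio.slice(t, t + win_len)
        yield window, find_speaker_for_window(window, participants, vad_model)
        t += win_shift
```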
In an embodiment of the present application, the aforementioned polling of the lip box sequences in the sub-video data of all the participants to determine the speaker corresponding to the audio data in each sliding window may include: performing the following operation for each frame of the sub-video data of each participant: inputting the lip box sequence consisting of the M frames before the frame, the frame itself, and the M frames after the frame into the trained multi-modal speaker detection model, where M is a natural number greater than 0; the multi-modal speaker detection model extracts visual features from the lip box sequence and audio features from the audio data in the sliding window, concatenates and fuses the visual and audio features, models their temporal relation, and outputs the voice activity detection score of the frame, which determines whether the participant is the speaker corresponding to the audio data in the sliding window. This is described below in conjunction with fig. 3.
Fig. 3 is a schematic diagram illustrating the processing flow applied by the multi-modal speaker detection model to the sub-video data of each participant in the multi-modal speaker identification method according to an embodiment of the present application. As shown in fig. 3, the multi-modal VAD may include a visual feature extraction network (the visual encoder in fig. 3), an audio feature extraction network (the audio encoder in fig. 3), and a long short-term memory network (LSTM in fig. 3). The lip box sequence of a participant is input to the visual feature extraction network, which extracts visual features; the audio data in the sliding window is input to the audio feature extraction network, which extracts audio features. The visual and audio features are then concatenated and fused and input to the long short-term memory network to model the temporal relation, and finally the VAD score (denoted VAD Score in fig. 3) is output.
When the lip box sequence is input, the lip boxes of the M frames before the current frame, the current frame itself, and the M frames after the current frame may be input to the trained multi-modal VAD, where M is a natural number greater than 0. For example, in one example M equals 2, i.e., the 2 frames before and after the current frame are input to the multi-modal VAD together with the current frame (5 lip boxes in total). Generally, the lip boxes may be uniformly scaled to a preset size, such as 80 x 80, so the corresponding input to the visual feature extraction network is 5 x 80 x 80. In one example, the window length of the audio sliding window is 2 seconds and the window shift is 1 second. The embodiments of the present application do not limit the choice of visual feature extraction network and audio feature extraction network. Illustratively, the visual feature extraction network may be 3D-ResNet18 and the audio feature extraction network may be SincNet; the visual feature extraction network may be initialized from an open-source model trained on lip-reading data.
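For concreteness, a minimal PyTorch sketch of the fig. 3 architecture is shown below. The tiny 3D and 1-D convolutional encoders stand in for 3D-ResNet18 and SincNet respectively, and all dimensions (128-d features, 80 x 80 lip crops, a 2 s window at an assumed 16 kHz sample rate) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultimodalVAD(nn.Module):
    """Sketch of the multi-modal VAD in fig. 3 (assumed dimensions).

    The real visual encoder is 3D-ResNet18 and the audio encoder is
    SincNet-like; tiny CNNs are used here only to keep the sketch short.
    """
    def __init__(self, feat_dim=128, hidden=128):
        super().__init__()
        # visual encoder: input (B, 1, 5, 80, 80) = 5 grey lip crops of 80 x 80
        self.visual_encoder = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, feat_dim))
        # audio encoder: input (B, 1, T) raw waveform of the sliding window
        self.audio_encoder = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=251, stride=10), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, feat_dim))
        # temporal modelling and per-frame score
        self.lstm = nn.LSTM(2 * feat_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, lips, audio):
        v = self.visual_encoder(lips)                   # (B, feat_dim)
        a = self.audio_encoder(audio)                   # (B, feat_dim)
        fused = torch.cat([v, a], dim=-1).unsqueeze(1)  # (B, 1, 2*feat_dim)
        out, _ = self.lstm(fused)
        return torch.sigmoid(self.score(out[:, -1]))    # VAD score in [0, 1]

# vad = MultimodalVAD()
# score = vad(torch.randn(1, 1, 5, 80, 80), torch.randn(1, 1, 32000))
```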
In general, each frame of data can be given a VAD score, and based on the per-frame VAD scores it can finally be determined whether a participant matches the current audio data. In this way, if the video data and audio data acquired in step S110 are collected in real time during a conference, steps S120 and S130 can determine the speaker corresponding to each collected piece of audio data in real time. If the video data and audio data obtained in step S110 cover the whole conference (i.e., offline processing after the conference has finished), steps S120 and S130 can sort out all speakers appearing in the audio data and the audio segments corresponding to each speaker. In addition, non-speaking participants and silent periods are filtered out by steps S120 and S130.
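One possible way to turn the per-frame VAD scores into speaker segments (and to drop silent periods) is sketched below; the data layout, frame rate, and threshold are assumptions for illustration.

```python
def frames_to_segments(frame_scores, fps=25.0, threshold=0.5):
    """Turn per-frame VAD scores into (speaker, start, end) segments.

    `frame_scores` is assumed to be a list where entry t maps participant id
    to that participant's VAD score at frame t; all names are illustrative.
    """
    segments, current = [], None         # current = [speaker, start_frame]
    for t, scores in enumerate(frame_scores):
        speaker = max(scores, key=scores.get) if scores else None
        if speaker is not None and scores[speaker] < threshold:
            speaker = None               # silence or non-speech frame
        if current and current[0] != speaker:
            segments.append((current[0], current[1] / fps, t / fps))
            current = None
        if speaker is not None and current is None:
            current = [speaker, t]
    if current:
        segments.append((current[0], current[1] / fps, len(frame_scores) / fps))
    return segments                      # silent stretches produce no segment
```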
The following continues to describe steps S140 and S150 subsequent to the method 100.
In an embodiment of the present application, the extracting the visual characteristics of the speaker according to the face frame data of the speaker in step S140 may include: extracting the characteristics of each face frame in the sub-video data of the speaker to obtain the visual characteristics of each face frame; and averaging the visual characteristics of all the face frames in the sub-video data of the speaker to obtain the visual characteristics of the speaker.
Because the visual target obtained in a conversation scene is not a single-frame image but continuous video information, feature extraction is performed on each face box (obtained in step S120) of the speaker's sub-video data, i.e., the video data of one participant within the overall video (a participant who is the speaker of at least one segment of the obtained audio data); for example, the sub-video data can be obtained by tracking and matching the face detection boxes over time based on the intersection-over-union (IoU). The visual feature is then updated with a moving-average method to obtain the visual feature of the speaker. Illustratively, a 2D-ResNet18 face recognition model is trained for feature extraction, and each target is encoded into a 512-dimensional feature vector. In addition, before the visual features are extracted, preprocessing such as face alignment and color normalization can be applied to the face box to obtain an aligned face image.
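A hedged sketch of this speaker-level visual feature is given below, using torchvision's ResNet-18 backbone as a stand-in for the trained 2D-ResNet18 face-recognition model; the input size and the L2 normalization are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Stand-in face encoder: torchvision's ResNet-18 with the classifier removed.
# The application trains a dedicated 2D-ResNet18 face-recognition model that
# outputs 512-d embeddings; a generic backbone is used here only as a sketch.
backbone = models.resnet18(weights=None)
backbone.fc = nn.Identity()              # keep the 512-d pooled feature
backbone.eval()

@torch.no_grad()
def speaker_visual_feature(aligned_faces):
    """Average the per-frame face embeddings of one speaker's sub-video.

    `aligned_faces` is assumed to be a float tensor (T, 3, 112, 112) of
    aligned, colour-normalised face crops.
    """
    feats = backbone(aligned_faces)                 # (T, 512)
    feats = nn.functional.normalize(feats, dim=-1)
    return feats.mean(dim=0)                        # speaker-level feature
```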
In an embodiment of the present application, the extracting the audio feature of the speaker according to the audio data corresponding to the speaker in step S140 may include: performing sliding window processing on audio data corresponding to a speaker; extracting audio features from the audio data in each sliding window; and averaging the audio features of the audio data in all the sliding windows of the audio data corresponding to the speaker to obtain the audio features of the speaker.
In one example, the window length of the sliding window is 2 seconds and the window shift is 1 second. The present application does not limit which feature extraction network is used for the audio features. Illustratively, the audio data of each sliding window may be encoded into a 512-dimensional feature vector using a text-independent x-vector extraction network. The audio features of the same speaker are then averaged to obtain that speaker's audio feature.
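The windowing-and-averaging step can be sketched as follows; embed_fn is a placeholder for whatever utterance-level embedder is used (e.g. an x-vector network returning a 512-d vector), and the 2 s / 1 s window parameters follow the example above.

```python
import numpy as np

def speaker_audio_feature(waveform, sr, embed_fn, win_len=2.0, win_shift=1.0):
    """Average windowed audio embeddings of one speaker's audio.

    `embed_fn(segment)` is a placeholder for any utterance-level embedder
    (e.g. a text-independent x-vector network returning a 512-d vector).
    """
    win, shift = int(win_len * sr), int(win_shift * sr)
    feats = []
    # at least one window, even for clips shorter than win_len
    for start in range(0, max(len(waveform) - win + 1, 1), shift):
        feats.append(embed_fn(waveform[start:start + win]))
    feats = np.stack(feats)
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-9)
    return feats.mean(axis=0)            # speaker-level audio feature
```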
In an embodiment of the present application, the identification of the speaker according to the visual characteristic and the audio characteristic in step S150 may include: respectively matching the visual features and the audio features with the features in the database to obtain a matching result of the visual features and a matching result of the audio features; determining a multi-mode fusion strategy according to the matching result of the visual characteristic and the matching result of the audio characteristic, and obtaining an identity recognition result of the speaker according to the determined multi-mode fusion strategy; wherein the multi-modal fusion strategy comprises: determining the identity recognition result of the speaker according to the matching result of the visual characteristic and the matching result of the audio characteristic; or determining the identification result of the speaker according to one of the matching result of the visual characteristic and the matching result of the audio characteristic.
In this embodiment, after the visual feature and the audio feature of a speaker are obtained, the two features are neither directly concatenated and then matched against the features in the database (scheme one), nor matched against the database separately and then weighted and fused (scheme two). The applicant found through experiments that the contribution of each modality differs across scenes: in near-field scenes with strong background noise, the visual features should be weighted more heavily; conversely, in far-field scenes with little background noise, the audio features should be weighted more heavily. Scheme one fuses the features directly, giving the two modalities equal contribution to the result, which is usually suboptimal; scheme two performs weighted fusion on the single-modality matching results, which partially alleviates the problem, but the fusion weights are preset hyper-parameters and cannot generalize to more complex scenes. Furthermore, when the data of one modality is extremely unreliable, for example in extremely noisy scenes or when the face is not visible, one of the modalities should be preferred rather than blindly fusing the multi-modal results. Based on these findings, the present application proposes a fusion strategy better suited to multi-modal conversation scenes, which decides whether to determine the identification result from both the visual matching result and the audio matching result, or from only one of them. This is described in further detail below.
In an embodiment of the present application, the identification of the speaker according to the visual characteristic and the audio characteristic in step S150 may include: matching the visual features with the features in the first database to obtain the first N identity marks matched with the visual features and the visual similarity corresponding to each identity mark, wherein N is a natural number and is more than or equal to 1; matching the audio features with features in a second database to obtain the first N identity identifications matched with the audio features and audio similarity corresponding to each identity identification, wherein N is a natural number and is more than or equal to 1; determining a multi-modal fusion strategy according to the matching result of the visual features and the matching result of the audio features, and obtaining an identification result of the speaker according to the determined multi-modal fusion strategy, which may include: when the same identification exists in the first N identifications matched with the visual features and the first N identifications matched with the audio features, calculating the weighted average value of the visual similarity and the audio similarity corresponding to the identification for each identification in the same identification, and determining the identification with the maximum weighted average value in the same identification as the identification result of the speaker; and when the same identification does not exist in the first N identifications matched with the visual features and the first N identifications matched with the audio features, determining the maximum value of the visual similarity and the audio similarity, and determining the identification corresponding to the maximum value as the identification result of the speaker.
The top-N identity identifications matched by the visual features and by the audio features may be denoted [V_1, V_2, ..., V_N] and [A_1, A_2, ..., A_N], respectively; the corresponding visual similarities and audio similarities may be denoted [T_V1, T_V2, ..., T_VN] and [T_A1, T_A2, ..., T_AN]. The multi-modal fusion strategy then performs decision-level fusion on the matching results, which can be expressed as:

ID = argmax over the common IDs of (w_v * T_V(ID) + w_a * T_A(ID)), if the two top-N lists share at least one ID;
ID = V_1 if T_V1 > T_A1, otherwise A_1, if the two top-N lists share no ID,

where w_v and w_a are the weighting coefficients of the visual and audio similarities. The meaning of the formula is as follows: if the same ID appears in the TOP-N identifications (IDs) matched by the audio features and by the visual features, the weighted similarity of each common ID is computed, and the ID with the highest fused similarity is the final recognition result; in addition, restricting to TOP-N removes targets whose similarity is particularly high in one modality but low in the other (described later in connection with the example shown in fig. 4), which improves accuracy. If no common ID exists in the two TOP-N lists, the matching result of one modality is unreliable and direct fusion would introduce errors, so the Top-1 targets of the two modalities are compared directly by similarity, i.e., if T_V1 > T_A1, the matched ID is V_1; otherwise it is A_1.
Fig. 4 illustrates an exemplary diagram of a multi-modal fusion strategy employed in a multi-modal speaker identity recognition method according to an embodiment of the present application. As shown in fig. 4, the top-3 visual matches and the top-3 audio matches are given, among which two IDs coincide, namely identity 2 and identity 5; the weighted fusion value of the visual and audio similarities is 0.845 for identity 2 and 0.775 for identity 5 (this example takes both weighting coefficients as 0.5). Identity 2 is therefore the final identification result.
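A compact sketch of this decision-level fusion, reproducing the fig. 4 outcome, might look as follows; the individual similarity values in the usage example are hypothetical and merely chosen so that identity 2 fuses to 0.845 and identity 5 to 0.775 with both weights at 0.5.

```python
def fuse_identities(visual_topn, audio_topn, w_visual=0.5, w_audio=0.5):
    """Decision-level fusion of top-N matching results (weights illustrative).

    `visual_topn` and `audio_topn` map identity id -> similarity for the
    top-N matches of each modality.
    """
    common = set(visual_topn) & set(audio_topn)
    if common:
        # IDs present in both top-N lists: weighted fusion of similarities
        fused = {i: w_visual * visual_topn[i] + w_audio * audio_topn[i]
                 for i in common}
        return max(fused, key=fused.get)
    # no common ID: one modality is unreliable, fall back to the better top-1
    best_v = max(visual_topn, key=visual_topn.get)
    best_a = max(audio_topn, key=audio_topn.get)
    return best_v if visual_topn[best_v] > audio_topn[best_a] else best_a

# Hypothetical similarities consistent with fig. 4: identity 2 fuses to
# 0.845 and identity 5 to 0.775, so identity 2 is returned; identity 9,
# high in audio only, is excluded by the top-N intersection.
# fuse_identities({2: 0.93, 5: 0.88, 7: 0.80}, {2: 0.76, 5: 0.67, 9: 0.95})
```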
Having described the method 100 for identifying the identity of a multi-modal speaker according to an embodiment of the present application by way of example, fig. 5 shows a schematic overview of the overall flow of the method for identifying the identity of a multi-modal speaker according to an embodiment of the present application, which can be understood based on the foregoing description in conjunction with fig. 5.
Based on the above description, according to the multi-modal speaker identity recognition method of the embodiment of the application, the multi-modal VAD technology and speaker identity recognition are fused, and audio features and visual features are matched; in addition, according to the multi-modal speaker identity recognition method disclosed by the embodiment of the application, the multi-modal information is used for speaker identity confirmation in a conversation scene, and a multi-modal feature fusion technology aiming at the conversation scene is provided, so that the accuracy of speaker identity recognition can be improved in a complex and various conversation scenes.
The solution of the present application, which uses multi-modal VAD techniques (the consistency of audio features and lip motion features) to determine the current speaker, enables more accurate discrimination than the several multi-modal solutions mentioned in the background. In addition, the present application places low demands on hardware: no customization or optimization is required, and only a single-channel pickup device and a camera are needed, which makes the solution easier to popularize. Furthermore, the multi-modal fusion strategy of the present application can improve identity recognition accuracy when one modality is unreliable.
A schematic block diagram of a multi-modal speaker identification apparatus 600 according to an embodiment of the present application is described below with reference to fig. 6. As shown in fig. 6, the apparatus 600 for recognizing a multimodal speaker may include a memory 610 and a processor 620, wherein the memory 610 stores a computer program executed by the processor 620, and the computer program, when executed by the processor 620, causes the processor 620 to perform the method for recognizing a multimodal speaker according to an embodiment of the present application as described above. The detailed operation of the apparatus 600 for recognizing the identity of a multi-modal speaker according to the embodiments of the present application can be understood by those skilled in the art in conjunction with the foregoing description, and for brevity, the detailed description is omitted here, and only some main operations of the processor 620 are described.
In one embodiment of the application, the computer program, when executed by the processor 620, causes the processor 620 to perform the steps of: acquiring video data and audio data of a session scene; performing face detection and lip detection on the video data to obtain sub-video data of participants and face frame data and lip frame sequences in the sub-video data; determining speakers in all the participants and audio data corresponding to the speakers according to the lip-shaped frame sequences of the participants and the audio data; extracting visual characteristics of the speaker according to the face frame data of the speaker, and extracting audio characteristics of the speaker according to audio data corresponding to the speaker; and identifying the identity of the speaker according to the visual characteristic and the audio characteristic.
In one embodiment of the present application, the computer program, when executed by the processor 620, causes the processor 620 to determine the speaker of all participants and the audio data corresponding to the speaker based on the sequences of the lip-boxes of the participants and the audio data, comprising: inputting the audio data of the conversation scene into a trained multi-modal speaker detection model in a sliding window mode; and polling the lip-shaped frame sequences in the sub-video data of all the participants by the trained multi-modal speaker detection model aiming at the audio data in each sliding window so as to determine the speaker corresponding to the audio data in each sliding window.
In one embodiment of the present application, the computer program, when executed by the processor 620, causes the processor 620 to perform the polling of the sequence of the lip-boxes in the sub-video data of all participants to determine the speaker corresponding to the audio data in each sliding window, comprising: performing the following operation on each frame data of the sub-video data of each participant: inputting M frame data before the frame data, the frame data and a lip-shaped frame sequence in the M frame data after the frame data into the trained multi-modal speaker detection model, wherein M is a natural number greater than 0; and extracting visual features from the lip-shaped frame sequence by the multi-modal speaker detection model, extracting audio features from the audio data in the sliding window, splicing and fusing the video features and the audio features, then extracting time sequence relation, and outputting the voice activation detection score of the frame data so as to determine whether the participant is the speaker corresponding to the audio data in the sliding window.
In one embodiment of the present application, the multimodal speaker detection model includes a visual feature extraction network, an audio feature extraction network, and a long short-term memory network.
In one embodiment of the present application, the computer program, when executed by the processor 620, causes the processor 620 to perform the identifying the speaker according to the visual characteristic and the audio characteristic, comprising: respectively matching the visual features and the audio features with features in a database to obtain matching results of the visual features and the audio features; determining a multi-mode fusion strategy according to the matching result of the visual characteristic and the matching result of the audio characteristic, and obtaining an identity recognition result of the speaker according to the determined multi-mode fusion strategy; wherein the multi-modal fusion strategy comprises: determining an identification result of the speaker according to both the matching result of the visual feature and the matching result of the audio feature; or determining the identification result of the speaker according to one of the matching result of the visual characteristic and the matching result of the audio characteristic.
In one embodiment of the present application, the computer program, when executed by the processor 620, causes the processor 620 to perform the identifying the speaker according to the visual characteristic and the audio characteristic, comprising: matching the visual features with features in a first database to obtain the first N identification marks matched with the visual features and the visual similarity corresponding to each identification mark, wherein N is a natural number and is greater than or equal to 1; matching the audio features with features in a second database to obtain the first N identity identifications matched with the audio features and audio similarity corresponding to each identity identification, wherein N is a natural number and is greater than or equal to 1; the determining a multi-modal fusion strategy according to the matching result of the visual features and the matching result of the audio features, and obtaining the identification result of the speaker according to the determined multi-modal fusion strategy, includes: when the same identification exists in the first N identifications matched with the visual features and the first N identifications matched with the audio features, calculating the weighted average value of the visual similarity and the audio similarity corresponding to each identification in the same identification, and determining the identification with the largest weighted average value in the same identifications as the identification result of the speaker; and when the same identification does not exist in the first N identifications matched with the visual features and the first N identifications matched with the audio features, determining the maximum value of the visual similarity and the audio similarity, and determining the identification corresponding to the maximum value as the identification result of the speaker.
In one embodiment of the present application, the computer program, when executed by the processor 620, causes the processor 620 to perform the extracting the visual feature of the speaker according to the face frame data of the speaker, including: extracting the characteristics of each face frame in the sub-video data of the speaker to obtain the visual characteristics of each face frame; and averaging the visual characteristics of all the face frames in the sub-video data of the speaker to obtain the visual characteristics of the speaker.
In an embodiment of the present application, when executed by the processor 620, the computer program causes the processor 620 to perform the extracting the audio feature of the speaker according to the audio data corresponding to the speaker, including: performing sliding window processing on the audio data corresponding to the speaker; extracting audio features from the audio data in each sliding window; and averaging the audio features of the audio data in all the sliding windows of the audio data corresponding to the speaker to obtain the audio features of the speaker.
In one embodiment of the present application, the video data and the audio data are video data and audio data of the whole conference, or the video data and the audio data are data collected in real time during the conference.
A schematic block diagram of a multi-modal speaker identification apparatus 700 according to an embodiment of the present application is described below with reference to fig. 7. As shown in fig. 7, the multimodal speaker identification apparatus 700 may include an image capturing device 710, a sound collecting device 720 and an identity recognition device 730, wherein the image capturing device 710 is used for capturing conference video data, the sound collecting device 720 is used for capturing conference audio data, and the identity recognition device 730 is the multimodal speaker identification device 600 described above, which is used for performing multimodal speaker identification based on the conference video data and the conference audio data. The detailed operation of the multi-modal speaker identification device 700 according to the embodiments of the present application can be understood by those skilled in the art in light of the foregoing description, and therefore, for brevity, will not be described in detail herein.
As stated above, the solution of the present application places low demands on hardware: no customization or optimization is required, and only a single-channel pickup device and a camera are needed, so the solution is easy to popularize. Thus, in one embodiment of the present application, the sound collecting device 720 may be a single-channel microphone array.
Further, according to an embodiment of the present application, there is also provided a storage medium on which program instructions are stored, which when executed by a computer or a processor are used for executing the corresponding steps of the method for recognizing the identity of the multi-modal speaker according to the embodiment of the present application. The storage medium may include, for example, a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), a USB memory, or any combination of the above storage media. The storage medium may be any combination of one or more computer-readable storage media.
Furthermore, according to an embodiment of the present application, there is provided a computer program, which is executed by a computer or a processor to perform the corresponding steps of the method for recognizing the identity of a multi-modal speaker according to an embodiment of the present application.
Based on the above description, according to the method, the device and the equipment for identifying the identity of the multi-modal speaker in the embodiment of the application, the multi-modal VAD technology is fused with the identity identification of the speaker, and the audio characteristics are matched with the visual characteristics; in addition, according to the method, the device and the equipment for recognizing the identity of the multi-modal speaker, the identity of the speaker is confirmed by using multi-modal information in a conversation scene, a multi-modal feature fusion technology aiming at the conversation scene is provided, and the accuracy of the identity recognition of the speaker can be improved in a complex and various conversation scenes.
Although the example embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the above-described example embodiments are merely illustrative and are not intended to limit the scope of the present application thereto. Various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the scope or spirit of the present application. All such changes and modifications are intended to be included within the scope of the present application as claimed in the appended claims.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another device, or some features may be omitted, or not executed.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the description of exemplary embodiments of the present application, various features of the present application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the application and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure is not to be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
It will be understood by those skilled in the art that all of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where such features are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some of the modules according to embodiments of the present application. The present application may also be embodied as apparatus programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present application may be stored on a computer readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and the like does not indicate any ordering; these words may be interpreted as names.
The above description is only of specific embodiments of the present application, and the protection scope of the present application is not limited thereto. Any changes or substitutions that can be easily conceived by a person skilled in the art within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. The protection scope of the present application shall be subject to the protection scope of the claims.

Claims (13)

1. A multi-modal speaker identity recognition method, the method comprising:
acquiring video data and audio data of a session scene;
performing face detection and lip detection on the video data to obtain sub-video data of the participants, as well as face frame data and lip frame sequences in the sub-video data;
determining speakers among all the participants, and the audio data corresponding to the speakers, according to the lip frame sequences of the participants and the audio data;
extracting visual features of the speaker according to the face frame data of the speaker, and extracting audio features of the speaker according to the audio data corresponding to the speaker;
and identifying the identity of the speaker according to the visual features and the audio features.
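For orientation only, the following Python sketch shows one way the five steps of claim 1 could be composed; it is not the claimed implementation. Every callable it receives (the face/lip detector, the speaker detector, the two feature extractors, and the identifier) is an assumed placeholder supplied by the caller, since the claim does not fix any concrete model. Plausible candidates for these placeholders are sketched after claims 4, 6, 7 and 8 below.

```python
# Hypothetical orchestration of the steps of claim 1. None of the callables
# below are specified by the claim; they are assumptions injected by the caller.
from typing import Any, Callable, Dict, List, Sequence

def recognize_speaker_identities(
    video_data: Sequence[Any],
    audio_data: Any,
    detect_faces_and_lips: Callable[[Sequence[Any]], List[Dict]],
    detect_speakers: Callable[[List[Dict], Any], List[Dict]],
    extract_visual_features: Callable[[Dict], Any],
    extract_audio_features: Callable[[Dict], Any],
    identify: Callable[[Any, Any], str],
) -> List[str]:
    # per-participant sub-video with face frame data and lip frame sequences
    participants = detect_faces_and_lips(video_data)
    # which participants are speaking, and which audio segments are theirs
    speakers = detect_speakers(participants, audio_data)
    identities = []
    for speaker in speakers:
        visual = extract_visual_features(speaker)   # cf. claim 7
        audio = extract_audio_features(speaker)     # cf. claim 8
        identities.append(identify(visual, audio))  # cf. claims 5-6
    return identities
```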
2. The method of claim 1, wherein determining the speakers among all the participants and the audio data corresponding to the speakers based on the lip frame sequences of the participants and the audio data comprises:
inputting the audio data of the session scene into a trained multi-modal speaker detection model in a sliding-window manner;
and polling, by the trained multi-modal speaker detection model, the lip frame sequences in the sub-video data of all the participants for the audio data in each sliding window, so as to determine the speaker corresponding to the audio data in each sliding window.
3. The method of claim 2, wherein polling the lip frame sequences in the sub-video data of all the participants to determine the speaker corresponding to the audio data in each sliding window comprises: performing the following operation on each frame of the sub-video data of each participant:
inputting, into the trained multi-modal speaker detection model, the lip frame sequence in the M frames of data before the frame of data, the frame of data itself, and the M frames of data after the frame of data, wherein M is a natural number greater than 0;
and extracting, by the multi-modal speaker detection model, visual features from the lip frame sequence and audio features from the audio data in the sliding window, concatenating and fusing the visual features and the audio features, modeling the temporal relationship, and outputting a voice activity detection score for the frame of data, so as to determine whether the participant is the speaker corresponding to the audio data in the sliding window.
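A minimal sketch of the polling described in claims 2 and 3 follows, assuming numpy arrays of lip boxes and an externally supplied scoring callable standing in for the trained detector. Averaging the per-frame scores into a per-window decision and the 0.5 threshold are illustrative assumptions, not part of the claims.

```python
# Sketch of the polling logic of claims 2-3: for each audio sliding window,
# every participant's lip frame sequence is scored frame by frame with M frames
# of context on each side. `model` is any callable mapping
# (lip_box_window, audio_window) -> voice-activity score (an assumption).
import numpy as np

def poll_participants(participants, audio_windows, model, M=2, threshold=0.5):
    """participants: list of dicts, each with a 'lip_boxes' array of shape (T, 4)."""
    speakers_per_window = []
    for audio_win in audio_windows:
        active = []
        for pid, person in enumerate(participants):
            lips = person["lip_boxes"]
            scores = []
            for t in range(len(lips)):
                lo, hi = max(0, t - M), min(len(lips), t + M + 1)
                window = lips[lo:hi]                      # current frame +/- M frames
                scores.append(model(window, audio_win))   # per-frame VAD score
            if np.mean(scores) > threshold:               # assumed decision rule
                active.append(pid)
        speakers_per_window.append(active)
    return speakers_per_window
```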
4. The method of claim 3, wherein the multi-modal speaker detection model comprises a video feature extraction network, an audio feature extraction network, and a long short-term memory (LSTM) network.
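The following PyTorch sketch illustrates one plausible reading of the architecture named in claim 4: a visual branch over lip-box crops, an audio branch, and an LSTM over the concatenated sequence. All layer types and dimensions are assumptions; the application does not disclose them.

```python
# Illustrative multi-modal speaker detection model per claim 4; shapes are assumed.
import torch
import torch.nn as nn

class MultiModalSpeakerDetector(nn.Module):
    def __init__(self, visual_dim=128, audio_dim=128, hidden=256):
        super().__init__()
        self.visual_net = nn.Sequential(               # per-frame lip-crop encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, visual_dim))
        self.audio_net = nn.Sequential(                # per-window spectrogram encoder
            nn.Conv1d(40, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, audio_dim))
        self.lstm = nn.LSTM(visual_dim + audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)               # per-frame voice-activity score

    def forward(self, lip_crops, audio_feats):
        # lip_crops: (B, T, 3, H, W); audio_feats: (B, T, 40, L)
        B, T = lip_crops.shape[:2]
        v = self.visual_net(lip_crops.flatten(0, 1)).view(B, T, -1)
        a = self.audio_net(audio_feats.flatten(0, 1)).view(B, T, -1)
        fused, _ = self.lstm(torch.cat([v, a], dim=-1))    # concatenate, then model time
        return torch.sigmoid(self.head(fused)).squeeze(-1)  # (B, T) scores
```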
5. The method of claim 1, wherein the identifying the identity of the speaker according to the visual features and the audio features comprises:
matching the visual features and the audio features respectively against features in a database to obtain a matching result of the visual features and a matching result of the audio features;
determining a multi-modal fusion strategy according to the matching result of the visual features and the matching result of the audio features, and obtaining an identity recognition result of the speaker according to the determined multi-modal fusion strategy;
wherein the multi-modal fusion strategy comprises: determining the identity recognition result of the speaker according to both the matching result of the visual features and the matching result of the audio features; or determining the identity recognition result of the speaker according to one of the matching result of the visual features and the matching result of the audio features.
6. The method of claim 1 or 5, wherein the identifying the identity of the speaker according to the visual features and the audio features comprises:
matching the visual features against features in a first database to obtain the top N identity identifications matching the visual features and the visual similarity corresponding to each identity identification, wherein N is a natural number greater than or equal to 1;
matching the audio features against features in a second database to obtain the top N identity identifications matching the audio features and the audio similarity corresponding to each identity identification, wherein N is a natural number greater than or equal to 1;
and the determining a multi-modal fusion strategy according to the matching result of the visual features and the matching result of the audio features, and obtaining the identity recognition result of the speaker according to the determined multi-modal fusion strategy, comprises:
when one or more identical identity identifications exist in both the top N identity identifications matching the visual features and the top N identity identifications matching the audio features, calculating a weighted average of the visual similarity and the audio similarity corresponding to each of the identical identity identifications, and determining the identity identification with the largest weighted average among the identical identity identifications as the identity recognition result of the speaker;
and when no identical identity identification exists in the top N identity identifications matching the visual features and the top N identity identifications matching the audio features, determining the maximum value among the visual similarities and the audio similarities, and determining the identity identification corresponding to the maximum value as the identity recognition result of the speaker.
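A compact sketch of the fusion rule in claims 5 and 6 follows, assuming cosine similarity, top-N retrieval from two enrolled databases, and equal weights; only the weighted-average-over-shared-identities logic and the fall-back to the single best match are taken from the claims. The names face_db and voice_db are illustrative.

```python
# Sketch of the multi-modal fusion decision of claims 5-6. Cosine similarity,
# the value of N and the 0.5/0.5 weights are assumptions.
import numpy as np

def top_n_matches(query, database, n=5):
    """database: dict {identity: enrolled feature vector}. Returns [(identity, similarity)]."""
    sims = {k: float(np.dot(query, v) / (np.linalg.norm(query) * np.linalg.norm(v)))
            for k, v in database.items()}
    return sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:n]

def fuse_identity(visual_feat, audio_feat, face_db, voice_db, n=5, w_v=0.5, w_a=0.5):
    vis = dict(top_n_matches(visual_feat, face_db, n))
    aud = dict(top_n_matches(audio_feat, voice_db, n))
    shared = vis.keys() & aud.keys()
    if shared:
        # identities returned by both modalities: pick the largest weighted average
        return max(shared, key=lambda i: w_v * vis[i] + w_a * aud[i])
    # no overlap: fall back to the single most confident modality
    candidates = list(vis.items()) + list(aud.items())
    return max(candidates, key=lambda kv: kv[1])[0]
```

Called with the averaged speaker features described in claims 7 and 8, fuse_identity returns a single enrolled identity, or the best single-modality match when the two top-N lists do not intersect.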
7. The method of claim 1, wherein extracting visual features of the speaker from the face frame data of the speaker comprises:
extracting features from each face frame in the sub-video data of the speaker to obtain the visual features of each face frame;
and averaging the visual features of all the face frames in the sub-video data of the speaker to obtain the visual features of the speaker.
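Claim 7 reduces to an average over per-frame face embeddings; a short sketch follows, where embed_face stands for any assumed face-embedding model and is not specified by the claim.

```python
# Claim 7 as a small helper: average per-face-frame embeddings into one speaker-level vector.
import numpy as np

def speaker_visual_feature(face_crops, embed_face):
    per_frame = np.stack([embed_face(crop) for crop in face_crops])  # (T, D)
    return per_frame.mean(axis=0)                                    # (D,)
```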
8. The method according to claim 1, wherein the extracting the audio features of the speaker according to the audio data corresponding to the speaker comprises:
performing sliding window processing on the audio data corresponding to the speaker;
extracting audio features from the audio data in each sliding window;
and averaging the audio features extracted from all the sliding windows of the audio data corresponding to the speaker to obtain the audio features of the speaker.
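Claim 8 is the audio counterpart: embed each sliding window and average. In the sketch below, the window and hop lengths (1 s and 0.5 s at 16 kHz) and the embed_audio callable are assumptions.

```python
# Claim 8 as a small helper: sliding-window audio embeddings averaged into one vector.
import numpy as np

def speaker_audio_feature(waveform, embed_audio, win=16000, hop=8000):
    starts = range(0, max(1, len(waveform) - win + 1), hop)
    per_window = np.stack([embed_audio(waveform[s:s + win]) for s in starts])  # (num_windows, D)
    return per_window.mean(axis=0)                                             # (D,)
```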
9. The method of claim 1, wherein the video data and the audio data are the video data and audio data of an entire conference, or are data collected in real time during a conference.
10. A multi-modal speaker identity recognition apparatus, comprising a memory and a processor, the memory having stored thereon a computer program for execution by the processor, wherein the computer program, when executed by the processor, causes the processor to perform the multi-modal speaker identity recognition method of any one of claims 1 to 9.
11. A multi-modal speaker identity recognition device, comprising an image capturing apparatus for capturing conference video data, a sound pickup apparatus for capturing conference audio data, and the multi-modal speaker identity recognition apparatus of claim 10 for performing multi-modal speaker identity recognition based on the conference video data and the conference audio data.
12. The device of claim 11, wherein the sound pickup apparatus is a single-channel microphone array.
13. A storage medium having stored thereon program instructions which, when executed by a computer or a processor, perform the multi-modal speaker identity recognition method of any one of claims 1 to 9.
CN202111092312.6A 2021-09-17 2021-09-17 Method, device and equipment for identifying identity of multi-modal speaker Pending CN113920560A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111092312.6A CN113920560A (en) 2021-09-17 2021-09-17 Method, device and equipment for identifying identity of multi-modal speaker

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111092312.6A CN113920560A (en) 2021-09-17 2021-09-17 Method, device and equipment for identifying identity of multi-modal speaker

Publications (1)

Publication Number Publication Date
CN113920560A true CN113920560A (en) 2022-01-11

Family

ID=79235330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111092312.6A Pending CN113920560A (en) 2021-09-17 2021-09-17 Method, device and equipment for identifying identity of multi-modal speaker

Country Status (1)

Country Link
CN (1) CN113920560A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116405635A (en) * 2023-06-02 2023-07-07 山东正中信息技术股份有限公司 Multi-mode conference recording method and system based on edge calculation
CN116758902A (en) * 2023-06-01 2023-09-15 镁佳(北京)科技有限公司 Audio and video recognition model training and recognition method under multi-person speaking scene


Similar Documents

Publication Publication Date Title
CN105512348B (en) For handling the method and apparatus and search method and device of video and related audio
TWI706268B (en) Identity authentication method and device
CN112074901B (en) Speech recognition login
CN110082723B (en) Sound source positioning method, device, equipment and storage medium
CN108305615B (en) Object identification method and device, storage medium and terminal thereof
US9595259B2 (en) Sound source-separating device and sound source-separating method
CN104361276B (en) A kind of multi-modal biological characteristic identity identifying method and system
CN106709402A (en) Living person identity authentication method based on voice pattern and image features
CN107274916B (en) Method and device for operating audio/video file based on voiceprint information
CN112997186A (en) Detection system for' viability
JP2001092974A (en) Speaker recognizing method, device for executing the same, method and device for confirming audio generation
CN104376250A (en) Real person living body identity verification method based on sound-type image feature
Sahoo et al. Emotion recognition from audio-visual data using rule based decision level fusion
CN113920560A (en) Method, device and equipment for identifying identity of multi-modal speaker
EP3772016B1 (en) Method and apparatus for entering human face information into database
CN112037788B (en) Voice correction fusion method
CN111341350A (en) Man-machine interaction control method and system, intelligent robot and storage medium
CN108665901B (en) Phoneme/syllable extraction method and device
CN113270112A (en) Electronic camouflage voice automatic distinguishing and restoring method and system
JP7347511B2 (en) Audio processing device, audio processing method, and program
CN114495946A (en) Voiceprint clustering method, electronic device and storage medium
CN114466179A (en) Method and device for measuring synchronism of voice and image
CN114494930A (en) Training method and device for voice and image synchronism measurement model
EP3613040B1 (en) Speaker recognition method and system
Ibrahim A novel lip geometry approach for audio-visual speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination