CN112906544A - Voiceprint and face-based matching method suitable for multiple targets - Google Patents

Voiceprint and face-based matching method suitable for multiple targets

Info

Publication number
CN112906544A
CN112906544A (application CN202110174056.9A)
Authority
CN
China
Prior art keywords
face
voiceprint
cepstrum
matching
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110174056.9A
Other languages
Chinese (zh)
Inventor
梁哲辉
梁东贵
曾宪毅
李紫楠
李韫莛
陈敏
熊伟
陈光辉
李莹
李永恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Original Assignee
Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd filed Critical Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority to CN202110174056.9A priority Critical patent/CN112906544A/en
Publication of CN112906544A publication Critical patent/CN112906544A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/02Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/18Artificial neural networks; Connectionist approaches
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Quality & Reliability (AREA)
  • Collating Specific Patterns (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a voiceprint and face-based matching method suitable for multiple targets, relating to pattern matching and multi-target matching for service robots and similar devices in a business hall environment. The method comprises the following steps: (1) extracting sound segments within a period of time, performing feature extraction, and performing speaker separation on the feature-extracted mixed sound source to obtain voiceprint cepstrums of multiple targets within that period; (2) extracting moving images over the same period, performing multi-face recognition on the images, focusing on the change of width between the mouth corners of each face, and extracting image features to obtain moving-image cepstrums of the different faces within that period; (3) performing similarity matching between the voiceprint cepstrums and the face cepstrums, regarding a successfully matched target as a service object and an unmatched target as abnormal. The invention overcomes two limitations of existing biometric identification and authentication, namely reliance on a single type of biometric information and the need to enroll that information in a database in advance, so that new targets can be recognized while multi-target matching is guaranteed.

Description

Voiceprint and face-based matching method suitable for multiple targets
Technical Field
The invention belongs to the field of pattern recognition and multi-target matching, and particularly relates to a voiceprint and face-based matching method suitable for multiple targets, intended for service robots and similar devices in a business hall environment.
Background
With the development of artificial intelligence, biometric identification has received increasing attention. However, current biometric methods on the market rely on a single modality, or can only authenticate biometric information already stored in a database, and therefore cannot be applied to complex scenes that require multi-target authentication and matching.
Most existing voiceprint- and face-based authentication methods on the market require the user's voiceprint and face information to be enrolled in a database in advance; during authentication, newly collected voiceprint and face information is compared with the enrolled information, and authentication succeeds if they are consistent. Because such methods require prior enrollment of biometric information, they can neither recognize new users nor match multiple users, and thus struggle with multi-target matching in complex scenes (such as a service business hall), where the user to be served must be located accurately.
When the service robot faces multiple users who are speaking, conventional techniques cannot distinguish which of them is the user to be served. The invention introduces voiceprint recognition and face recognition, determines the target service object by comparing the voiceprint variation of each speaking user with that user's facial variation, prevents the service from being interfered with by irrelevant speech, accurately recognizes the target user in a complex multi-user environment, and supports continuous interaction.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art, mainly through the following technical scheme:
a voiceprint and face based matching method suitable for multiple targets is characterized by comprising the following steps:
(1) extracting sound segments within a period of time, performing feature extraction, and performing speaker separation on the feature-extracted mixed sound source to obtain voiceprint cepstrums of multiple targets within that period;
(2) extracting the moving images over the same period of time, performing multi-face recognition on the images, focusing on the width change between the mouth corners of each face, and extracting image features to obtain the moving-image cepstrums of the different faces within that period;
(3) performing similarity matching between the voiceprint cepstrums and the face cepstrums, regarding a successfully matched target as a service object and an unmatched target as abnormal.
In the step (1), the sound segments within a period of time are extracted, and the specific method for extracting the features comprises the following steps:
the method comprises the steps of collecting sound in a period of time by using a microphone array, converting the sound of a speaker into feature vectors by using a deep neural network formed by continuous application of a plurality of nonlinear functions for the extracted sound segments, wherein one speaker corresponds to a plurality of feature vectors. The voiceprint characteristics can be obtained by:
P(spk | x) = exp(y_spk) / Σ_{s ∈ S} exp(y_s)
wherein y represents the activation vector of the last hidden layer of the deep neural network, spk represents the current speaker, and S represents the set of all possible speakers contained in the presented sound clip.
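For illustration only, and not as part of the disclosed embodiment, the following minimal Python sketch shows a feed-forward network of stacked nonlinearities whose last hidden activation is taken as the voiceprint feature, with a softmax over an assumed candidate speaker set; all layer sizes and the random weights are placeholders:

import numpy as np

def dnn_voiceprint(frames, weights):
    # Forward pass of a small DNN built from successive nonlinear layers.
    # The activation of the last hidden layer is used as the voiceprint feature.
    h = frames
    for W, b in weights[:-1]:
        h = np.maximum(0.0, h @ W + b)             # ReLU hidden layers
    y = h                                          # last hidden activation (voiceprint feature)
    W_out, b_out = weights[-1]
    logits = y @ W_out + b_out                     # scores over the assumed speaker set
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)             # softmax P(spk | frame)
    return y, p

# Usage with placeholder dimensions: 40-dim acoustic frames, 8 candidate speakers.
rng = np.random.default_rng(0)
dims = [40, 256, 256, 8]
weights = [(0.1 * rng.standard_normal((a, b)), np.zeros(b))
           for a, b in zip(dims[:-1], dims[1:])]
features, speaker_probs = dnn_voiceprint(rng.standard_normal((100, 40)), weights)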
In the step (1), the specific method for separating the speakers from the mixed sound source after feature extraction is as follows:
The voiceprint features obtained above are mixed and may contain different voiceprint fragment features of the same speaker. To separate speakers from the mixed voiceprint features, a supervised recurrent neural network with shared parameters is used to obtain the probability that different voiceprint fragment features belong to the same speaker; when the probability exceeds a threshold, those fragment features are considered to belong to that speaker. The probability can be obtained by:
P(y | X) = Π_{t=1}^{T} P(y_t | x_1, …, x_t, y_1, …, y_{t-1})
wherein X = (x_1, x_2, …, x_T) represents the voiceprint fragment features obtained in the previous step, and y = (y_1, y_2, …, y_T) represents the speaker labels corresponding to those fragment features. According to the result of speaker separation, the voiceprint fragment features belonging to the same speaker are spliced together in time order to form the complete voiceprint features of each speaker within the period.
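As an illustrative stand-in only: the disclosure uses a parameter-sharing supervised recurrent network to score whether fragments belong to the same speaker, which is not reproduced here; the sketch below substitutes a simple cosine-similarity threshold to show the grouping and the time-ordered splicing step, and the threshold value is an assumption:

import numpy as np

def assign_speakers(fragments, threshold=0.7):
    # Greedy grouping: a fragment joins the existing speaker whose centroid is most
    # similar if that similarity exceeds the threshold, else it starts a new speaker.
    # The cosine-similarity score stands in for the supervised RNN probability.
    centroids, labels = [], []
    for f in fragments:
        sims = [float(f @ c) / (np.linalg.norm(f) * np.linalg.norm(c) + 1e-9)
                for c in centroids]
        if sims and max(sims) > threshold:
            k = int(np.argmax(sims))
            centroids[k] = 0.5 * (centroids[k] + f)   # update running centroid
        else:
            k = len(centroids)
            centroids.append(np.asarray(f, dtype=float))
        labels.append(k)
    return labels

def splice_by_speaker(fragments, labels):
    # Concatenate, in time order, the fragment features that share a speaker label.
    grouped = {}
    for f, k in zip(fragments, labels):
        grouped.setdefault(k, []).append(f)
    return {k: np.concatenate(v) for k, v in grouped.items()}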
In the step (1), the specific method for obtaining the voiceprint cepstrum of the multiple targets in the period of time is as follows:
and carrying out spectrum analysis on the complete voiceprint characteristics of each speaker in a period of time obtained by the steps based on the Mel cepstrum coefficient to form a result which is more beneficial to comparison and visualization. The voiceprint characteristics are subjected to Fourier transform, then a triangular window function is utilized to map the frequency spectrum to the Mel scale, logarithm is taken, then discrete cosine conversion is carried out, and the voiceprint cepstrums of a plurality of targets in the period of time are obtained. The derivation of the coefficients of the mel-frequency cepstrum is obtained by:
X[k] = Σ_{n=0}^{N-1} x[n] · e^{-j2πkn/N},  k = 0, …, N-1
S[m] = Σ_{k=0}^{N-1} |X[k]|^2 · H_m[k],  m = 1, …, M
c[n] = Σ_{m=1}^{M} log(S[m]) · cos[πn(m - 0.5)/M],  n = 1, …, L
wherein x[n] is the framed voiceprint signal, H_m[k] is the m-th triangular Mel filter, S[m] is the corresponding filter-bank energy, and c[n] are the resulting Mel-frequency cepstral coefficients.
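For illustration, a minimal single-frame Mel-cepstrum computation consistent with the chain above (Fourier transform, triangular Mel filter bank, logarithm, discrete cosine transform); the sampling rate, filter count, and coefficient count are assumptions:

import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr=16000, n_filters=26, n_ceps=13):
    # Power spectrum X[k] of one windowed frame.
    spec = np.abs(np.fft.rfft(frame)) ** 2
    n_fft = len(frame)
    # Triangular filters H_m[k], equally spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, spec.shape[0]))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    energies = np.maximum(fbank @ spec, 1e-10)            # filter-bank energies S[m]
    return dct(np.log(energies), norm='ortho')[:n_ceps]   # cepstral coefficients c[n]

# Usage on a 512-sample synthetic frame.
coeffs = mfcc_frame(np.random.default_rng(0).standard_normal(512))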
in the step (2), a specific method for extracting a moving image in the same time and performing multi-face recognition on the image comprises the following steps:
the method comprises the steps of collecting images within a period of time by using a camera array to form a video, carrying out face recognition on an image sequence, marking key points on the corners of mouths in the faces, then segmenting multiple targets in the same image, and storing an independent image sequence for each face.
In the step (2), for each face image, attention is focused on the width change between the mouth corners of the face; the specific method for extracting the image features is as follows:
Feature extraction is performed on the image sequence of each face obtained above; a convolutional neural network extracts features from the region around the mouth-corner key points, giving a feature-face sequence of mouth-corner changes for each face. The operation of the convolutional neural network is represented by:
y_{i,j} = f( Σ_m Σ_{u=0}^{s-1} Σ_{v=0}^{s-1} w_{m,u,v} · x_{m,i+u,j+v} + b )
wherein m indexes the feature maps of the previous layer, s is the size of the convolution kernel, and u and v index the shared weights w.
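A minimal single-channel convolution sketch matching the shared-weight sum over u and v in the formula above; the ReLU nonlinearity, kernel size, and patch size are assumptions for illustration:

import numpy as np

def conv2d(x, w, b=0.0):
    # Valid 2-D convolution of a single-channel patch x with one shared s x s kernel w.
    s = w.shape[0]
    H, W = x.shape
    out = np.zeros((H - s + 1, W - s + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(w * x[i:i + s, j:j + s]) + b
    return np.maximum(out, 0.0)       # ReLU activation (assumed)

# Usage on a 24 x 24 mouth-corner patch with a 3 x 3 kernel (sizes are placeholders).
rng = np.random.default_rng(0)
feature_map = conv2d(rng.random((24, 24)), 0.1 * rng.random((3, 3)))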
In the step (2), the specific method for obtaining the moving image cepstrum of different faces in the period of time is as follows:
and performing motion analysis on the characteristic face sequence obtained in the step by using an interframe difference method, wherein each face corresponds to a series of characteristic vectors, and performing spectrum analysis based on a Mel-cepstrum coefficient on the characteristic vectors. The interframe difference method is represented by the following equation:
D(x, y) = 1 if |I_t(x, y) - I_{t-1}(x, y)| > T, otherwise D(x, y) = 0
wherein D(x, y) is the difference image between two consecutive frames, I_t and I_{t-1} are the images at time t and time t-1 respectively, T is the threshold selected when binarizing the difference image, D(x, y) = 1 represents foreground, and D(x, y) = 0 represents background.
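An illustrative sketch of the inter-frame difference, reducing each face's motion to a per-frame foreground ratio before the Mel-cepstrum analysis; the threshold value and the foreground-ratio reduction are assumptions:

import numpy as np

def frame_difference(prev_frame, cur_frame, T=25):
    # D(x, y) = 1 (foreground) where |I_t - I_{t-1}| > T, else 0 (background).
    diff = np.abs(cur_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return (diff > T).astype(np.uint8)

def motion_signal(face_sequence, T=25):
    # One value per frame pair: fraction of foreground pixels in the mouth-corner region.
    return np.asarray([frame_difference(a, b, T).mean()
                       for a, b in zip(face_sequence[:-1], face_sequence[1:])])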
In the step (3), similarity matching is performed between the voiceprint cepstrums and the face cepstrums, a successfully matched target is regarded as a service object, and an unmatched target is regarded as abnormal; the specific method is as follows:
A threshold is set. When the similarity error between a voiceprint cepstrum and a face cepstrum is smaller than the threshold, the voiceprint and the face whose cepstrums have the greatest similarity are regarded as a match, and the successfully matched target is regarded as a service object. When the similarity error between a certain voiceprint cepstrum and every face cepstrum is greater than the threshold, or the similarity error between a certain face cepstrum and every voiceprint cepstrum is greater than the threshold, matching is unsuccessful and the target is regarded as abnormal; the reason may be that the target's voice was collected by the microphone array but the target is outside the collection range of the camera array, or that the target's face was collected by the camera array but the target did not speak. The similarity between a voiceprint cepstrum and a face cepstrum is measured by the cosine similarity:
cos(A, B) = (A · B) / (‖A‖ · ‖B‖) = Σ_i A_i B_i / ( sqrt(Σ_i A_i^2) · sqrt(Σ_i B_i^2) )
wherein A is a voiceprint cepstrum and B is a face cepstrum.
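For illustration, a greedy matching sketch using the cosine similarity above; it assumes the voiceprint and face cepstra have been resampled to a common length, and it replaces the similarity-error threshold of the disclosure with an equivalent minimum-similarity test whose value is an assumption:

import numpy as np

def cosine(a, b):
    return float(a @ b) / (float(np.linalg.norm(a) * np.linalg.norm(b)) + 1e-9)

def match_targets(voice_ceps, face_ceps, min_sim=0.8):
    # Pair each voiceprint cepstrum with its most similar unused face cepstrum.
    # Pairs below min_sim, and leftovers on either side, are treated as abnormal.
    matches, abnormal, used = [], [], set()
    for vi, v in enumerate(voice_ceps):
        sims = [cosine(v, f) if fi not in used else -1.0
                for fi, f in enumerate(face_ceps)]
        fi = int(np.argmax(sims)) if sims else -1
        if fi >= 0 and sims[fi] >= min_sim:
            matches.append((vi, fi))             # service object
            used.add(fi)
        else:
            abnormal.append(('voiceprint', vi))  # heard, but no matching face
    abnormal += [('face', fi) for fi in range(len(face_ceps)) if fi not in used]
    return matches, abnormal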
Drawings
FIG. 1 is a general flowchart of a voiceprint and face based matching method for multiple targets according to an embodiment;
fig. 2 is an application scene diagram of a voiceprint and face-based matching method suitable for multiple targets according to an embodiment.
Detailed Description
The following describes in detail a specific embodiment of a voiceprint and face based matching method for multiple targets according to the present invention with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
Examples
Fig. 1 is an overall flow diagram of the voiceprint and face-based matching method suitable for multiple targets according to the embodiment, and Fig. 2 is an application scenario diagram of the method.
A voiceprint and face based matching method suitable for multiple targets comprises the following steps:
1. and the service robot extracts the sound segments within a period of time, performs feature extraction, and performs speaker separation on the mixed sound source with the extracted features to obtain voiceprint cepstrums of a plurality of targets within the period of time.
1.1 the built-in microphone array of the service robot is used for collecting the sound in a period of time, for the extracted sound segment, the deep neural network formed by continuous application of a plurality of nonlinear functions is used for converting the sound of a speaker into feature vectors, one speaker corresponds to a plurality of feature vectors, the process can realize the elimination of the environmental noise, and only the voiceprint feature of the speaker is extracted. The voiceprint characteristics can be obtained by:
P(spk | x) = exp(y_spk) / Σ_{s ∈ S} exp(y_s)
wherein y represents the activation vector of the last hidden layer of the deep neural network, spk represents the current speaker, and S represents the set of all possible speakers contained in the presented sound clip.
1.2 The voiceprint features obtained in 1.1 are mixed and may contain different voiceprint fragment features of the same speaker. To separate speakers from the mixed voiceprint features, a supervised recurrent neural network with shared parameters is used to obtain the probability that different voiceprint fragment features belong to the same speaker; when the probability exceeds a threshold, those fragment features are considered to belong to that speaker. The probability can be obtained by:
P(y | X) = Π_{t=1}^{T} P(y_t | x_1, …, x_t, y_1, …, y_{t-1})
wherein X = (x_1, x_2, …, x_T) represents the voiceprint fragment features obtained in 1.1, and y = (y_1, y_2, …, y_T) represents the speaker labels corresponding to those fragment features. According to the result of speaker separation, the voiceprint fragment features belonging to the same speaker are spliced together in time order to form the complete voiceprint features of each speaker within the period.
1.3 Spectrum analysis based on Mel-frequency cepstral coefficients is performed on the complete voiceprint features of each speaker over the period obtained in 1.2, producing a result that is easier to compare and visualize. The voiceprint features are Fourier-transformed, the spectrum is mapped to the Mel scale with triangular window functions, the logarithm is taken, and a discrete cosine transform is applied, yielding the voiceprint cepstrums of the multiple targets within the period. The Mel-frequency cepstral coefficients are derived as follows:
X[k] = Σ_{n=0}^{N-1} x[n] · e^{-j2πkn/N},  k = 0, …, N-1
S[m] = Σ_{k=0}^{N-1} |X[k]|^2 · H_m[k],  m = 1, …, M
c[n] = Σ_{m=1}^{M} log(S[m]) · cos[πn(m - 0.5)/M],  n = 1, …, L
wherein x[n] is the framed voiceprint signal, H_m[k] is the m-th triangular Mel filter, S[m] is the corresponding filter-bank energy, and c[n] are the resulting Mel-frequency cepstral coefficients.
2. The service robot extracts the moving images over the same period of time, performs multi-face recognition on the images, focuses on the width change between the mouth corners of each face, and extracts image features to obtain the moving-image cepstrums of the different faces within that period.
2.1 The camera array built into the service robot collects images over a period of time to form a video; face recognition is performed on the image sequence, key points are marked at the mouth corners of each face, the multiple targets in the same image are then segmented, and an independent image sequence is stored for each face.
2.2 Feature extraction is performed on the image sequence of each face obtained in 2.1; a convolutional neural network extracts features from the region around the mouth-corner key points, giving a feature-face sequence of mouth-corner changes for each face. The operation of the convolutional neural network is represented by:
y_{i,j} = f( Σ_m Σ_{u=0}^{s-1} Σ_{v=0}^{s-1} w_{m,u,v} · x_{m,i+u,j+v} + b )
wherein m indexes the feature maps of the previous layer, s is the size of the convolution kernel, and u and v index the shared weights w.
2.3 Motion analysis is performed on the feature-face sequences obtained in 2.2 using the inter-frame difference method; each face corresponds to a series of feature vectors, on which spectrum analysis based on Mel-frequency cepstral coefficients is then performed. The inter-frame difference method is represented by the following equation:
D(x, y) = 1 if |I_t(x, y) - I_{t-1}(x, y)| > T, otherwise D(x, y) = 0
wherein D(x, y) is the difference image between two consecutive frames, I_t and I_{t-1} are the images at time t and time t-1 respectively, T is the threshold selected when binarizing the difference image, D(x, y) = 1 represents foreground, and D(x, y) = 0 represents background.
3. Similarity matching is performed between the voiceprint cepstrums and the face cepstrums. A successfully matched target is regarded as a service object, and the service robot actively comes forward to provide service; an unmatched target is regarded as abnormal, and the service robot continues to search for a target that needs service. Specifically, a threshold is set. When the similarity error between a voiceprint cepstrum and a face cepstrum is smaller than the threshold, the voiceprint and the face whose cepstrums have the greatest similarity are regarded as a match, and the successfully matched target is regarded as a service object. When the similarity error between a certain voiceprint cepstrum and every face cepstrum is greater than the threshold, or the similarity error between a certain face cepstrum and every voiceprint cepstrum is greater than the threshold, matching is unsuccessful and the target is regarded as abnormal; the reason may be that the target's voice was collected by the microphone array but the target is outside the collection range of the camera array, or that the target's face was collected by the camera array but the target did not speak. The similarity between a voiceprint cepstrum and a face cepstrum is measured by the cosine similarity:
cos(A, B) = (A · B) / (‖A‖ · ‖B‖) = Σ_i A_i B_i / ( sqrt(Σ_i A_i^2) · sqrt(Σ_i B_i^2) )
wherein A is a voiceprint cepstrum and B is a face cepstrum.
The invention provides a voiceprint and face-based matching method suitable for multiple targets. It overcomes two limitations of existing biometric identification and authentication, namely reliance on a single type of biometric information and the need to enroll that information in a database in advance, so that new targets can be recognized while multi-target matching is guaranteed, and it has broad application prospects in biometric identification and authentication, business hall service robots, and related fields.

Claims (8)

1. A voiceprint and face based matching method suitable for multiple targets is characterized by comprising the following steps:
(1) extracting sound segments within a period of time, performing feature extraction, and performing speaker separation on the mixed sound source with the features extracted to obtain voiceprint cepstrums of a plurality of targets within the period of time;
(2) extracting moving images in the same period of time, performing multi-face recognition on the images, focusing on the width change between the corners of the mouth in the face for each face image, and extracting image characteristics to obtain the cepstrum of the moving images of different faces in the period of time;
(3) similarity matching is carried out on the voiceprint cepstrum and the face cepstrum, a target which is successfully matched is regarded as a service object, and a target which is not successfully matched is regarded as abnormal.
2. The matching method based on the voiceprint and the human face suitable for multiple targets as claimed in claim 1, wherein in the step (1), the voice segments within a period of time are extracted, and the specific method for extracting the features comprises the following steps:
the method comprises the steps of collecting sound in a period of time by using a microphone array, converting the sound of a speaker into feature vectors by using a deep neural network formed by continuous application of a plurality of nonlinear functions for the extracted sound segments, wherein one speaker corresponds to a plurality of feature vectors.
3. The matching method based on voiceprints and human faces applicable to multiple targets as claimed in claim 1, wherein in the step (1), the specific method for speaker separation of the mixed sound source after feature extraction is as follows:
and adopting a supervised cyclic neural network sharing parameters to obtain the probability that different voiceprint fragment characteristics belong to the same speaker, and when the probability is greater than a threshold value, determining that the voiceprint fragment characteristics belong to the speaker, and splicing the voiceprint fragment characteristics belonging to the same speaker together according to the time sequence and the result of speaker separation to form the complete voiceprint characteristics of each speaker in the time.
4. The method for matching based on the voiceprint and the human face, which is suitable for multiple targets, according to claim 1, wherein in the step (1), the specific method for obtaining the voiceprint cepstrum of the multiple targets in the period of time is as follows:
the voiceprint characteristics are subjected to Fourier transform, then a triangular window function is utilized to map the frequency spectrum to the Mel scale, logarithm is taken, then discrete cosine conversion is carried out, and the voiceprint cepstrums of a plurality of targets in the period of time are obtained.
5. The method for matching based on voiceprint and human face for multiple targets as claimed in claim 1, wherein in step (2), the specific method for extracting the moving image in the same time and performing multiple face recognition on the image is as follows:
the method comprises the steps of collecting images within a period of time by using a camera array to form a video, carrying out face recognition on an image sequence, marking key points on the corners of mouths in the faces, then segmenting multiple targets in the same image, and storing an independent image sequence for each face.
6. The matching method based on the voiceprints and the human faces applicable to multiple targets as claimed in claim 1, wherein in the step (2), for each human face image, the width change between the mouth corners in the human face is mainly concerned, and the specific method for extracting the image features is as follows:
and (3) extracting the characteristics of the parts of the key points of the mouth corners by adopting a convolutional neural network to obtain a characteristic face sequence of the mouth corner change corresponding to each face.
7. The voiceprint and face matching method applicable to multiple targets according to claim 1, wherein in the step (2), the specific method for obtaining the moving image cepstrum of different faces in the period of time is as follows:
and performing motion analysis on the obtained characteristic face sequence by using an interframe difference method, wherein each face corresponds to a series of characteristic vectors, and performing frequency spectrum analysis based on a Mel cepstrum coefficient on the characteristic vectors.
8. The voiceprint and face-based matching method suitable for multiple targets according to claim 1, wherein in the step (3), similarity matching is performed between the voiceprint cepstrums and the face cepstrums, a successfully matched target is regarded as a service object, and an unmatched target is regarded as abnormal; the specific method is as follows: a threshold is set, and when the similarity error between a voiceprint cepstrum and a face cepstrum is smaller than the threshold, the voiceprint and the face whose cepstrums have the greatest similarity are regarded as a match, and the successfully matched target is regarded as a service object; when the similarity error between a certain voiceprint cepstrum and every face cepstrum is greater than the threshold, or the similarity error between a certain face cepstrum and every voiceprint cepstrum is greater than the threshold, matching is unsuccessful and the target is regarded as abnormal, the possible reason being that the target's voice was collected by the microphone array but the target is outside the collection range of the camera array, or that the target's face was collected by the camera array but the target did not speak; the similarity between a voiceprint cepstrum and a face cepstrum is measured by cosine similarity.
CN202110174056.9A 2021-02-07 2021-02-07 Voiceprint and face-based matching method suitable for multiple targets Pending CN112906544A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110174056.9A CN112906544A (en) 2021-02-07 2021-02-07 Voiceprint and face-based matching method suitable for multiple targets

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110174056.9A CN112906544A (en) 2021-02-07 2021-02-07 Voiceprint and face-based matching method suitable for multiple targets

Publications (1)

Publication Number Publication Date
CN112906544A true CN112906544A (en) 2021-06-04

Family

ID=76124160

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110174056.9A Pending CN112906544A (en) 2021-02-07 2021-02-07 Voiceprint and face-based matching method suitable for multiple targets

Country Status (1)

Country Link
CN (1) CN112906544A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113571060A (en) * 2021-06-10 2021-10-29 西南科技大学 Multi-person conversation ordering method and system based on visual-auditory fusion
CN113571060B (en) * 2021-06-10 2023-07-11 西南科技大学 Multi-person dialogue ordering method and system based on audio-visual sense fusion
CN116312552A (en) * 2023-05-19 2023-06-23 湖北微模式科技发展有限公司 Video speaker journaling method and system
CN116312552B (en) * 2023-05-19 2023-08-15 湖北微模式科技发展有限公司 Video speaker journaling method and system

Similar Documents

Publication Publication Date Title
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
CN106599866B (en) Multi-dimensional user identity identification method
Plinge et al. A bag-of-features approach to acoustic event detection
US20040260550A1 (en) Audio processing system and method for classifying speakers in audio data
JP2001092974A (en) Speaker recognizing method, device for executing the same, method and device for confirming audio generation
CN112906544A (en) Voiceprint and face-based matching method suitable for multiple targets
Wang et al. Audio event detection and classification using extended R-FCN approach
Chowdhury et al. Msu-avis dataset: Fusing face and voice modalities for biometric recognition in indoor surveillance videos
CN111341350A (en) Man-machine interaction control method and system, intelligent robot and storage medium
Brutti et al. Online cross-modal adaptation for audio–visual person identification with wearable cameras
Besson et al. Extraction of audio features specific to speech production for multimodal speaker detection
Cheng et al. The dku audio-visual wake word spotting system for the 2021 misp challenge
Ahmad et al. Speech enhancement for multimodal speaker diarization system
CN113707175A (en) Acoustic event detection system based on feature decomposition classifier and self-adaptive post-processing
CN117672202A (en) Environmental sound classification method for generating countermeasure network based on depth convolution
CN116612542A (en) Multi-mode biological feature consistency-based audio and video character recognition method and system
Luque et al. Audio, video and multimodal person identification in a smart room
Bredin et al. The biosecure talking-face reference system
Churaev et al. Multi-user facial emotion recognition in video based on user-dependent neural network adaptation
Luna-Jiménez et al. GTH-UPM System for Albayzin Multimodal Diarization Challenge 2020
Chennoor et al. Human emotion detection from audio and video signals
Nakamura et al. Speech-Section Extraction Using Lip Movement and Voice Information in Japanese
CN114049900B (en) Model training method, identity recognition device and electronic equipment
CN115472152B (en) Voice endpoint detection method and device, computer equipment and readable storage medium
CN113571060B (en) Multi-person dialogue ordering method and system based on audio-visual sense fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination