CN112906544A - Voiceprint and face-based matching method suitable for multiple targets - Google Patents

Voiceprint and face-based matching method suitable for multiple targets

Info

Publication number
CN112906544A
CN112906544A (application CN202110174056.9A)
Authority
CN
China
Prior art keywords
face
voiceprint
cepstrum
matching
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110174056.9A
Other languages
Chinese (zh)
Inventor
梁哲辉
梁东贵
曾宪毅
李紫楠
李韫莛
陈敏
熊伟
陈光辉
李莹
李永恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Original Assignee
Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd filed Critical Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority to CN202110174056.9A priority Critical patent/CN112906544A/en
Publication of CN112906544A publication Critical patent/CN112906544A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/02Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/18Artificial neural networks; Connectionist approaches
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Quality & Reliability (AREA)
  • Collating Specific Patterns (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a voiceprint and face-based matching method suitable for multiple targets, relating to pattern matching and multi-target matching for service robots and similar devices in a business hall environment. The method comprises the following steps: (1) extracting sound segments within a period of time, performing feature extraction, and performing speaker separation on the feature-extracted mixed sound source to obtain voiceprint cepstrums of multiple targets within that period; (2) extracting moving images over the same period, performing multi-face recognition on the images, focusing on the change of width between the mouth corners of each face, and extracting image features to obtain moving-image cepstrums of the different faces within that period; (3) performing similarity matching between the voiceprint cepstrums and the face cepstrums, regarding a successfully matched target as a service object and an unmatched target as abnormal. The invention overcomes two limitations of existing biometric identification and authentication, namely reliance on a single type of biometric information and the need to enroll that information in a database in advance, so that new targets can be recognized while multi-target matching is guaranteed.

Description

Voiceprint and face-based matching method suitable for multiple targets
Technical Field
The invention belongs to the field of pattern recognition and multi-target matching, and particularly relates to a voiceprint and face-based matching method suitable for multiple targets, intended for service robots and similar devices in a business hall environment.
Background
With the development of artificial intelligence, biometric identification has received increasing attention. However, current biometric methods on the market rely on a single modality, or can only authenticate biometric information already stored in a database, and therefore cannot be applied to complex scenes that require multi-target authentication and matching.
Most existing voiceprint- and face-based authentication methods on the market require the user's voiceprint and face information to be enrolled in a database in advance; during authentication, newly collected voiceprint and face information is compared with the enrolled information, and authentication succeeds if they are consistent. Because such methods require prior enrollment of biometric information, they can neither recognize new users nor match multiple users, and thus struggle with multi-target matching in complex scenes (such as a service business hall), where the user to be served must be located accurately.
When the service robot faces multiple users who are speaking, conventional techniques cannot distinguish which of them is the user to be served. The invention introduces voiceprint recognition and face recognition, determines the target service object by comparing the voiceprint variation of each speaking user with that user's facial variation, prevents the service from being interfered with by irrelevant speech, accurately recognizes the target user in a complex multi-user environment, and supports continuous interaction.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art, mainly through the following technical scheme:
a voiceprint and face based matching method suitable for multiple targets is characterized by comprising the following steps:
(1) extracting sound segments within a period of time, performing feature extraction, and performing speaker separation on the feature-extracted mixed sound source to obtain voiceprint cepstrums of multiple targets within that period;
(2) extracting the moving images over the same period of time, performing multi-face recognition on the images, focusing on the width change between the mouth corners of each face, and extracting image features to obtain the moving-image cepstrums of the different faces within that period;
(3) performing similarity matching between the voiceprint cepstrums and the face cepstrums, regarding a successfully matched target as a service object and an unmatched target as abnormal.
In the step (1), the sound segments within a period of time are extracted, and the specific method for extracting the features comprises the following steps:
the method comprises the steps of collecting sound in a period of time by using a microphone array, converting the sound of a speaker into feature vectors by using a deep neural network formed by continuous application of a plurality of nonlinear functions for the extracted sound segments, wherein one speaker corresponds to a plurality of feature vectors. The voiceprint characteristics can be obtained by:
P(spk | x) = exp(y_spk) / Σ_{s ∈ S} exp(y_s)
wherein y represents the activation vector of the last hidden layer of the deep neural network, spk represents the current speaker, and S represents the set of all possible speakers contained in the presented sound clip.
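For illustration only, and not as part of the disclosed embodiment, the following minimal Python sketch shows a feed-forward network of stacked nonlinearities whose last hidden activation is taken as the voiceprint feature, with a softmax over an assumed candidate speaker set; all layer sizes and the random weights are placeholders:

import numpy as np

def dnn_voiceprint(frames, weights):
    # Forward pass of a small DNN built from successive nonlinear layers.
    # The activation of the last hidden layer is used as the voiceprint feature.
    h = frames
    for W, b in weights[:-1]:
        h = np.maximum(0.0, h @ W + b)             # ReLU hidden layers
    y = h                                          # last hidden activation (voiceprint feature)
    W_out, b_out = weights[-1]
    logits = y @ W_out + b_out                     # scores over the assumed speaker set
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)             # softmax P(spk | frame)
    return y, p

# Usage with placeholder dimensions: 40-dim acoustic frames, 8 candidate speakers.
rng = np.random.default_rng(0)
dims = [40, 256, 256, 8]
weights = [(0.1 * rng.standard_normal((a, b)), np.zeros(b))
           for a, b in zip(dims[:-1], dims[1:])]
features, speaker_probs = dnn_voiceprint(rng.standard_normal((100, 40)), weights)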
In the step (1), the specific method for separating the speakers from the mixed sound source after feature extraction is as follows:
The voiceprint features obtained above are mixed and may contain different voiceprint fragment features of the same speaker. To separate speakers from the mixed voiceprint features, a supervised recurrent neural network with shared parameters is used to obtain the probability that different voiceprint fragment features belong to the same speaker; when the probability exceeds a threshold, those fragment features are considered to belong to that speaker. The probability can be obtained by:
P(y | X) = Π_{t=1}^{T} P(y_t | x_1, …, x_t, y_1, …, y_{t-1})
wherein X = (x_1, x_2, …, x_T) represents the voiceprint fragment features obtained in the previous step, and y = (y_1, y_2, …, y_T) represents the speaker labels corresponding to those fragment features. According to the result of speaker separation, the voiceprint fragment features belonging to the same speaker are spliced together in time order to form the complete voiceprint features of each speaker within the period.
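As an illustrative stand-in only: the disclosure uses a parameter-sharing supervised recurrent network to score whether fragments belong to the same speaker, which is not reproduced here; the sketch below substitutes a simple cosine-similarity threshold to show the grouping and the time-ordered splicing step, and the threshold value is an assumption:

import numpy as np

def assign_speakers(fragments, threshold=0.7):
    # Greedy grouping: a fragment joins the existing speaker whose centroid is most
    # similar if that similarity exceeds the threshold, else it starts a new speaker.
    # The cosine-similarity score stands in for the supervised RNN probability.
    centroids, labels = [], []
    for f in fragments:
        sims = [float(f @ c) / (np.linalg.norm(f) * np.linalg.norm(c) + 1e-9)
                for c in centroids]
        if sims and max(sims) > threshold:
            k = int(np.argmax(sims))
            centroids[k] = 0.5 * (centroids[k] + f)   # update running centroid
        else:
            k = len(centroids)
            centroids.append(np.asarray(f, dtype=float))
        labels.append(k)
    return labels

def splice_by_speaker(fragments, labels):
    # Concatenate, in time order, the fragment features that share a speaker label.
    grouped = {}
    for f, k in zip(fragments, labels):
        grouped.setdefault(k, []).append(f)
    return {k: np.concatenate(v) for k, v in grouped.items()}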
In the step (1), the specific method for obtaining the voiceprint cepstrum of the multiple targets in the period of time is as follows:
and carrying out spectrum analysis on the complete voiceprint characteristics of each speaker in a period of time obtained by the steps based on the Mel cepstrum coefficient to form a result which is more beneficial to comparison and visualization. The voiceprint characteristics are subjected to Fourier transform, then a triangular window function is utilized to map the frequency spectrum to the Mel scale, logarithm is taken, then discrete cosine conversion is carried out, and the voiceprint cepstrums of a plurality of targets in the period of time are obtained. The derivation of the coefficients of the mel-frequency cepstrum is obtained by:
X[k] = Σ_{n=0}^{N-1} x[n] · e^{-j2πkn/N},  k = 0, …, N-1
S[m] = Σ_{k=0}^{N-1} |X[k]|^2 · H_m[k],  m = 1, …, M
c[n] = Σ_{m=1}^{M} log(S[m]) · cos[πn(m - 0.5)/M],  n = 1, …, L
wherein x[n] is the framed voiceprint signal, H_m[k] is the m-th triangular Mel filter, S[m] is the corresponding filter-bank energy, and c[n] are the resulting Mel-frequency cepstral coefficients.
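For illustration, a minimal single-frame Mel-cepstrum computation consistent with the chain above (Fourier transform, triangular Mel filter bank, logarithm, discrete cosine transform); the sampling rate, filter count, and coefficient count are assumptions:

import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr=16000, n_filters=26, n_ceps=13):
    # Power spectrum X[k] of one windowed frame.
    spec = np.abs(np.fft.rfft(frame)) ** 2
    n_fft = len(frame)
    # Triangular filters H_m[k], equally spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, spec.shape[0]))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    energies = np.maximum(fbank @ spec, 1e-10)            # filter-bank energies S[m]
    return dct(np.log(energies), norm='ortho')[:n_ceps]   # cepstral coefficients c[n]

# Usage on a 512-sample synthetic frame.
coeffs = mfcc_frame(np.random.default_rng(0).standard_normal(512))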
in the step (2), a specific method for extracting a moving image in the same time and performing multi-face recognition on the image comprises the following steps:
the method comprises the steps of collecting images within a period of time by using a camera array to form a video, carrying out face recognition on an image sequence, marking key points on the corners of mouths in the faces, then segmenting multiple targets in the same image, and storing an independent image sequence for each face.
In the step (2), for each face image, attention is focused on the width change between the mouth corners of the face; the specific method for extracting the image features is as follows:
Feature extraction is performed on the image sequence of each face obtained above; a convolutional neural network extracts features from the region around the mouth-corner key points, giving a feature-face sequence of mouth-corner changes for each face. The operation of the convolutional neural network is represented by:
y_{i,j} = f( Σ_m Σ_{u=0}^{s-1} Σ_{v=0}^{s-1} w_{m,u,v} · x_{m,i+u,j+v} + b )
wherein m indexes the feature maps of the previous layer, s is the size of the convolution kernel, and u and v index the shared weights w.
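A minimal single-channel convolution sketch matching the shared-weight sum over u and v in the formula above; the ReLU nonlinearity, kernel size, and patch size are assumptions for illustration:

import numpy as np

def conv2d(x, w, b=0.0):
    # Valid 2-D convolution of a single-channel patch x with one shared s x s kernel w.
    s = w.shape[0]
    H, W = x.shape
    out = np.zeros((H - s + 1, W - s + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(w * x[i:i + s, j:j + s]) + b
    return np.maximum(out, 0.0)       # ReLU activation (assumed)

# Usage on a 24 x 24 mouth-corner patch with a 3 x 3 kernel (sizes are placeholders).
rng = np.random.default_rng(0)
feature_map = conv2d(rng.random((24, 24)), 0.1 * rng.random((3, 3)))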
In the step (2), the specific method for obtaining the moving image cepstrum of different faces in the period of time is as follows:
and performing motion analysis on the characteristic face sequence obtained in the step by using an interframe difference method, wherein each face corresponds to a series of characteristic vectors, and performing spectrum analysis based on a Mel-cepstrum coefficient on the characteristic vectors. The interframe difference method is represented by the following equation:
D(x, y) = 1 if |I_t(x, y) - I_{t-1}(x, y)| > T, otherwise D(x, y) = 0
wherein D(x, y) is the difference image between two consecutive frames, I_t and I_{t-1} are the images at time t and time t-1 respectively, T is the threshold selected when binarizing the difference image, D(x, y) = 1 represents foreground, and D(x, y) = 0 represents background.
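An illustrative sketch of the inter-frame difference, reducing each face's motion to a per-frame foreground ratio before the Mel-cepstrum analysis; the threshold value and the foreground-ratio reduction are assumptions:

import numpy as np

def frame_difference(prev_frame, cur_frame, T=25):
    # D(x, y) = 1 (foreground) where |I_t - I_{t-1}| > T, else 0 (background).
    diff = np.abs(cur_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return (diff > T).astype(np.uint8)

def motion_signal(face_sequence, T=25):
    # One value per frame pair: fraction of foreground pixels in the mouth-corner region.
    return np.asarray([frame_difference(a, b, T).mean()
                       for a, b in zip(face_sequence[:-1], face_sequence[1:])])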
In the step (3), similarity matching is performed between the voiceprint cepstrums and the face cepstrums, a successfully matched target is regarded as a service object, and an unmatched target is regarded as abnormal; the specific method is as follows:
A threshold is set. When the similarity error between a voiceprint cepstrum and a face cepstrum is smaller than the threshold, the voiceprint and the face whose cepstrums have the greatest similarity are regarded as a match, and the successfully matched target is regarded as a service object. When the similarity error between a certain voiceprint cepstrum and every face cepstrum is greater than the threshold, or the similarity error between a certain face cepstrum and every voiceprint cepstrum is greater than the threshold, matching is unsuccessful and the target is regarded as abnormal; the reason may be that the target's voice was collected by the microphone array but the target is outside the collection range of the camera array, or that the target's face was collected by the camera array but the target did not speak. The similarity between a voiceprint cepstrum and a face cepstrum is measured by the cosine similarity:
cos(A, B) = (A · B) / (‖A‖ · ‖B‖) = Σ_i A_i B_i / ( sqrt(Σ_i A_i^2) · sqrt(Σ_i B_i^2) )
wherein A is a voiceprint cepstrum and B is a face cepstrum.
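For illustration, a greedy matching sketch using the cosine similarity above; it assumes the voiceprint and face cepstra have been resampled to a common length, and it replaces the similarity-error threshold of the disclosure with an equivalent minimum-similarity test whose value is an assumption:

import numpy as np

def cosine(a, b):
    return float(a @ b) / (float(np.linalg.norm(a) * np.linalg.norm(b)) + 1e-9)

def match_targets(voice_ceps, face_ceps, min_sim=0.8):
    # Pair each voiceprint cepstrum with its most similar unused face cepstrum.
    # Pairs below min_sim, and leftovers on either side, are treated as abnormal.
    matches, abnormal, used = [], [], set()
    for vi, v in enumerate(voice_ceps):
        sims = [cosine(v, f) if fi not in used else -1.0
                for fi, f in enumerate(face_ceps)]
        fi = int(np.argmax(sims)) if sims else -1
        if fi >= 0 and sims[fi] >= min_sim:
            matches.append((vi, fi))             # service object
            used.add(fi)
        else:
            abnormal.append(('voiceprint', vi))  # heard, but no matching face
    abnormal += [('face', fi) for fi in range(len(face_ceps)) if fi not in used]
    return matches, abnormal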
Drawings
FIG. 1 is a general flowchart of a voiceprint and face based matching method for multiple targets according to an embodiment;
fig. 2 is an application scene diagram of a voiceprint and face-based matching method suitable for multiple targets according to an embodiment.
Detailed Description
The following describes in detail a specific embodiment of a voiceprint and face based matching method for multiple targets according to the present invention with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
Examples
Fig. 1 is an overall flow diagram of the voiceprint and face-based matching method suitable for multiple targets according to the embodiment, and Fig. 2 is an application scenario diagram of the method.
A voiceprint and face based matching method suitable for multiple targets comprises the following steps:
1. and the service robot extracts the sound segments within a period of time, performs feature extraction, and performs speaker separation on the mixed sound source with the extracted features to obtain voiceprint cepstrums of a plurality of targets within the period of time.
1.1 the built-in microphone array of the service robot is used for collecting the sound in a period of time, for the extracted sound segment, the deep neural network formed by continuous application of a plurality of nonlinear functions is used for converting the sound of a speaker into feature vectors, one speaker corresponds to a plurality of feature vectors, the process can realize the elimination of the environmental noise, and only the voiceprint feature of the speaker is extracted. The voiceprint characteristics can be obtained by:
P(spk | x) = exp(y_spk) / Σ_{s ∈ S} exp(y_s)
wherein y represents the activation vector of the last hidden layer of the deep neural network, spk represents the current speaker, and S represents the set of all possible speakers contained in the presented sound clip.
1.2 The voiceprint features obtained in 1.1 are mixed and may contain different voiceprint fragment features of the same speaker. To separate speakers from the mixed voiceprint features, a supervised recurrent neural network with shared parameters is used to obtain the probability that different voiceprint fragment features belong to the same speaker; when the probability exceeds a threshold, those fragment features are considered to belong to that speaker. The probability can be obtained by:
P(y | X) = Π_{t=1}^{T} P(y_t | x_1, …, x_t, y_1, …, y_{t-1})
wherein X = (x_1, x_2, …, x_T) represents the voiceprint fragment features obtained in 1.1, and y = (y_1, y_2, …, y_T) represents the speaker labels corresponding to those fragment features. According to the result of speaker separation, the voiceprint fragment features belonging to the same speaker are spliced together in time order to form the complete voiceprint features of each speaker within the period.
1.3 Spectrum analysis based on Mel-frequency cepstral coefficients is performed on the complete voiceprint features of each speaker over the period obtained in 1.2, producing a result that is easier to compare and visualize. The voiceprint features are Fourier-transformed, the spectrum is mapped to the Mel scale with triangular window functions, the logarithm is taken, and a discrete cosine transform is applied, yielding the voiceprint cepstrums of the multiple targets within the period. The Mel-frequency cepstral coefficients are derived as follows:
X[k] = Σ_{n=0}^{N-1} x[n] · e^{-j2πkn/N},  k = 0, …, N-1
S[m] = Σ_{k=0}^{N-1} |X[k]|^2 · H_m[k],  m = 1, …, M
c[n] = Σ_{m=1}^{M} log(S[m]) · cos[πn(m - 0.5)/M],  n = 1, …, L
wherein x[n] is the framed voiceprint signal, H_m[k] is the m-th triangular Mel filter, S[m] is the corresponding filter-bank energy, and c[n] are the resulting Mel-frequency cepstral coefficients.
2. The service robot extracts the moving images over the same period of time, performs multi-face recognition on the images, focuses on the width change between the mouth corners of each face, and extracts image features to obtain the moving-image cepstrums of the different faces within that period.
2.1 The camera array built into the service robot collects images over a period of time to form a video; face recognition is performed on the image sequence, key points are marked at the mouth corners of each face, the multiple targets in the same image are then segmented, and an independent image sequence is stored for each face.
2.2 Feature extraction is performed on the image sequence of each face obtained in 2.1; a convolutional neural network extracts features from the region around the mouth-corner key points, giving a feature-face sequence of mouth-corner changes for each face. The operation of the convolutional neural network is represented by:
y_{i,j} = f( Σ_m Σ_{u=0}^{s-1} Σ_{v=0}^{s-1} w_{m,u,v} · x_{m,i+u,j+v} + b )
wherein m indexes the feature maps of the previous layer, s is the size of the convolution kernel, and u and v index the shared weights w.
2.3 Motion analysis is performed on the feature-face sequences obtained in 2.2 using the inter-frame difference method; each face corresponds to a series of feature vectors, on which spectrum analysis based on Mel-frequency cepstral coefficients is then performed. The inter-frame difference method is represented by the following equation:
D(x, y) = 1 if |I_t(x, y) - I_{t-1}(x, y)| > T, otherwise D(x, y) = 0
wherein D(x, y) is the difference image between two consecutive frames, I_t and I_{t-1} are the images at time t and time t-1 respectively, T is the threshold selected when binarizing the difference image, D(x, y) = 1 represents foreground, and D(x, y) = 0 represents background.
3. Similarity matching is performed between the voiceprint cepstrums and the face cepstrums. A successfully matched target is regarded as a service object, and the service robot actively comes forward to provide service; an unmatched target is regarded as abnormal, and the service robot continues to search for a target that needs service. Specifically, a threshold is set. When the similarity error between a voiceprint cepstrum and a face cepstrum is smaller than the threshold, the voiceprint and the face whose cepstrums have the greatest similarity are regarded as a match, and the successfully matched target is regarded as a service object. When the similarity error between a certain voiceprint cepstrum and every face cepstrum is greater than the threshold, or the similarity error between a certain face cepstrum and every voiceprint cepstrum is greater than the threshold, matching is unsuccessful and the target is regarded as abnormal; the reason may be that the target's voice was collected by the microphone array but the target is outside the collection range of the camera array, or that the target's face was collected by the camera array but the target did not speak. The similarity between a voiceprint cepstrum and a face cepstrum is measured by the cosine similarity:
cos(A, B) = (A · B) / (‖A‖ · ‖B‖) = Σ_i A_i B_i / ( sqrt(Σ_i A_i^2) · sqrt(Σ_i B_i^2) )
wherein A is a voiceprint cepstrum and B is a face cepstrum.
The invention provides a voiceprint and face-based matching method suitable for multiple targets. It overcomes two limitations of existing biometric identification and authentication, namely reliance on a single type of biometric information and the need to enroll that information in a database in advance, so that new targets can be recognized while multi-target matching is guaranteed, and it has broad application prospects in biometric identification and authentication, business hall service robots, and related fields.

Claims (8)

1. A voiceprint and face based matching method suitable for multiple targets is characterized by comprising the following steps:
(1) extracting sound segments within a period of time, performing feature extraction, and performing speaker separation on the mixed sound source with the features extracted to obtain voiceprint cepstrums of a plurality of targets within the period of time;
(2) extracting moving images in the same period of time, performing multi-face recognition on the images, focusing on the width change between the corners of the mouth in the face for each face image, and extracting image characteristics to obtain the cepstrum of the moving images of different faces in the period of time;
(3) similarity matching is carried out on the voiceprint cepstrum and the face cepstrum, a target which is successfully matched is regarded as a service object, and a target which is not successfully matched is regarded as abnormal.
2. The matching method based on the voiceprint and the human face suitable for multiple targets as claimed in claim 1, wherein in the step (1), the voice segments within a period of time are extracted, and the specific method for extracting the features comprises the following steps:
the method comprises the steps of collecting sound in a period of time by using a microphone array, converting the sound of a speaker into feature vectors by using a deep neural network formed by continuous application of a plurality of nonlinear functions for the extracted sound segments, wherein one speaker corresponds to a plurality of feature vectors.
3. The matching method based on voiceprints and human faces applicable to multiple targets as claimed in claim 1, wherein in the step (1), the specific method for speaker separation of the mixed sound source after feature extraction is as follows:
and adopting a supervised cyclic neural network sharing parameters to obtain the probability that different voiceprint fragment characteristics belong to the same speaker, and when the probability is greater than a threshold value, determining that the voiceprint fragment characteristics belong to the speaker, and splicing the voiceprint fragment characteristics belonging to the same speaker together according to the time sequence and the result of speaker separation to form the complete voiceprint characteristics of each speaker in the time.
4. The method for matching based on the voiceprint and the human face, which is suitable for multiple targets, according to claim 1, wherein in the step (1), the specific method for obtaining the voiceprint cepstrum of the multiple targets in the period of time is as follows:
the voiceprint characteristics are subjected to Fourier transform, then a triangular window function is utilized to map the frequency spectrum to the Mel scale, logarithm is taken, then discrete cosine conversion is carried out, and the voiceprint cepstrums of a plurality of targets in the period of time are obtained.
5. The method for matching based on voiceprint and human face for multiple targets as claimed in claim 1, wherein in step (2), the specific method for extracting the moving image in the same time and performing multiple face recognition on the image is as follows:
the method comprises the steps of collecting images within a period of time by using a camera array to form a video, carrying out face recognition on an image sequence, marking key points on the corners of mouths in the faces, then segmenting multiple targets in the same image, and storing an independent image sequence for each face.
6. The matching method based on the voiceprints and the human faces applicable to multiple targets as claimed in claim 1, wherein in the step (2), for each human face image, the width change between the mouth corners in the human face is mainly concerned, and the specific method for extracting the image features is as follows:
and (3) extracting the characteristics of the parts of the key points of the mouth corners by adopting a convolutional neural network to obtain a characteristic face sequence of the mouth corner change corresponding to each face.
7. The voiceprint and face matching method applicable to multiple targets according to claim 1, wherein in the step (2), the specific method for obtaining the moving image cepstrum of different faces in the period of time is as follows:
and performing motion analysis on the obtained characteristic face sequence by using an interframe difference method, wherein each face corresponds to a series of characteristic vectors, and performing frequency spectrum analysis based on a Mel cepstrum coefficient on the characteristic vectors.
8. The voiceprint and face-based matching method suitable for multiple targets according to claim 1, wherein in the step (3), similarity matching is performed between the voiceprint cepstrums and the face cepstrums, a successfully matched target is regarded as a service object, and an unmatched target is regarded as abnormal; the specific method is as follows: a threshold is set, and when the similarity error between a voiceprint cepstrum and a face cepstrum is smaller than the threshold, the voiceprint and the face whose cepstrums have the greatest similarity are regarded as a match, and the successfully matched target is regarded as a service object; when the similarity error between a certain voiceprint cepstrum and every face cepstrum is greater than the threshold, or the similarity error between a certain face cepstrum and every voiceprint cepstrum is greater than the threshold, matching is unsuccessful and the target is regarded as abnormal, the possible reason being that the target's voice was collected by the microphone array but the target is outside the collection range of the camera array, or that the target's face was collected by the camera array but the target did not speak; the similarity between a voiceprint cepstrum and a face cepstrum is measured by cosine similarity.
CN202110174056.9A 2021-02-07 2021-02-07 Voiceprint and face-based matching method suitable for multiple targets Pending CN112906544A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110174056.9A CN112906544A (en) 2021-02-07 2021-02-07 Voiceprint and face-based matching method suitable for multiple targets

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110174056.9A CN112906544A (en) 2021-02-07 2021-02-07 Voiceprint and face-based matching method suitable for multiple targets

Publications (1)

Publication Number Publication Date
CN112906544A true CN112906544A (en) 2021-06-04

Family

ID=76124160

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110174056.9A Pending CN112906544A (en) 2021-02-07 2021-02-07 Voiceprint and face-based matching method suitable for multiple targets

Country Status (1)

Country Link
CN (1) CN112906544A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113571060A (en) * 2021-06-10 2021-10-29 西南科技大学 Multi-person conversation ordering method and system based on visual-auditory fusion
CN113571060B (en) * 2021-06-10 2023-07-11 西南科技大学 Multi-person dialogue ordering method and system based on audio-visual sense fusion
CN116312552A (en) * 2023-05-19 2023-06-23 湖北微模式科技发展有限公司 Video speaker journaling method and system
CN116312552B (en) * 2023-05-19 2023-08-15 湖北微模式科技发展有限公司 Video speaker journaling method and system

Similar Documents

Publication Publication Date Title
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
CN106599866B (en) Multi-dimensional user identity identification method
Plinge et al. A bag-of-features approach to acoustic event detection
US20040260550A1 (en) Audio processing system and method for classifying speakers in audio data
JP2001092974A (en) Speaker recognizing method, device for executing the same, method and device for confirming audio generation
CN112906544A (en) Voiceprint and face-based matching method suitable for multiple targets
Wang et al. Audio event detection and classification using extended R-FCN approach
Chowdhury et al. Msu-avis dataset: Fusing face and voice modalities for biometric recognition in indoor surveillance videos
CN111341350A (en) Man-machine interaction control method and system, intelligent robot and storage medium
Brutti et al. Online cross-modal adaptation for audio–visual person identification with wearable cameras
Besson et al. Extraction of audio features specific to speech production for multimodal speaker detection
Cheng et al. The dku audio-visual wake word spotting system for the 2021 misp challenge
Ahmad et al. Speech enhancement for multimodal speaker diarization system
CN113707175A (en) Acoustic event detection system based on feature decomposition classifier and self-adaptive post-processing
CN117672202A (en) Environmental sound classification method for generating countermeasure network based on depth convolution
CN116612542A (en) Multi-mode biological feature consistency-based audio and video character recognition method and system
Luque et al. Audio, video and multimodal person identification in a smart room
Bredin et al. The biosecure talking-face reference system
Churaev et al. Multi-user facial emotion recognition in video based on user-dependent neural network adaptation
Luna-Jiménez et al. GTH-UPM System for Albayzin Multimodal Diarization Challenge 2020
Chennoor et al. Human emotion detection from audio and video signals
Nakamura et al. Speech-Section Extraction Using Lip Movement and Voice Information in Japanese
CN114049900B (en) Model training method, identity recognition device and electronic equipment
CN115472152B (en) Voice endpoint detection method and device, computer equipment and readable storage medium
CN113571060B (en) Multi-person dialogue ordering method and system based on audio-visual sense fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination