CN111862990B - Speaker identity verification method and system - Google Patents

Speaker identity verification method and system

Info

Publication number
CN111862990B
CN111862990B (application CN202010705582.9A)
Authority
CN
China
Prior art keywords
embedding
feature embedding
feature
facial
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010705582.9A
Other languages
Chinese (zh)
Other versions
CN111862990A (en)
Inventor
钱彦旻 (Qian Yanmin)
陈正阳 (Chen Zhengyang)
王帅 (Wang Shuai)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202010705582.9A priority Critical patent/CN111862990B/en
Publication of CN111862990A publication Critical patent/CN111862990A/en
Application granted granted Critical
Publication of CN111862990B publication Critical patent/CN111862990B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/08 Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/166 Detection; Localisation; Normalisation using acquisition arrangements
    • G06V40/168 Feature extraction; Face representation
    • G06V40/172 Classification, e.g. identification


Abstract

The invention discloses a speaker identity verification method, which comprises the following steps: acquiring audio data and facial image data of the speaker; extracting a voice feature embedding from the audio data and a facial feature embedding from the facial image data; and determining an identity feature embedding from the voice feature embedding and the facial feature embedding for speaker identity verification. The invention provides a scheme for verifying a person's identity using multi-modal information (from the face and the voice), which alleviates the problem that a single modality is easily degraded by external factors and may therefore fail to support identity verification, and improves the success rate of identity verification.

Description

Speaker identity verification method and system
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a speaker identity verification method and a speaker identity verification system.
Background
The speaker identity verification methods in the prior art include voiceprint-based verification and face-recognition-based verification. These techniques use a single physiological characteristic of a person to verify that person's identity. However, a single physiological characteristic may, under certain conditions, be unable to distinguish a particular person. For example, in a noisy environment the voice of a particular person may not be heard clearly, and facial features may fail to distinguish a person when the head is turned away or the person is moving.
Disclosure of Invention
An embodiment of the present invention provides a method and a system for speaker identity verification, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a speaker identity verification method, including:
acquiring audio data and facial image data of the speaker;
extracting voice feature embeddings from the audio data, extracting facial feature embeddings from the facial image data;
and determining identity feature embedding according to the voice feature embedding and the facial feature embedding for speaker identity verification.
In a second aspect, an embodiment of the present invention provides a speaker authentication system, including:
the audio-visual data acquisition module is used for acquiring audio data and facial image data of the speaker;
the feature extraction module is used for extracting voice feature embedding from the audio data and extracting facial feature embedding from the facial image data;
and the identity characteristic embedding determination module is used for determining identity characteristic embedding according to the voice characteristic embedding and the facial characteristic embedding so as to carry out speaker identity verification.
In a third aspect, an embodiment of the present invention provides a storage medium, in which one or more programs including execution instructions are stored, where the execution instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any speaker authentication method of the present invention.
In a fourth aspect, an electronic device is provided, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform any of the speaker verification methods of the present invention described above.
In a fifth aspect, embodiments of the present invention further provide a computer program product, the computer program product including a computer program stored on a storage medium, the computer program including program instructions that, when executed by a computer, cause the computer to perform any one of the above speaker authentication methods.
The embodiments of the invention have the following beneficial effects: a scheme for verifying a person's identity using multi-modal information (from the face and the voice) is provided, which alleviates the problem that a single modality is easily degraded by external factors and may therefore fail to support identity verification, and improves the success rate of identity verification.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required in the description of the embodiments are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of an embodiment of a speaker verification method of the present invention;
FIG. 2 is a flow chart of another embodiment of the speaker verification method of the present invention;
FIG. 3 is a functional block diagram of an embodiment of a speaker verification system of the present invention;
FIG. 4 is a functional block diagram of another embodiment of the speaker verification system of the present invention;
FIG. 5a is a schematic illustration of simple soft attention fusion as used in the present invention;
FIG. 5b is a schematic diagram of a compact bilinear pooling fusion used in the present invention;
FIG. 5c is a schematic diagram of the gated multi-modal fusion used in the present invention;
FIG. 6a is a plot of the distance distributions between positive and negative pairs under the original contrastive loss in the present invention;
FIG. 6b is a plot of the distance distributions between positive and negative pairs under the new contrastive loss in the present invention;
fig. 7 is a schematic structural diagram of an embodiment of an electronic device of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that, in the present application, the embodiments and features of the embodiments may be combined with each other without conflict.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
As used in this application, the terms "module," "apparatus," "system," and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, or software in execution. In particular, for example, an element may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. Also, an application or script running on a server, or a server, may be an element. One or more elements may be in a process and/or thread of execution and an element may be localized on one computer and/or distributed between two or more computers and may be operated by various computer-readable media. The elements may also communicate by way of local and/or remote processes based on a signal having one or more data packets, e.g., from a data packet interacting with another element in a local system, distributed system, and/or across a network in the internet with other systems by way of the signal.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in the process, method, article, or apparatus that comprises that element.
The invention provides a speaker identity verification method, which can be applied to a terminal device that has both face recognition and voiceprint recognition capabilities. For example, the terminal device may be a smart phone, a tablet computer, a smart speaker, a vehicle-mounted terminal, a smart robot, and the like, which is not limited by the invention.
As shown in fig. 1, an embodiment of the present invention provides a speaker authentication method, including:
and S10, acquiring the audio data and the facial image data of the speaker.
Illustratively, the terminal device executing the method is a smartphone, and the audio data of the speaker can be acquired through a microphone of the smartphone, and meanwhile, the face image of the speaker can be acquired through a camera of the smartphone.
S20: extracting a voice feature embedding from the audio data, and extracting a facial feature embedding from the facial image data.
Illustratively, the voice feature embedding may be a voiceprint feature. A ResNet34 and an SE-ResNet50 may be used to extract the voiceprint feature embedding and the facial feature embedding, respectively.
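For illustration only, the following PyTorch sketch shows the shape contract of this step with simple placeholder encoders; the real systems use a ResNet34 on Fbank features and an SE-ResNet50 on aligned face crops, so all layer sizes and input shapes below are assumptions rather than the patent's implementation.

```python
import torch
import torch.nn as nn

class PlaceholderEncoder(nn.Module):
    """Stand-in for the ResNet34 speech encoder / SE-ResNet50 face encoder:
    maps a flattened input to a 512-dimensional identity embedding."""
    def __init__(self, in_dim, emb_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU(),
            nn.Linear(1024, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

# Illustrative input sizes: 40-dim Fbank over 300 frames, 3x112x96 face crops.
speech_encoder = PlaceholderEncoder(in_dim=40 * 300)
face_encoder = PlaceholderEncoder(in_dim=3 * 112 * 96)

audio = torch.randn(8, 40 * 300)
face = torch.randn(8, 3 * 112 * 96)
e_v = speech_encoder(audio)  # speech (voiceprint) feature embedding e_v, shape (8, 512)
e_f = face_encoder(face)     # facial feature embedding e_f, shape (8, 512)
```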
S30: determining an identity feature embedding according to the voice feature embedding and the facial feature embedding for speaker identity verification.
The embodiment of the invention provides a scheme for verifying a person's identity using multi-modal information (from the face and the voice), which alleviates the problem that a single modality is easily degraded by external factors and may therefore fail to support identity verification, and improves the success rate of identity verification.
Fig. 2 is a flow chart of another embodiment of the speaker verification method of the present invention, in which the determining identity feature embedding according to the voice feature embedding and the facial feature embedding includes:
and S31, inputting the voice feature embedding into a first embedding feature conversion layer to obtain the preprocessing voice feature embedding.
And S32, inputting the facial feature embedding into a second embedding feature conversion layer to obtain the preprocessed facial feature embedding.
Illustratively, the conversion layers f_trans_f and f_trans_v convert e_f and e_v into ê_f and ê_v, respectively:

ê_f = f_trans_f(e_f)
ê_v = f_trans_v(e_v)

After conversion, ê_f and ê_v lie in a common embedding space, which is more suitable for the later fusion.
S33: fusing the preprocessed voice feature embedding and the preprocessed facial feature embedding to obtain the identity feature embedding.
Illustratively, the fusing the pre-processed speech feature embedding and the pre-processed facial feature embedding to obtain the identity feature embedding comprises:
determining, by an attention layer f_att, attention scores from the speech feature embedding and the facial feature embedding:

s = f_att([e_f, e_v])

determining weighting coefficients from the attention scores:

[w_f, w_v] = softmax(s)

determining the identity feature embedding from the weighting coefficients, the preprocessed speech feature embedding and the preprocessed facial feature embedding:

e_p = w_f · ê_f + w_v · ê_v

wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, and ê_f is the preprocessed facial feature embedding.
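A minimal PyTorch sketch of this simple soft attention fusion is given below; the 512-dimensional embeddings and the single fully connected conversion and attention layers are assumptions, not the exact configuration of the patented system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSoftAttentionFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.trans_v = nn.Linear(dim, dim)   # first embedded feature conversion layer  f_trans_v
        self.trans_f = nn.Linear(dim, dim)   # second embedded feature conversion layer f_trans_f
        self.att = nn.Linear(2 * dim, 2)     # attention layer f_att, one score per modality

    def forward(self, e_f, e_v):
        ef_hat = self.trans_f(e_f)                        # preprocessed facial embedding
        ev_hat = self.trans_v(e_v)                        # preprocessed speech embedding
        scores = self.att(torch.cat([e_f, e_v], dim=-1))  # attention scores
        w = F.softmax(scores, dim=-1)                     # weighting coefficients [w_f, w_v]
        e_p = w[:, 0:1] * ef_hat + w[:, 1:2] * ev_hat     # fused identity embedding
        return e_p

fusion = SimpleSoftAttentionFusion()
e_p = fusion(torch.randn(8, 512), torch.randn(8, 512))   # (8, 512)
```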
Illustratively, the fusing the pre-processed speech feature embedding and the pre-processed facial feature embedding to obtain the identity feature embedding comprises:
and performing fusion processing on the preprocessed voice feature embedding and the preprocessed face feature embedding in a compact bilinear pooling mode to obtain identity feature embedding.
Illustratively, the fusing the pre-processed speech feature embedding and the pre-processed facial feature embedding to obtain the identity feature embedding comprises:
determining a gate vector from the speech feature embedding and the facial feature embedding:

z = σ(f_att([e_f, e_v]))

fusing the preprocessed speech feature embedding and the preprocessed facial feature embedding with the gate vector to obtain the identity feature embedding:

e_p = z ⊙ ê_f + (1 - z) ⊙ ê_v

wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, ê_f is the preprocessed facial feature embedding, and ⊙ denotes the element-wise product.
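A comparable sketch of the gated fusion follows; the choice of gating the facial branch with z and the speech branch with (1 - z), as well as the layer sizes, are assumptions consistent with the formulas above.

```python
import torch
import torch.nn as nn

class GatedMultimodalFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.trans_v = nn.Linear(dim, dim)   # first embedded feature conversion layer  f_trans_v
        self.trans_f = nn.Linear(dim, dim)   # second embedded feature conversion layer f_trans_f
        self.att = nn.Linear(2 * dim, dim)   # produces the gate vector logits

    def forward(self, e_f, e_v):
        ev_hat = self.trans_v(e_v)
        ef_hat = self.trans_f(e_f)
        z = torch.sigmoid(self.att(torch.cat([e_f, e_v], dim=-1)))  # gate vector z
        e_p = z * ef_hat + (1.0 - z) * ev_hat                       # element-wise gating
        return e_p

fusion = GatedMultimodalFusion()
e_p = fusion(torch.randn(8, 512), torch.randn(8, 512))  # (8, 512)
```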
It should be noted that for simplicity of explanation, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will appreciate that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention. In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
As shown in fig. 3, an embodiment of the present invention further provides a speaker identity verification system 300, which may be used in a terminal device, where the terminal device has both functions of face recognition and voiceprint recognition, for example, the terminal device may be a smart phone, a tablet computer, a smart speaker, a car terminal, a smart robot, and the like, which is not limited in this respect.
As shown in fig. 3, the speaker authentication system 300 includes:
an audio-visual data acquisition module 310 for acquiring audio data and facial image data of the speaker;
a feature extraction module 320 for extracting speech feature embeddings from the audio data and facial feature embeddings from the facial image data;
an identity feature embedding determination module 330, configured to determine identity feature embedding according to the voice feature embedding and the facial feature embedding, for speaker authentication.
Fig. 4 is a schematic block diagram of another embodiment of the speaker verification system of the present invention, in which the identity embedding determination module includes:
a first embedded feature conversion layer 331, configured to perform preprocessing on the voice feature embedding to obtain a preprocessed voice feature embedding;
a second embedded feature conversion layer 332, configured to perform preprocessing on the facial feature embedding to obtain a preprocessed facial feature embedding;
and a fusion module 333, configured to perform fusion processing on the preprocessed voice feature embedding and the preprocessed facial feature embedding to obtain identity feature embedding.
Illustratively, the fusing the pre-processed speech feature embedding and the pre-processed facial feature embedding to obtain the identity feature embedding comprises:
determining, by an attention layer f_att, attention scores from the speech feature embedding and the facial feature embedding:

s = f_att([e_f, e_v])

determining weighting coefficients from the attention scores:

[w_f, w_v] = softmax(s)

determining the identity feature embedding from the weighting coefficients, the preprocessed speech feature embedding and the preprocessed facial feature embedding:

e_p = w_f · ê_f + w_v · ê_v

wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, and ê_f is the preprocessed facial feature embedding.
Illustratively, the fusing the pre-processed speech feature embedding and the pre-processed facial feature embedding to obtain the identity feature embedding comprises:
and performing fusion processing on the preprocessed voice feature embedding and the preprocessed face feature embedding in a compact bilinear pooling mode to obtain identity feature embedding.
Illustratively, the fusing the pre-processed speech feature embedding and the pre-processed facial feature embedding to obtain the identity feature embedding comprises:
determining a gate vector from the speech feature embedding and the facial feature embedding:

z = σ(f_att([e_f, e_v]))

fusing the preprocessed speech feature embedding and the preprocessed facial feature embedding with the gate vector to obtain the identity feature embedding:

e_p = z ⊙ ê_f + (1 - z) ⊙ ê_v

wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, ê_f is the preprocessed facial feature embedding, and ⊙ denotes the element-wise product.
In some embodiments, the present invention provides a non-transitory computer readable storage medium, in which one or more programs including executable instructions are stored, where the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any of the above speaker identity verification methods of the present invention.
In some embodiments, the present invention also provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any of the speaker authentication methods described above.
In some embodiments, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a speaker verification method.
In some embodiments, the present invention further provides a storage medium, on which a computer program is stored, where the program is executed by a processor to implement a speaker identification verification method.
The speaker identity verification system according to the embodiment of the present invention may be used to execute the speaker identity verification method according to the embodiment of the present invention, and accordingly achieves the technical effects achieved by the method embodiments, which are not repeated here. In the embodiments of the present invention, the relevant functional modules may be implemented by a hardware processor.
To describe the technical solutions of the present invention more clearly, and to demonstrate more directly the feasibility and the benefits of the present invention over the prior art, the technical background, the technical solutions, and the experiments performed are described in more detail below.
Abstract
Information from different modalities usually complements each other. In the present invention, we use the audio-visual data in the VoxCeleb datasets for person verification. We explore different information fusion strategies and loss functions for audio-visual person verification systems at the embedding level. System performance is evaluated using the public test lists of the VoxCeleb1 dataset. Our best system fusing audio-visual knowledge at the embedding level achieves 0.585%, 0.427%, and 0.735% EER on the three public test lists of VoxCeleb1, which are the best reported results on this dataset. Furthermore, to mimic more complex test conditions in which a modality is corrupted or missing, we construct a noisy evaluation set based on the VoxCeleb1 dataset. We use a data augmentation strategy at the embedding level to help our audio-visual system distinguish noisy embeddings from clean ones. With this augmentation strategy, the proposed audio-visual person verification system is more robust on the noisy evaluation set.
1. Introduction
A variety of biometric traits may be used to verify a person's identity, and the voice and the face are two typical ones. Face verification and speaker verification are intensively studied topics in the field of biometric identification, and deep learning techniques have recently greatly improved the performance of both tasks. Over the past few years, researchers have studied different architectures and different loss functions, resulting in well-performing systems that can even be commercialized for real-world applications.
Despite the success in single-modality applications, multi-modality learning has attracted increasing attention in both academic and industrial areas. The motivation has two aspects:
1. complementary information from different modalities may improve system performance.
2. Models built from multiple modalities tend to be more robust and fault-tolerant, and can compensate for or suppress failures of a single modality.
In the present invention, cross-modal fusion is performed at the embedding level, using more powerful segment-level trained speaker embeddings. Different fusion strategies and loss functions are studied and compared in a multi-modal learning framework.
Furthermore, to mimic real-world scenarios, we construct a noisy evaluation set in which one modality is corrupted or missing. To compensate for the performance degradation, a new embedding-level data augmentation method, Noise Distribution Matching (NDM), is proposed, which greatly improves performance under noisy conditions.
All systems are evaluated on the standard VoxCeleb1 dataset, and our best multi-modal system achieves EERs of 0.585%, 0.427%, and 0.735% on the three test lists (VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H), respectively. To our knowledge, these are the best reported results on this dataset. Furthermore, the NDM-based multi-modal system shows the ability to select the more salient modality information when evaluated on the noisy evaluation set.
2. Methods
2.1 Embedding-level multi-modal fusion
In this section, we describe three methods for fusing the facial feature embedding e_f and the speech feature embedding e_v into the identity feature embedding e_p. As shown in Figs. 5a to 5c, the conversion layers f_trans_f and f_trans_v first convert e_f and e_v into ê_f and ê_v, respectively:

ê_f = f_trans_f(e_f),  ê_v = f_trans_v(e_v)

After conversion, ê_f and ê_v lie in a common embedding space, which is more suitable for the later fusion.
2.1.1 simple Soft attention fusion
In this section, we first introduce Simple Soft Attention (SSA) fusion. As shown in Fig. 5a, given the face and speech feature embeddings e_f and e_v, the attention weights produced by the attention layer f_att(·) are defined as:

[w_f, w_v] = softmax(f_att([e_f, e_v]))

The fused embedding is then computed as a weighted sum:

e_p = w_f · ê_f + w_v · ê_v
2.1.2 compact bilinear pooling fusion
Bilinear pooling exploits the outer product operation to fully explore the relationship between two vectors and involves no trainable parameters. However, it is generally not feasible in practice due to the high dimensionality of the outer product. Multi-modal compact bilinear pooling (MCB) has been proposed in the prior art to approximate the outer-product result while reducing its dimensionality; notably, MCB also has no trainable parameters. As shown in Fig. 5b, we use compact bilinear pooling to directly fuse ê_f and ê_v into e_p. Implementation details of compact bilinear pooling can be found in the prior art, and the present invention is not limited in this regard.
2.1.3 Gated multi-modal fusion
In this section, we use gates to control the information flow from the face and speech modalities, which we call gated multi-modal fusion (GATE). As shown in Fig. 5c, given the face and speech feature embeddings e_f and e_v, a gate vector z ∈ R^D is computed as:

z = σ(f_att([e_f, e_v]))

Then we use the gate vector z to fuse ê_f and ê_v into e_p, where ⊙ denotes the element-wise product:

e_p = z ⊙ ê_f + (1 - z) ⊙ ê_v
2.2 loss function
In this section, we will introduce a loss function for optimizing the proposed multimodal fusion system.
2.2.1 Contrastive loss with an aggressive sampling strategy
The original contrastive loss is defined as:

L_con = (1/(2N)) Σ_{y=1} D^2 + (1/(2M)) Σ_{y=0} max(m - D, 0)^2

where D is the distance between the two embeddings of a pair, N and M are the numbers of positive and negative pairs in a batch, y = 1 and y = 0 denote positive and negative pairs respectively, and m is the margin. In our experiments, we use cosine similarity to measure the distance between the embeddings of a pair.
Adjusting the margin m in the original contrastive loss makes the loss focus more on hard negative pairs, but hard positive pairs are not considered. Here we introduce a more aggressive sampling strategy: during training, after the forward pass of the neural network, we compute the loss using only the γN hardest positive pairs and the γM hardest negative pairs (γ ∈ (0, 1)). The contrastive loss with the new sampling strategy can be defined as:

L_con-new = (1/(2γN)) Σ_{y=1, D ≥ D_p_low} D^2 + (1/(2γM)) Σ_{y=0, D ≤ D_n_high} max(m - D, 0)^2

where D_p_low denotes the minimum distance among the selected "hardest" positive pairs and D_n_high denotes the maximum distance among the selected "hardest" negative pairs.
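A sketch of the contrastive loss with this aggressive sampling strategy is given below, using cosine distance as described above; the exact normalization over the selected subsets and the value of γ are assumptions consistent with the text.

```python
import torch

def contrastive_loss_hard(emb1, emb2, labels, margin=0.05, gamma=0.5):
    """emb1, emb2: (B, D) embedding pairs; labels: (B,) with 1 for positive
    (same person) and 0 for negative pairs. Only the hardest gamma fraction
    of positive and negative pairs contributes to the loss. Assumes the
    batch contains at least one positive and one negative pair."""
    d = 1.0 - torch.nn.functional.cosine_similarity(emb1, emb2, dim=-1)  # cosine distance

    pos_d = d[labels == 1]
    neg_d = d[labels == 0]

    # Hardest positives: largest distances. Hardest negatives: smallest distances.
    k_pos = max(1, int(gamma * pos_d.numel()))
    k_neg = max(1, int(gamma * neg_d.numel()))
    hard_pos = torch.topk(pos_d, k_pos, largest=True).values
    hard_neg = torch.topk(neg_d, k_neg, largest=False).values

    pos_term = (hard_pos ** 2).sum() / (2 * k_pos)
    neg_term = (torch.clamp(margin - hard_neg, min=0) ** 2).sum() / (2 * k_neg)
    return pos_term + neg_term

loss = contrastive_loss_hard(torch.randn(32, 512), torch.randn(32, 512),
                             torch.randint(0, 2, (32,)))
```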
2.2.2 Additive angular margin loss
In addition, we also tried the popular additive angular margin (AAM) softmax loss in our experiments. For a sample with person identity label y_s, the loss is defined as:

L_AAM = -log( exp(s·cos(θ_{y_s} + m)) / ( exp(s·cos(θ_{y_s} + m)) + Σ_{j≠y_s} exp(s·cos θ_j) ) )

where m is the additive margin and s is the scale parameter, which helps the model converge faster. In our experiments, s is set to 32 and m to 0.6 in the fusion systems.
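A sketch of the additive angular margin softmax loss with the reported values s = 32 and m = 0.6 follows; the weight normalization details follow the standard AAM-Softmax formulation and are not claimed to match the patent's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmaxLoss(nn.Module):
    def __init__(self, emb_dim=512, n_classes=5994, s=32.0, m=0.6):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, emb_dim))
        self.s, self.m = s, m

    def forward(self, emb, labels):
        # Cosine similarity between normalized embeddings and normalized class weights.
        cos = F.linear(F.normalize(emb), F.normalize(self.weight)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        # Add the angular margin only to the target-class angle, then scale.
        target = F.one_hot(labels, cos.shape[1]).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos) * self.s
        return F.cross_entropy(logits, labels)

criterion = AAMSoftmaxLoss()
loss = criterion(torch.randn(8, 512), torch.randint(0, 5994, (8,)))
```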
2.3 Embedding-level augmentation for noisy evaluation
2.3.1 noisy evaluation set construction
Information from different modalities is not always available or significant enough to perform the verification task. In practical applications, a modality is often damaged or missing due to certain unavoidable external factors (e.g., ambient light, human movement, or background noise). To address this situation, we constructed a noisy evaluation set based on the VoxCeleb1 evaluation set.
For the image data, we use vertical and horizontal motion blur to mimic human motion in front of the lens, and Gaussian blur to mimic other noise. For the audio data, the three types of noise in MUSAN are mixed with the original data to generate corrupted audio samples. We also consider the complete absence of one modality by directly setting the corresponding extracted embedding to a zero vector. The detailed flow of constructing this dataset is given in Algorithm 1.
Algorithm 1: Noisy evaluation set construction
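The following snippet is not Algorithm 1 itself, but a rough sketch of the image-corruption operations described above (motion blur and Gaussian blur), using OpenCV; the kernel sizes and the input file name are illustrative assumptions.

```python
import cv2
import numpy as np

def motion_blur(img, ksize=9, horizontal=True):
    """Approximate subject/camera motion with a 1-D averaging kernel."""
    kernel = np.zeros((ksize, ksize), dtype=np.float32)
    if horizontal:
        kernel[ksize // 2, :] = 1.0 / ksize
    else:
        kernel[:, ksize // 2] = 1.0 / ksize
    return cv2.filter2D(img, -1, kernel)

def gaussian_blur(img, ksize=9):
    return cv2.GaussianBlur(img, (ksize, ksize), 0)

img = cv2.imread("face.jpg")                       # an aligned face crop (assumed path)
corrupted = [motion_blur(img, horizontal=True),    # horizontal motion blur
             motion_blur(img, horizontal=False),   # vertical motion blur
             gaussian_blur(img)]                   # Gaussian blur
```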
2.3.2 Embedding-level augmentation
To build a system that is more robust to corrupted audio-visual data, an additional embedding-level augmentation strategy is proposed in this work. In our previous work, we used deep generative models such as the generative adversarial network (GAN) or the variational autoencoder (VAE) to model the distribution of noisy speaker embeddings. Here, a simple statistics-based distribution matching algorithm is used.
We randomly select 100,000 recordings from the training set (which contains 1,092,009 recordings in total) and generate the different types of corrupted data from them. Then, for each noise type, we assume that the difference between the noisy embedding and the original embedding can be described by a Gaussian distribution. After estimating the parameters of this noise distribution, we sample noise from it and add it directly to the original embeddings to generate noisy embeddings. We refer to this embedding-level augmentation method as Noise Distribution Matching (NDM). Compared with adding noise directly to the whole training set and re-extracting the augmented embeddings, NDM uses only a small portion of the training data and augments the embeddings directly, which saves time and disk space. Furthermore, we still use the zero vector to model the missing-modality case.
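A minimal sketch of the NDM idea under the simplifying assumption that the per-dimension noise is an independent Gaussian; the actual estimation and sampling details of the patented method may differ.

```python
import torch

def fit_ndm(clean_emb, noisy_emb):
    """Estimate a Gaussian over the difference (noisy - clean) for one noise type.
    clean_emb, noisy_emb: (N, D) embeddings of the same recordings."""
    diff = noisy_emb - clean_emb
    return diff.mean(dim=0), diff.std(dim=0)

def ndm_augment(clean_emb, mean, std):
    """Sample noise from the fitted distribution and add it to clean embeddings."""
    noise = torch.randn_like(clean_emb) * std + mean
    return clean_emb + noise

# Fit on a small subset (e.g. 100k recordings), then augment the full training set.
mean, std = fit_ndm(torch.randn(1000, 512), torch.randn(1000, 512))
augmented = ndm_augment(torch.randn(64, 512), mean, std)

# Missing-modality case: the embedding is simply replaced by a zero vector.
missing = torch.zeros(64, 512)
```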
3. Experimental setup
3.1 Datasets
In our experiments we use both the visual and the audio data from the VoxCeleb1 and VoxCeleb2 datasets. For training, we use the dev portion of the VoxCeleb2 dataset, which includes 5,994 speakers and 1,092,009 utterances. VoxCeleb1 is used as the evaluation set, and all three official test lists, VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H, are used for evaluation. Note that the visual data of the official VoxCeleb1 dataset is incomplete; we downloaded the missing visual data from YouTube and published it.
3.2 Experimental setup
3.2.1 Single Modal System
For the audio data, 40-dimensional Fbank features are extracted with the Kaldi toolkit, and silence frames are removed with an energy-based voice activity detector. We then apply CMN to the Fbank features using a sliding window of size 300. For the video data, we extract 1 frame per second. We then use MTCNN to detect facial landmarks and use a similarity transformation to map the face regions to the same shape (3x112x96). Finally, we normalize the pixel values of each image to [0, 1] and subtract 0.5 to map the value range to [-0.5, 0.5].
During training, the Fbank features of an utterance are divided into chunks with chunk sizes from 200 to 400 frames. During testing, we extract one speech embedding for each recording, and average the face embeddings of a recording to obtain its face representation.
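A small sketch of the training-time chunking and the test-time face-embedding averaging described above; the chunk-size range is taken from the text, everything else is illustrative.

```python
import random
import torch

def random_chunk(fbank, min_len=200, max_len=400):
    """fbank: (T, 40) Fbank features of one utterance; returns a random training chunk."""
    chunk_len = random.randint(min_len, min(max_len, fbank.shape[0]))
    start = random.randint(0, fbank.shape[0] - chunk_len)
    return fbank[start:start + chunk_len]

def face_representation(face_embeddings):
    """face_embeddings: (num_frames, 512) face embeddings of one recording."""
    return face_embeddings.mean(dim=0)

chunk = random_chunk(torch.randn(1000, 40))             # 200-400 frame training chunk
face_repr = face_representation(torch.randn(30, 512))   # averaged test-time face embedding
```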
In our experiments, a 50-layer SE-ResNet is used for the face system and a 34-layer ResNet for the voice system. The embeddings of both systems are 512-dimensional. The AAM loss with margin m = 0.2 is used to optimize both systems.
3.2.2 multimodal System
Facial and speech feature embeddings are extracted with the single-modality systems for all recordings in the training set. Then, L2 normalization is applied to all embeddings to build a new training set for the audio-visual multi-modal system.
For the SSA fusion system, each translation layer consists of two fully connected layers with 512 units, and the attention layer is a fully connected layer with two units. For compact bilinear fusion and gated multi-modal fusion, each translation layer is a fully connected layer with 512 units. The attention layer in the gated multi-modal fusion system consists of two fully connected layers with 32 and 512 units, respectively. Between all adjacent fully connected layers above, we insert a batch normalization layer and a ReLU layer.
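The layer configurations above could be written as the following PyTorch module stacks; the batch normalization and ReLU placement between adjacent fully connected layers follows the description, while the attention-layer input sizes are assumptions.

```python
import torch.nn as nn

# SSA fusion system: each translation layer is two 512-unit FC layers
# with BatchNorm + ReLU in between; the attention layer has two units.
ssa_translation = nn.Sequential(
    nn.Linear(512, 512), nn.BatchNorm1d(512), nn.ReLU(),
    nn.Linear(512, 512),
)
ssa_attention = nn.Linear(2 * 512, 2)

# Compact bilinear / gated fusion: a single 512-unit translation layer.
gate_translation = nn.Linear(512, 512)

# Gated fusion attention layer: FC layers with 32 and 512 units.
gate_attention = nn.Sequential(
    nn.Linear(2 * 512, 32), nn.BatchNorm1d(32), nn.ReLU(),
    nn.Linear(32, 512),
)
```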
4. Results and analysis
4.1 evaluation of embedding layer multimodal fusion
In order to fuse information in face and voice modalities, different fusion strategies and different loss functions were explored and compared in our embedded-level fusion system. The results and analysis are presented in this section.
The results of the single-modality systems are shown at the top of Table 1. We find that the face and speech single-modality systems are roughly comparable. As shown in the third row of Table 1, simple score averaging of the two single-modality systems greatly surpasses either single-modality system, indicating a strong complementary effect between the audio and visual modalities.
4.1.1 loss function comparison
First, the SSA fusion strategy supervised by the contrastive loss is investigated. However, as shown in the middle of Table 1, the system based on the original contrastive loss does not converge well in our experiments, and the performance of the fused system is even much worse than the single-modality systems. To improve the contrastive loss, the revision with the more aggressive sampling strategy introduced in Section 2.2.1 is used, which gives much better results (SSA + Con-new). To demonstrate the effectiveness of the new strategy more intuitively, the distributions of the distances between positive and negative pairs are shown in Figs. 6a and 6b, where Fig. 6a plots the distances under the original contrastive loss and Fig. 6b plots the distances under the new contrastive loss. They show that the new contrastive loss enlarges the gap between the positive and negative distances. Furthermore, in addition to the contrastive losses, we also use the classification-based AAM-Softmax loss for multi-modal system optimization, which far outperforms the contrastive losses. AAM-Softmax and the new contrastive loss are mainly used in the following experiments.
4.1.2 fusion strategy comparison
In this section, the different fusion strategies introduced in Section 2.1 are compared, with the AAM-Softmax loss or the new contrastive loss providing the supervision signal. The results are shown in the middle part of Table 1. All three fusion strategies achieve significant improvements compared with the single-modality systems, and the gated multi-modal fusion architecture performs best among them. However, simple score averaging still performs best overall, which is not consistent with the findings in [23]. A possible reason is that we have more powerful single-modality systems in this work: using the same VoxCeleb2 test list, our face and speech EERs are 4.08% and 3.43%, respectively, while the corresponding numbers in [23] are 14.5% and 8.03%. The larger difference can also be attributed to different experimental settings: we adopt segment-level optimization in our systems, while the authors in [23] use a frame-level embedding extractor for online verification.
Further, when we use the AAM loss and the new contrastive loss together, a further improvement is obtained, and the performance on the VoxCeleb1-E and VoxCeleb1-H lists exceeds the score-averaging results, as shown in the penultimate row of Table 1. Surprisingly, we find that the proposed fusion system is complementary to the simple score-averaging system: when we further average the score of the GATE + AAM + Con-new fusion system with the average score of the single-modality systems, the best system performance is obtained. To our knowledge, this is also the best published result for person verification on the VoxCeleb1 evaluation dataset.
Table 1: Comparison of results using different fusion strategies and losses. Con-orig: the original contrastive loss. Con-new: the proposed contrastive loss with the more aggressive sampling strategy. m in Con-orig is set to 0.5 and m in Con-new is set to 0.05.
4.2 Evaluation with corrupted and missing modalities
To test the fusion systems under more complex realistic conditions in which one modality is corrupted or missing, they are evaluated on the noisy evaluation set described in Section 2.3.1, and the results are shown in Table 2. We find that the simple score-averaging operation can still significantly improve performance, and the proposed multi-modal fusion system trained with augmented embeddings achieves the best results in this case. Furthermore, the audio-visual fusion system trained only on clean embeddings cannot distinguish noisy embeddings from clean ones well, and its results are somewhat worse. Note that the results in parentheses indicate that the proposed fusion system trained with augmented embeddings still performs well on the clean evaluation set.
Table 2: Comparison of results (EER%) on the noisy evaluation set. The GATE + AAM + Con-new fusion system is used here. Train_Clean: the fusion system is trained with clean embeddings. Train_Noise: the fusion system is trained with augmented noisy embeddings. The results in parentheses are tested on the clean evaluation set.
5. Conclusion
In this work, we explore different multi-modal fusion strategies and loss functions for person verification systems that efficiently combine audio-visual information at the embedding level. Based on powerful single-modality systems, our best system achieves 0.585%, 0.427%, and 0.735% EER on the three official test lists of VoxCeleb1, which, to our knowledge, are the best published results on this dataset. We also introduce an embedding-level data augmentation method that helps the audio-visual multi-modal person verification system perform well when certain modalities are corrupted or missing.
Fig. 7 is a schematic hardware structure diagram of an electronic device for performing a speaker authentication method according to another embodiment of the present application, and as shown in fig. 7, the electronic device includes:
one or more processors 710 and a memory 720, one processor 710 being illustrated in fig. 7.
The apparatus performing the speaker authentication method may further include: an input device 730 and an output device 740.
The processor 710, the memory 720, the input device 730, and the output device 740 may be connected by a bus or other means, such as the bus connection in fig. 7.
The memory 720, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the speaker identification verification method in the embodiments of the present application. The processor 710 executes various functional applications and data processing of the server by executing nonvolatile software programs, instructions and modules stored in the memory 720, so as to implement the speaker identity authentication method of the above-described method embodiments.
The memory 720 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the speaker authentication apparatus, and the like. Further, the memory 720 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 720 optionally includes memory located remotely from processor 710, which may be connected to the speaker authentication device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 730 may receive input numeric or character information and generate signals related to user settings and function control of the speaker authentication device. The output device 740 may include a display device such as a display screen.
The one or more modules are stored in the memory 720 and, when executed by the one or more processors 710, perform the speaker identity verification method of any of the method embodiments described above.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication functions and are primarily targeted at providing voice and data communications. Such terminals include smart phones (e.g., iphones), multimedia phones, functional phones, and low-end phones, among others.
(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as ipads.
(3) Portable entertainment devices such devices may display and play multimedia content. Such devices include audio and video players (e.g., ipods), handheld game consoles, electronic books, as well as smart toys and portable car navigation devices.
(4) The server is similar to a general computer architecture, but has higher requirements on processing capability, stability, reliability, safety, expandability, manageability and the like because of the need of providing highly reliable services.
(5) And other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the technical solutions in essence or part contributing to the related art can be embodied in the form of a software product, which can be stored in a computer readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes several instructions for causing a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the method according to various embodiments or some parts of embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present application.

Claims (10)

1. A speaker identity verification method, comprising:
acquiring audio data and facial image data of the speaker;
extracting voice feature embeddings from the audio data, extracting facial feature embeddings from the facial image data;
determining identity feature embedding according to the voice feature embedding and the facial feature embedding for speaker identity verification;
wherein said determining identity feature embedding from said voice feature embedding and said facial feature embedding comprises:
inputting the speech feature embedding e_v into a first embedded feature conversion layer f_trans_v to obtain a preprocessed speech feature embedding ê_v;
inputting the facial feature embedding e_f into a second embedded feature conversion layer f_trans_f to obtain a preprocessed facial feature embedding ê_f;
whereby ê_v and ê_f lie in a common embedding space and are more suitable for the later fusion;
and performing fusion processing on the preprocessed voice feature embedding and the preprocessed face feature embedding to obtain identity feature embedding.
2. The method of claim 1, wherein the fusing the pre-processed speech feature embedding and pre-processed facial feature embedding to obtain identity feature embedding comprises:
determining, by an attention layer f_att, attention scores from the speech feature embedding and the facial feature embedding:

s = f_att([e_f, e_v])

determining weighting coefficients from the attention scores:

[w_f, w_v] = softmax(s)

determining the identity feature embedding from the weighting coefficients, the preprocessed speech feature embedding and the preprocessed facial feature embedding:

e_p = w_f · ê_f + w_v · ê_v

wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, and ê_f is the preprocessed facial feature embedding.
3. The method of claim 1, wherein the fusing the pre-processed speech feature embedding and pre-processed facial feature embedding to obtain identity feature embedding comprises:
and performing fusion processing on the preprocessed voice feature embedding and the preprocessed face feature embedding in a compact bilinear pooling mode to obtain identity feature embedding.
4. The method of claim 1, wherein the fusing the pre-processed speech feature embedding and pre-processed facial feature embedding to obtain identity feature embedding comprises:
determining a gate vector from the speech feature embedding and the facial feature embedding:

z = σ(f_att([e_f, e_v]))

fusing the preprocessed speech feature embedding and the preprocessed facial feature embedding with the gate vector to obtain the identity feature embedding:

e_p = z ⊙ ê_f + (1 - z) ⊙ ê_v

wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, ê_f is the preprocessed facial feature embedding, and ⊙ denotes the element-wise product.
5. A speaker authentication system, comprising:
the audio-visual data acquisition module is used for acquiring audio data and facial image data of the speaker;
the feature extraction module is used for extracting voice feature embedding from the audio data and extracting facial feature embedding from the facial image data;
the identity characteristic embedding determination module is used for determining identity characteristic embedding according to the voice characteristic embedding and the facial characteristic embedding so as to carry out speaker identity verification;
wherein the identity feature embedding determination module comprises:
a first embedded feature conversion layer f_trans_v, configured to convert the speech feature embedding e_v into a preprocessed speech feature embedding ê_v;
a second embedded feature conversion layer f_trans_f, configured to convert the facial feature embedding e_f into a preprocessed facial feature embedding ê_f;
whereby ê_v and ê_f lie in a common embedding space and are more suitable for the later fusion;
and the fusion module is used for carrying out fusion processing on the preprocessed voice characteristic embedding and the preprocessed face characteristic embedding so as to obtain identity characteristic embedding.
6. The system of claim 5, wherein the fusing the pre-processed speech feature embedding and pre-processed facial feature embedding to obtain identity feature embedding comprises:
determining, by an attention layer f_att, attention scores from the speech feature embedding and the facial feature embedding:

s = f_att([e_f, e_v])

determining weighting coefficients from the attention scores:

[w_f, w_v] = softmax(s)

determining the identity feature embedding from the weighting coefficients, the preprocessed speech feature embedding and the preprocessed facial feature embedding:

e_p = w_f · ê_f + w_v · ê_v

wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, and ê_f is the preprocessed facial feature embedding.
7. The system of claim 5, wherein the fusing of the preprocessed speech feature embedding and the preprocessed facial feature embedding to obtain the identity feature embedding comprises:
fusing the preprocessed speech feature embedding and the preprocessed facial feature embedding by compact bilinear pooling to obtain the identity feature embedding.
8. The system of claim 5, wherein the fusing of the preprocessed speech feature embedding and the preprocessed facial feature embedding to obtain the identity feature embedding comprises:
determining a gate vector from the speech feature embedding and the facial feature embedding:
z = σ(f_att([e_f, e_v]))
fusing the preprocessed speech feature embedding and the preprocessed facial feature embedding with the gate vector to obtain the identity feature embedding:
e = z ⊙ ê_f + (1 − z) ⊙ ê_v
wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, ê_f is the preprocessed facial feature embedding, and ⊙ denotes the element-wise product.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-4.
10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.
CN202010705582.9A 2020-07-21 2020-07-21 Speaker identity verification method and system Active CN111862990B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010705582.9A CN111862990B (en) 2020-07-21 2020-07-21 Speaker identity verification method and system

Publications (2)

Publication Number Publication Date
CN111862990A CN111862990A (en) 2020-10-30
CN111862990B true CN111862990B (en) 2022-11-11

Family

ID=73000790

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010705582.9A Active CN111862990B (en) 2020-07-21 2020-07-21 Speaker identity verification method and system

Country Status (1)

Country Link
CN (1) CN111862990B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115273859B (en) * 2021-04-30 2024-05-28 清华大学 Safety testing method and device for voice verification device
CN114995657B (en) * 2022-07-18 2022-10-21 湖南大学 Multimode fusion natural interaction method, system and medium for intelligent robot
CN116504226B (en) * 2023-02-27 2024-01-02 佛山科学技术学院 Lightweight single-channel voiceprint recognition method and system based on deep learning
CN117011924B (en) * 2023-10-07 2024-02-13 之江实验室 Method and system for estimating number of speakers based on voice and image
CN117155583B (en) * 2023-10-24 2024-01-23 清华大学 Multi-mode identity authentication method and system for incomplete information deep fusion

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020046831A1 (en) * 2018-08-27 2020-03-05 TalkMeUp Interactive artificial intelligence analytical system
CN109910818B (en) * 2019-02-15 2021-10-08 东华大学 Vehicle anti-theft system based on human body multi-feature fusion identity recognition
CN110674483B (en) * 2019-08-14 2022-05-13 广东工业大学 Identity recognition method based on multi-mode information
CN111310648B (en) * 2020-02-13 2023-04-11 中国科学院西安光学精密机械研究所 Cross-modal biometric feature matching method and system based on disentanglement expression learning
CN111246256B (en) * 2020-02-21 2021-05-25 华南理工大学 Video recommendation method based on multi-mode video content and multi-task learning

Also Published As

Publication number Publication date
CN111862990A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
CN111862990B (en) Speaker identity verification method and system
CN109637546B (en) Knowledge distillation method and apparatus
CN110956966B (en) Voiceprint authentication method, voiceprint authentication device, voiceprint authentication medium and electronic equipment
WO2021203880A1 (en) Speech enhancement method, neural network training method, and related device
CN112233698A (en) Character emotion recognition method and device, terminal device and storage medium
CN112149615B (en) Face living body detection method, device, medium and electronic equipment
CN113537005A (en) On-line examination student behavior analysis method based on attitude estimation
CN113343898B (en) Mask shielding face recognition method, device and equipment based on knowledge distillation network
CN113361396B (en) Multi-mode knowledge distillation method and system
WO2020124993A1 (en) Liveness detection method and apparatus, electronic device, and storage medium
CN112418166A (en) Emotion distribution learning method based on multi-mode information
CN114140885A (en) Emotion analysis model generation method and device, electronic equipment and storage medium
CN109214616B (en) Information processing device, system and method
CN114616565A (en) Living body detection using audio-visual disparity
CN111401259A (en) Model training method, system, computer readable medium and electronic device
CN111259759B (en) Cross-database micro-expression recognition method and device based on domain selection migration regression
CN110232927B (en) Speaker verification anti-spoofing method and device
CN115565548A (en) Abnormal sound detection method, abnormal sound detection device, storage medium and electronic equipment
CN116312512A (en) Multi-person scene-oriented audiovisual fusion wake-up word recognition method and device
WO2024131291A1 (en) Face liveness detection method and apparatus, device, and storage medium
CN113851113A (en) Model training method and device and voice awakening method and device
CN116522212B (en) Lie detection method, device, equipment and medium based on image text fusion
CN111414959A (en) Image recognition method and device, computer readable medium and electronic equipment
CN114596609B (en) Audio-visual falsification detection method and device
CN111898576B (en) Behavior identification method based on human skeleton space-time relationship

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
Applicant after: Sipic Technology Co.,Ltd.
Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
Applicant before: AI SPEECH Co.,Ltd.
GR01 Patent grant