CN111862990A - Speaker identity verification method and system - Google Patents

Speaker identity verification method and system

Info

Publication number
CN111862990A
CN111862990A (application CN202010705582.9A; granted as CN111862990B)
Authority
CN
China
Prior art keywords
feature embedding
embedding
facial
feature
identity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010705582.9A
Other languages
Chinese (zh)
Other versions
CN111862990B (en)
Inventor
钱彦旻
陈正阳
王帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AI Speech Ltd
Original Assignee
AI Speech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AI Speech Ltd filed Critical AI Speech Ltd
Priority to CN202010705582.9A priority Critical patent/CN111862990B/en
Publication of CN111862990A publication Critical patent/CN111862990A/en
Application granted granted Critical
Publication of CN111862990B publication Critical patent/CN111862990B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/22 Interactive procedures; Man-machine interfaces
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/08 Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G06V 40/166 Detection; Localisation; Normalisation using acquisition arrangements
    • G06V 40/168 Feature extraction; Face representation
    • G06V 40/172 Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Acoustics & Sound (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Collating Specific Patterns (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a speaker identity verification method comprising: acquiring audio data and facial image data of a speaker; extracting a voice feature embedding from the audio data and a facial feature embedding from the facial image data; and determining an identity feature embedding from the voice feature embedding and the facial feature embedding for speaker identity verification. By verifying a person's identity with multi-modal information (from the face and the voice), the invention overcomes the problem that verification in a single modality is easily disturbed by external factors and may therefore fail, and improves the success rate of identity verification.

Description

Speaker identity verification method and system
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a speaker identity verification method and a speaker identity verification system.
Background
Speaker identity verification methods in the prior art include voiceprint-based verification and face-recognition-based verification. These techniques use a single physiological characteristic of a person to verify that person's identity. However, a single physiological characteristic may, in some situations, be insufficient to distinguish a person. For example, in a noisy environment the voice of a particular person may be inaudible, and facial features may fail to distinguish a person when the head is turned or the person is moving.
Disclosure of Invention
The embodiments of the invention provide a speaker identity verification method and system, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a speaker identity verification method, including:
acquiring audio data and facial image data of the speaker;
extracting voice feature embeddings from the audio data, extracting facial feature embeddings from the facial image data;
and determining identity feature embedding according to the voice feature embedding and the facial feature embedding for speaker identity verification.
In a second aspect, an embodiment of the present invention provides a speaker identity verification system, including:
the audio-visual data acquisition module is used for acquiring audio data and facial image data of the speaker;
the feature extraction module is used for extracting voice feature embedding from the audio data and extracting facial feature embedding from the facial image data;
and the identity characteristic embedding determination module is used for determining identity characteristic embedding according to the voice characteristic embedding and the facial characteristic embedding so as to carry out speaker identity verification.
In a third aspect, an embodiment of the present invention provides a storage medium, in which one or more programs including execution instructions are stored, where the execution instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any speaker authentication method of the present invention.
In a fourth aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform any of the speaker verification methods of the present invention described above.
In a fifth aspect, the present invention further provides a computer program product, which includes a computer program stored on a storage medium, the computer program including program instructions, which when executed by a computer, cause the computer to execute any one of the above speaker authentication methods.
The embodiments of the invention have the following beneficial effect: a scheme is provided for verifying a person's identity with multi-modal information (from the face and the voice), which solves the problem that verification in a single modality is easily disturbed by external factors and may fail, and improves the success rate of identity verification.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flow chart of an embodiment of a speaker verification method of the present invention;
FIG. 2 is a flow chart of another embodiment of the speaker verification method of the present invention;
FIG. 3 is a functional block diagram of an embodiment of a speaker verification system of the present invention;
FIG. 4 is a functional block diagram of another embodiment of a speaker verification system of the present invention;
FIG. 5a is a schematic illustration of simple soft attention fusion as used in the present invention;
FIG. 5b is a schematic diagram of a compact bilinear pooling fusion used in the present invention;
FIG. 5c is a schematic diagram of the gated multi-modal fusion used in the present invention;
FIG. 6a is a graph of the distribution of distances between positive and negative pairs under the original contrastive loss in the present invention;
FIG. 6b is a graph of the distribution of distances between positive and negative pairs under the new contrastive loss in the present invention;
fig. 7 is a schematic structural diagram of an embodiment of an electronic device according to the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
As used in this disclosure, "module," "device," "system," and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, or software in execution. In particular, for example, an element may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. Also, an application or script running on a server, or a server, may be an element. One or more elements may be in a process and/or thread of execution and an element may be localized on one computer and/or distributed between two or more computers and may be operated by various computer-readable media. The elements may also communicate by way of local and/or remote processes based on a signal having one or more data packets, e.g., from a data packet interacting with another element in a local system, distributed system, and/or across a network in the internet with other systems by way of the signal.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The invention provides a speaker identity verification method, which can be used on a terminal device that has both face recognition and voiceprint recognition capabilities, for example a smart phone, a tablet computer, a smart speaker, a vehicle-mounted terminal, a smart robot and the like; the invention is not limited in this respect.
As shown in fig. 1, an embodiment of the present invention provides a speaker authentication method, including:
And S10, acquiring the audio data and the facial image data of the speaker.
Illustratively, the terminal device executing the method is a smart phone; the audio data of the speaker can be acquired through the microphone of the smart phone while the facial image of the speaker is captured by its camera.
And S20, extracting voice feature embedding from the audio data, and extracting facial feature embedding from the facial image data.
Illustratively, the speech feature embedding may be a voiceprint feature. The voiceprint feature embedding and the facial feature embedding can be extracted using a ResNet-34 and an SE-ResNet-50, respectively.
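As an illustration of this step, the following minimal sketch (in PyTorch) assumes two pretrained encoder modules are available, voice_encoder (a ResNet-34-style network over Fbank features) and face_encoder (an SE-ResNet-50-style network over a cropped face image); the module names, input layouts and 512-dimensional outputs are assumptions for illustration, not the exact networks of this embodiment.

```python
import torch
import torch.nn.functional as F

def extract_embeddings(fbank, face_img, voice_encoder, face_encoder):
    """Produce L2-normalized speech and face embeddings.

    fbank:    (1, T, 40) filter-bank features of one utterance (assumed layout)
    face_img: (1, 3, 112, 96) normalized face crop (assumed layout)
    Both encoders are assumed to return (1, 512) embeddings.
    """
    with torch.no_grad():
        e_v = voice_encoder(fbank)    # speech (voiceprint) feature embedding
        e_f = face_encoder(face_img)  # facial feature embedding
    # L2 normalization, as applied before fusion in this disclosure
    return F.normalize(e_v, dim=-1), F.normalize(e_f, dim=-1)
```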
And S30, determining identity feature embedding according to the voice feature embedding and the facial feature embedding for speaker identity verification.
The embodiment of the invention provides a scheme for verifying a person's identity with multi-modal information (from the face and the voice), which solves the problem that verification in a single modality is easily disturbed by external factors and may therefore fail, and improves the success rate of identity verification.
Fig. 2 is a flow chart of another embodiment of the speaker verification method of the present invention, in which the determining identity feature embedding according to the voice feature embedding and the facial feature embedding includes:
And S31, inputting the voice feature embedding into the first embedding feature conversion layer to obtain the pre-processing voice feature embedding.
And S32, inputting the facial feature embedding into a second embedded feature conversion layer to obtain preprocessed facial feature embedding.
Illustratively, the conversion layers f_trans_f and f_trans_v convert e_f and e_v into ê_f and ê_v, respectively:
ê_f = f_trans_f(e_f)
ê_v = f_trans_v(e_v)
After conversion, ê_f and ê_v lie in a common embedding space, which is more suitable for the later fusion.
And S33, carrying out fusion processing on the preprocessed voice feature embedding and the preprocessed face feature embedding to obtain identity feature embedding.
Illustratively, the fusing the pre-processed speech feature embedding and the pre-processed facial feature embedding to obtain the identity feature embedding comprises:
determining, by an attention layer f_att(·), an attention score from the speech feature embedding and the facial feature embedding:
[s_f, s_v] = f_att([e_f, e_v])
determining a weighting factor from the attention score:
[w_f, w_v] = softmax([s_f, s_v])
determining identity feature embedding from the weighting coefficients and the pre-processed speech feature embedding and the pre-processed facial feature embedding:
e_p = w_f · ê_f + w_v · ê_v
wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, and ê_f is the preprocessed facial feature embedding.
Illustratively, the fusing the pre-processed speech feature embedding and the pre-processed facial feature embedding to obtain the identity feature embedding comprises:
and performing fusion processing on the preprocessed voice feature embedding and the preprocessed face feature embedding in a compact bilinear pooling mode to obtain identity feature embedding.
Illustratively, the fusing the pre-processed speech feature embedding and the pre-processed facial feature embedding to obtain the identity feature embedding comprises:
determining a gate vector from the speech feature embedding and the facial feature embedding:
z = σ(f_att([e_f, e_v]))
adopting the gate vector to fuse the preprocessed voice feature embedding and the preprocessed face feature embedding to obtain identity feature embedding:
e_p = z ⊙ ê_f + (1 − z) ⊙ ê_v
wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, ê_f is the preprocessed facial feature embedding, and ⊙ denotes the element-wise product.
It should be noted that for simplicity of explanation, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will appreciate that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention. In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
As shown in fig. 3, an embodiment of the present invention further provides a speaker identity verification system 300, which may be used in a terminal device, where the terminal device has both functions of face recognition and voiceprint recognition, for example, the terminal device may be a smart phone, a tablet computer, a smart speaker, a car terminal, a smart robot, and the like, which is not limited in this respect.
As shown in fig. 3, the speaker authentication system 300 includes:
an audio-visual data acquisition module 310 for acquiring audio data and facial image data of the speaker;
a feature extraction module 320 for extracting speech feature embeddings from the audio data and facial feature embeddings from the facial image data;
an identity feature embedding determination module 330, configured to determine identity feature embedding according to the voice feature embedding and the facial feature embedding, for performing speaker authentication.
Fig. 4 is a schematic block diagram of another embodiment of the speaker verification system of the present invention, in which the identity embedding determination module includes:
a first embedded feature conversion layer 331, configured to perform preprocessing on the voice feature embedding to obtain a preprocessed voice feature embedding;
A second embedded feature conversion layer 332, configured to perform preprocessing on the facial feature embedding to obtain a preprocessed facial feature embedding;
and a fusion module 333, configured to perform fusion processing on the preprocessed voice feature embedding and the preprocessed facial feature embedding to obtain identity feature embedding.
Illustratively, the fusing the pre-processed speech feature embedding and the pre-processed facial feature embedding to obtain the identity feature embedding comprises:
determining, by an attention layer f_att(·), an attention score from the speech feature embedding and the facial feature embedding:
[s_f, s_v] = f_att([e_f, e_v])
determining a weighting factor from the attention score:
[w_f, w_v] = softmax([s_f, s_v])
determining identity feature embedding from the weighting coefficients and the pre-processed speech feature embedding and the pre-processed facial feature embedding:
e_p = w_f · ê_f + w_v · ê_v
wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, and ê_f is the preprocessed facial feature embedding.
Illustratively, the fusing the pre-processed speech feature embedding and the pre-processed facial feature embedding to obtain the identity feature embedding comprises:
and performing fusion processing on the preprocessed voice feature embedding and the preprocessed face feature embedding in a compact bilinear pooling mode to obtain identity feature embedding.
Illustratively, the fusing the pre-processed speech feature embedding and the pre-processed facial feature embedding to obtain the identity feature embedding comprises:
determining a gate vector from the speech feature embedding and the facial feature embedding:
z = σ(f_att([e_f, e_v]))
adopting the gate vector to fuse the preprocessed voice feature embedding and the preprocessed face feature embedding to obtain identity feature embedding:
e_p = z ⊙ ê_f + (1 − z) ⊙ ê_v
wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, ê_f is the preprocessed facial feature embedding, and ⊙ denotes the element-wise product.
In some embodiments, the present invention provides a non-transitory computer-readable storage medium, in which one or more programs including executable instructions are stored, where the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any one of the above speaker authentication methods of the present invention.
In some embodiments, the present invention also provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any of the speaker authentication methods described above.
In some embodiments, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a speaker verification method.
In some embodiments, the present invention further provides a storage medium having a computer program stored thereon, wherein the computer program is executed by a processor to implement the speaker authentication method.
The speaker authentication system according to the embodiment of the present invention may be used to execute the speaker authentication method according to the embodiment of the present invention, and accordingly achieve the technical effect achieved by the speaker authentication method according to the embodiment of the present invention, and will not be described herein again. In the embodiment of the present invention, the relevant functional module may be implemented by a hardware processor (hardware processor).
In order to describe the technical solutions of the present invention more clearly and to demonstrate their effectiveness and advantages over the prior art more directly, the technical background, the technical solutions and the experiments carried out are described in more detail below.
Abstract
Information from different modalities usually complements each other. In the present invention, we use the audio-visual data in the VoxCeleb data set for person verification. We explore different information fusion strategies and loss functions for audio-visual person verification systems at the embedding level. System performance is evaluated on the public test lists of the VoxCeleb1 data set. Our best embedding-level audio-visual fusion system achieves 0.585%, 0.427% and 0.735% EER on the three public test lists of VoxCeleb1, which, to our knowledge, are the best results reported on this data set. Furthermore, to mimic more complex test environments with corrupted or missing modalities, we construct a noisy evaluation set based on the VoxCeleb1 data set. We use an embedding-level data augmentation strategy to help our audio-visual system distinguish noisy embeddings from clean ones. With this data augmentation strategy, the proposed audio-visual person verification system is more robust on the noisy evaluation set.
1. Introduction to
A variety of biometric characteristics can be used to verify a person's identity, and the voice and the face are two typical ones. Face verification and speaker verification are intensively studied subjects in the field of biometric identification. Deep learning techniques have recently greatly improved the performance of both tasks. Over the past few years, researchers have studied different architectures and different loss functions, resulting in well-performing systems that can even be commercialized for real-world applications.
Despite the success in single-modality applications, multi-modality learning has attracted increasing attention in both academic and industrial areas. The motivation has two aspects:
1. complementary information from different modalities may improve system performance.
2. Models built from multiple modalities tend to be more robust and fault tolerant and can repair or suppress faults in a single mode.
In the present invention, cross-modal integration is performed at the embedding level, using more powerful segment-level trained speaker embeddings. Different fusion strategies and loss functions are studied and compared within a multi-modal learning framework.
Furthermore, to mimic real-world scenarios, we construct a noisy evaluation set in which one modality is corrupted or missing. To compensate for the resulting performance degradation, a new embedding-level data augmentation method called noise distribution matching (NDM) is proposed, which greatly improves performance under noisy conditions.
All systems are evaluated on the standard VoxCeleb1 data set, and our best multi-modal system achieves EERs of 0.585%, 0.427% and 0.735% on the three test lists (VoxCeleb1-O, VoxCeleb1-E and VoxCeleb1-H), respectively. To our knowledge, these are the best results reported on this data set. Furthermore, the NDM-based multi-modal system shows the ability to select the more reliable modality information when evaluated on the noisy evaluation set.
2. Methods
2.1 Embedding-level multi-modal fusion
In this section, we describe three methods of fusing the facial feature embedding e_f and the speech feature embedding e_v into an identity feature embedding e_p. As shown in FIGS. 5a to 5c, e_f and e_v are first converted by the conversion layers f_trans_f and f_trans_v into ê_f and ê_v, respectively:
ê_f = f_trans_f(e_f)
ê_v = f_trans_v(e_v)
After conversion, ê_f and ê_v lie in a common embedding space, which is more suitable for the later fusion.
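A minimal sketch of one possible conversion layer, assuming the fully connected configuration described in Section 3.2.2 (512-unit linear layers with batch normalization and ReLU in between); the exact depth and sizes here are illustrative, not a definitive implementation.

```python
import torch.nn as nn

class ConversionLayer(nn.Module):
    """Maps a single-modality embedding into the common embedding space."""
    def __init__(self, in_dim=512, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.BatchNorm1d(out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, e):
        return self.net(e)

# one conversion layer per modality: f_trans_f for faces, f_trans_v for voices
f_trans_f, f_trans_v = ConversionLayer(), ConversionLayer()
```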
2.1.1 simple Soft attention fusion
In this section, we first introduce simple soft attention (SSA) fusion. As shown in FIG. 5a, given the facial and speech feature embeddings e_f and e_v, the attention layer f_att(·) produces the attention scores [s_f, s_v], defined as:
[s_f, s_v] = f_att([e_f, e_v])
The fused embedding is then computed as a weighted sum:
[w_f, w_v] = softmax([s_f, s_v]),  e_p = w_f · ê_f + w_v · ê_v
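A minimal PyTorch sketch of SSA fusion as defined above; the two-unit attention layer follows Section 3.2.2, and the softmax over the two scores is an assumption consistent with the weighted sum.

```python
import torch
import torch.nn as nn

class SimpleSoftAttentionFusion(nn.Module):
    def __init__(self, emb_dim=512):
        super().__init__()
        # f_att: maps the concatenated raw embeddings to two attention scores
        self.f_att = nn.Linear(2 * emb_dim, 2)

    def forward(self, e_f, e_v, e_f_hat, e_v_hat):
        scores = self.f_att(torch.cat([e_f, e_v], dim=-1))  # (B, 2)
        w = torch.softmax(scores, dim=-1)                   # weighting coefficients
        # weighted sum of the converted embeddings
        e_p = w[:, 0:1] * e_f_hat + w[:, 1:2] * e_v_hat
        return e_p
```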
2.1.2 compact bilinear pooling fusion
Bilinear pooling exploits the outer product operation to fully explore the relationship between two vectors and does not involve training parameters. However, it is generally not feasible in practice due to the high dimension of the outer product. A method called multi-modal compact bilinear pooling (MCB) has been proposed in the prior art to approximate the outer product result and, at the same time, reduce its dimension. Notably, there are no training parameters in MCB either. As shown in FIG. 5b, we directly fuse ê_f and ê_v into e_p with compact bilinear pooling. Details of the implementation of compact bilinear pooling can be found in the prior art, and the present invention is not limited in this regard.
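A minimal sketch of compact bilinear pooling using the Count Sketch / FFT approximation commonly associated with MCB; the fixed random hash and sign vectors carry no trainable parameters, and the output dimension is an illustrative choice rather than the value used in the experiments.

```python
import torch

class CompactBilinearPooling(torch.nn.Module):
    def __init__(self, in_dim=512, out_dim=1024):
        super().__init__()
        # fixed random hashes and signs for the two inputs (no training parameters)
        for name in ("f", "v"):
            self.register_buffer(f"h_{name}", torch.randint(out_dim, (in_dim,)))
            self.register_buffer(f"s_{name}", torch.randint(2, (in_dim,)).float() * 2 - 1)
        self.out_dim = out_dim

    def _count_sketch(self, x, h, s):
        sketch = x.new_zeros(x.size(0), self.out_dim)
        return sketch.index_add_(1, h, x * s)   # scatter signed values into d bins

    def forward(self, e_f_hat, e_v_hat):
        pf = torch.fft.rfft(self._count_sketch(e_f_hat, self.h_f, self.s_f))
        pv = torch.fft.rfft(self._count_sketch(e_v_hat, self.h_v, self.s_v))
        # element-wise product in the frequency domain approximates the outer product
        return torch.fft.irfft(pf * pv, n=self.out_dim)
```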
2.1.3 Portal multimodal fusion
In this section, we use gates to control the information flow of the face and speech modalities, which we call gated multi-modal fusion (GATE). As shown in FIG. 5c, given the facial and speech feature embeddings e_f and e_v, a gate vector z ∈ R^D is computed as:
z = σ(f_att([e_f, e_v]))
Then, we use the gate vector z to fuse ê_f and ê_v into e_p, where ⊙ denotes the element-wise product:
e_p = z ⊙ ê_f + (1 − z) ⊙ ê_v
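A minimal sketch of gated multi-modal fusion as defined above; the 32- and 512-unit gate layers follow Section 3.2.2, and the sigmoid gate is applied element-wise.

```python
import torch
import torch.nn as nn

class GatedMultimodalFusion(nn.Module):
    def __init__(self, emb_dim=512, hidden=32):
        super().__init__()
        # f_att: produces a D-dimensional gate from the concatenated raw embeddings
        self.f_att = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.Linear(hidden, emb_dim),
        )

    def forward(self, e_f, e_v, e_f_hat, e_v_hat):
        z = torch.sigmoid(self.f_att(torch.cat([e_f, e_v], dim=-1)))  # gate vector
        return z * e_f_hat + (1 - z) * e_v_hat  # element-wise gated fusion
```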
2.2 loss function
In this section, we will introduce a loss function for optimizing the proposed multimodal fusion system.
2.2.1 Contrastive loss with an aggressive sampling strategy
The original contrastive loss is defined as:
L_con = (1/N) · Σ_{y=1} D² + (1/M) · Σ_{y=0} max(m − D, 0)²
where D is the distance between the two embeddings of a pair, N and M are the numbers of positive and negative pairs in a batch, y = 1 and y = 0 denote positive and negative pairs, respectively, and m is the margin. In our experiments, we use cosine similarity to measure the distance between the embeddings of a pair.
Adjusting the margin m in the original contrastive loss makes the loss focus more on difficult negative pairs. However, difficult positive pairs are not taken into account. Here we introduce a more aggressive sampling strategy: during training, after the forward pass of the neural network, we compute the loss using only the γN most difficult positive pairs and the γM most difficult negative pairs (γ ∈ (0, 1)), and the contrastive loss with the new sampling strategy is defined accordingly (the equation is provided as a figure in the original document), where D_p_low denotes the smallest distance among the "hardest" positive pairs and D_n_high denotes the largest distance among the "hardest" negative pairs.
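A minimal sketch of the aggressive sampling idea: only the γN hardest positive pairs and the γM hardest negative pairs of a batch contribute to a contrastive-style penalty on cosine similarities. The hinge form used below is an illustrative assumption, not the exact published loss, and the batch is assumed to contain both positive and negative pairs.

```python
import torch

def aggressive_contrastive_loss(sim, labels, margin=0.05, gamma=0.5):
    """sim:    (P,) cosine similarities of embedding pairs in a batch
       labels: (P,) 1 for positive pairs, 0 for negative pairs"""
    pos, neg = sim[labels == 1], sim[labels == 0]
    k_pos = max(1, int(gamma * pos.numel()))
    k_neg = max(1, int(gamma * neg.numel()))
    hard_pos = torch.topk(pos, k_pos, largest=False).values  # least similar positives
    hard_neg = torch.topk(neg, k_neg, largest=True).values   # most similar negatives
    loss_pos = (1.0 - hard_pos).pow(2).mean()                     # pull hard positives toward similarity 1
    loss_neg = torch.clamp(hard_neg - margin, min=0).pow(2).mean()  # push hard negatives below the margin
    return loss_pos + loss_neg
```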
2.2.2 Additive angular margin loss
In addition, we also tried the popular additive angular margin (AAM) loss in our experiments. For a training sample with person identity label y_i, the loss is defined as:
L_AAM = −(1/N) · Σ_i log( e^{s·cos(θ_{y_i} + m)} / ( e^{s·cos(θ_{y_i} + m)} + Σ_{j≠y_i} e^{s·cos θ_j} ) )
where m is the additive margin and s is a scale parameter that helps the model converge faster. In our experiments, s is set to 32 and m is set to 0.6 in the fusion system.
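A minimal sketch of an AAM-softmax (ArcFace-style) classification head consistent with the formula above; s and m correspond to the scale and margin mentioned in the text, and the class count is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    def __init__(self, emb_dim=512, n_classes=5994, margin=0.6, scale=32.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(n_classes, emb_dim))
        nn.init.xavier_uniform_(self.weight)
        self.m, self.s = margin, scale

    def forward(self, emb, labels):
        # cosine similarity between normalized embeddings and class weights
        cosine = F.linear(F.normalize(emb), F.normalize(self.weight))      # (B, C)
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cosine.size(1)).bool()
        # add the angular margin only on the target class
        logits = torch.where(target, torch.cos(theta + self.m), cosine)
        return F.cross_entropy(self.s * logits, labels)
```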
2.3 Embedding-level augmentation for noisy evaluation
2.3.1 noisy evaluation set construction
Information from different modalities is not always available or significant enough to perform the verification task. In practical applications, a modality is often damaged or missing due to certain unavoidable external factors (e.g., ambient light, human movement, or background noise). To address this situation, we constructed a noisy evaluation set based on the VoxCeleb1 evaluation set.
For image data, we use vertical and horizontal motion blur to mimic human motion in front of the lens, and Gaussian blur to mimic other noise. For audio data, the three types of noise in MUSAN are mixed with the original data to generate corrupted audio samples. We also consider the case where one modality is completely absent by directly setting the corresponding extracted embedding to a zero vector. The detailed procedure for constructing this data set is shown in Algorithm 1.
Algorithm 1: noisy evaluation set structure
Figure BDA0002594571350000121
Figure BDA0002594571350000131
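An illustrative sketch of the corruption operations described above (motion blur and Gaussian blur for images, additive noise mixing at a chosen SNR for audio, and a zero vector for a completely missing modality); kernel sizes and the SNR handling are assumptions, and loading of the MUSAN noise itself is omitted.

```python
import cv2
import numpy as np

def motion_blur(img, ksize=15, vertical=False):
    """Simulate motion of the person in front of the lens."""
    kernel = np.zeros((ksize, ksize), dtype=np.float32)
    if vertical:
        kernel[:, ksize // 2] = 1.0 / ksize
    else:
        kernel[ksize // 2, :] = 1.0 / ksize
    return cv2.filter2D(img, -1, kernel)

def gaussian_blur(img, ksize=9, sigma=3.0):
    """Simulate other image degradations."""
    return cv2.GaussianBlur(img, (ksize, ksize), sigma)

def add_noise(wav, noise, snr_db):
    """Mix a noise signal into the waveform at the given SNR (dB)."""
    noise = np.resize(noise, wav.shape)
    scale = np.sqrt(np.mean(wav ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10) + 1e-8))
    return wav + scale * noise

missing_modality_embedding = np.zeros(512, dtype=np.float32)  # modality completely absent
```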
2.3.2 Embedding-level augmentation
In order to build a system that is more robust to corrupted audiovisual data, an additional embedding-level augmentation strategy is proposed in this work. In our previous work, we used deep generative models such as generative adversarial networks (GAN) or variational autoencoders (VAE) to model the distribution of noisy speaker embeddings. Here, a simple statistics-based distribution matching algorithm is used.
We randomly selected 100,000 of the 1,092,009 recordings in the training set and generated different types of corrupted data. Then, for each noise type, we assume that the difference between the noisy embedding and the original embedding can be described by a Gaussian distribution. After estimating the parameters of this noise distribution, we sample noise from it and add it directly to the original embeddings to generate noisy embeddings. We refer to this embedding-level augmentation method as noise distribution matching (NDM). Compared with adding noise directly to the entire training set and extracting augmented embeddings, NDM uses only a small portion of the training data and augments the embeddings directly, saving time and disk space. Furthermore, we still use the zero vector to model the case of a missing modality.
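A minimal sketch of NDM under a diagonal-Gaussian assumption (the original parameterization may differ): estimate the per-dimension statistics of the difference between noisy and clean embeddings for one noise type, then sample from that distribution and add the sample to clean embeddings.

```python
import numpy as np

def fit_ndm(clean_emb, noisy_emb):
    """clean_emb, noisy_emb: (N, D) embeddings of the same recordings,
    without and with one type of corruption."""
    diff = noisy_emb - clean_emb
    return diff.mean(axis=0), diff.std(axis=0)           # Gaussian parameters of the noise

def ndm_augment(clean_emb, mu, sigma, rng=np.random.default_rng()):
    noise = rng.normal(mu, sigma, size=clean_emb.shape)  # sample embedding-level noise
    return clean_emb + noise                             # synthetic "noisy" embeddings
```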
3. Experimental setup
3.1 Data sets
In our experiments we used both the visual and the audio data from the VoxCeleb1 and VoxCeleb2 data sets. For training, we used the dev portion of the VoxCeleb2 data set, which includes 5,994 speakers and 1,092,009 utterances. VoxCeleb1 was used as the evaluation set, and all three official test lists, VoxCeleb1-O, VoxCeleb1-E and VoxCeleb1-H, were used for evaluation. Note that the visual data of the official VoxCeleb1 data set is incomplete, so we downloaded the missing visual data from YouTube and published it.
3.2 Experimental setup
3.2.1 Single Modal System
For audio data, 40-dimensional Fbank features were extracted using the Kaldi toolkit, and silence frames were removed using an energy-based voice activity detector. We then apply cepstral mean normalization (CMN) to the Fbank features with a sliding window of size 300. For video data, we extract 1 frame per second. We then detect facial landmarks using MTCNN and map the face region to the same shape (3×112×96) using a similarity transformation. Finally, we normalize the pixel values of each image to [0, 1] and subtract 0.5 to map the value range to [-0.5, 0.5].
During training, the Fbank features of an utterance are split into chunks with sizes ranging from 200 to 400 frames. During testing, we extract one speech embedding for each recording, and average the face embeddings of a recording to obtain a single face representation.
In our experiments, a 50-layer SE-ResNet is used for the face system and a 34-layer ResNet is used for the voice system. The embedding dimension of both systems is set to 512. The AAM loss with margin m = 0.2 is used to optimize both systems.
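A minimal sketch of two of the preprocessing steps described above: sliding-window cepstral mean normalization of the Fbank features (window size 300) and pixel normalization of the cropped face images; the Kaldi Fbank extraction and MTCNN alignment themselves are omitted.

```python
import numpy as np

def sliding_cmn(feats, window=300):
    """feats: (T, 40) Fbank features; subtract the mean over a centered window."""
    T = feats.shape[0]
    out = np.empty_like(feats)
    half = window // 2
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half)
        out[t] = feats[t] - feats[lo:hi].mean(axis=0)
    return out

def normalize_face(img_uint8):
    """img_uint8: (112, 96, 3) cropped face; map pixel values to [-0.5, 0.5]."""
    return img_uint8.astype(np.float32) / 255.0 - 0.5
```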
3.2.2 multimodal System
Facial and speech feature embeddings are extracted from the single modality system for all recordings in the training set. Then, L2 normalization was performed on all the embeddings to build a new training set for the audiovisual multimodal system.
For the SSA fusion system, the conversion layers are two fully connected layers with 512 units each, and the attention layer is a fully connected layer with two units. For compact bilinear fusion and gated multi-modal fusion, the conversion layer is a fully connected layer with 512 units. The attention layer in the gated multi-modal fusion system consists of two fully connected layers with 32 and 512 units, respectively. Between all adjacent fully connected layers above, we insert a batch normalization layer and a ReLU layer.
4. Results and analysis
4.1 evaluation of embedding layer multimodal fusion
In order to fuse the information of the face and voice modalities, different fusion strategies and different loss functions were explored and compared in our embedding-level fusion system. The results and analysis are presented in this section.
The results of the single-modality systems are shown at the top of Table 1. We find that the face and speech single-modality systems are roughly comparable. As shown in the third row of Table 1, the simple score average of the two single-modality systems greatly surpasses either of them, indicating strong complementarity between the audio and visual modalities.
4.1.1 loss function comparison
First, the SSA fusion strategy was investigated under contrastive loss supervision. However, as shown in the middle of Table 1, the system based on the original contrastive loss did not converge to the optimum in our experiments, and the performance of the fused system was even much worse than that of the single-modality systems. To improve the contrastive loss, the revision with the more aggressive sampling strategy introduced in Section 2.2.1 was used, which gives much better results (SSA + Con-new). To demonstrate the effectiveness of the new strategy more intuitively, the distributions of the distances between positive and negative pairs are shown in FIGS. 6a and 6b, where FIG. 6a shows the distances under the original contrastive loss and FIG. 6b shows the distances under the new contrastive loss. They indicate that the new contrastive loss enlarges the gap between the positive and negative distances. Furthermore, in addition to the contrastive losses, we also used the classification-based AAM-softmax loss to optimize the multi-modal system, which far outperforms the contrastive losses. AAM-softmax and the new contrastive loss are therefore mainly used in the following experiments.
4.1.2 fusion strategy comparison
In this section, the different fusion strategies introduced in Section 2.1 are compared, with the AAM-softmax loss or the new contrastive loss providing the supervision signal. The results are shown in the middle part of Table 1. From the results, all three fusion strategies achieve significant improvements over the single-modality systems, and the gated multi-modal fusion architecture performs best among them. However, simple score averaging still performs best overall, which is not consistent with the findings in [23]. A possible reason is that we have more powerful single-modality systems in this work: using the same VoxCeleb2 test list, our face and speech EERs are 4.08% and 3.43%, respectively, while the corresponding numbers in [23] are 14.5% and 8.03%. The larger difference can also be attributed to different experimental settings: we employ segment-level optimization in our systems, while the authors of [23] use a frame-level embedding extractor for online verification.
Furthermore, when we use both the AAM loss and the new contrastive loss, a further improvement is obtained, and the performance on the VoxCeleb1-E and VoxCeleb1-H test lists exceeds the score-average results, as shown in the penultimate row of Table 1. Surprisingly, we found that the fusion system using the proposed model is complementary to the simple score-averaging system. When we further average the score of the GATE + AAM + Con-new fusion system with the average score of the single-modality systems, the best system performance is obtained. To our knowledge, this is also the best published result for person verification on the VoxCeleb1 evaluation set.
Table 1: results of different fusion strategies and losses were used for comparison. The disadvantages are as follows: the original contrast is lost. The method comprises the following steps: the contrast loss suggested by using a more aggressive sampling strategy. M in Con-orig is set to 0.5, and m in Con-new is set to 0.05.
[Table 1 is provided as a figure in the original document.]
4.2 Evaluation with corrupted and missing modalities
To test the fusion system under more complex realistic conditions, in which one modality is corrupted or missing, it is evaluated on the noisy evaluation set described in Section 2.3.1, and the results are shown in Table 2. From the results, we find that the simple score-averaging operation can still significantly improve performance, and the proposed multi-modal fusion system trained with augmented embedding data achieves the best results in this case. Furthermore, the audiovisual fusion system trained only on clean embeddings cannot distinguish noisy embeddings from clean ones well, and its results are somewhat worse. Note that the results in parentheses indicate that the proposed fusion system trained with augmented embeddings still performs well on the clean evaluation set.
Table 2: results (EER%) comparison over a noisy evaluation set. We used here the GATE + AMM + Con-new fusion system. Train _ Clean: the fusion system is trained with clean embedding. Train _ Noise: the fusion system is trained with enhanced noise embedding. The results in brackets were tested in a clean evaluation set.
[Table 2 is provided as a figure in the original document.]
5. Conclusion
In this work, we explored different multi-modal fusion strategies and loss functions for person verification systems that efficiently combine audiovisual information at the embedding level. Built on powerful single-modality systems, our best system achieves 0.585%, 0.427% and 0.735% EER on the three official test lists of VoxCeleb1, which, to our knowledge, are the best published results on this data set. We also introduced an embedding-level data augmentation method that helps the audiovisual multi-modal person verification system perform well when one modality is corrupted or missing.
Fig. 7 is a schematic hardware structure diagram of an electronic device for performing a speaker authentication method according to another embodiment of the present application, and as shown in fig. 7, the electronic device includes:
one or more processors 710 and a memory 720, one processor 710 being illustrated in fig. 7.
The apparatus performing the speaker authentication method may further include: an input device 730 and an output device 740.
The processor 710, the memory 720, the input device 730, and the output device 740 may be connected by a bus or other means, such as the bus connection in fig. 7.
The memory 720, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the speaker identification verification method in the embodiments of the present application. The processor 710 executes various functional applications and data processing of the server by executing nonvolatile software programs, instructions and modules stored in the memory 720, so as to implement the speaker identity authentication method of the above-described method embodiments.
The memory 720 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from use of the speaker authentication apparatus, and the like. Further, the memory 720 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 720 optionally includes memory located remotely from processor 710, which may be connected to the speaker authentication device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 730 may receive input numeric or character information and generate signals related to user settings and function control of the speaker authentication device. The output device 740 may include a display device such as a display screen.
The one or more modules are stored in the memory 720 and, when executed by the one or more processors 710, perform the speaker identity verification method of any of the method embodiments described above.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones (e.g., iphones), multimedia phones, functional phones, and low-end phones, among others.
(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as ipads.
(3) Portable entertainment devices such devices may display and play multimedia content. Such devices include audio and video players (e.g., ipods), handheld game consoles, electronic books, as well as smart toys and portable car navigation devices.
(4) The server is similar to a general computer architecture, but has higher requirements on processing capability, stability, reliability, safety, expandability, manageability and the like because of the need of providing highly reliable services.
(5) And other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the above technical solutions substantially or contributing to the related art may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (12)

1. A speaker identity verification method, comprising:
acquiring audio data and facial image data of the speaker;
extracting voice feature embeddings from the audio data, extracting facial feature embeddings from the facial image data;
and determining identity feature embedding according to the voice feature embedding and the facial feature embedding for speaker identity verification.
2. The method of claim 1, wherein the determining identity feature embedding from the voice feature embedding and the facial feature embedding comprises:
embedding and inputting the voice features into a first embedded feature conversion layer to obtain preprocessed voice feature embedding;
Inputting the facial feature embedding into a second embedding feature conversion layer to obtain preprocessed facial feature embedding;
and carrying out fusion processing on the preprocessed voice feature embedding and the preprocessed face feature embedding to obtain identity feature embedding.
3. The method of claim 2, wherein the fusing the pre-processed speech feature embedding and pre-processed facial feature embedding to obtain identity feature embedding comprises:
determining, by an attention layer f_att(·), an attention score from the speech feature embedding and the facial feature embedding:
[s_f, s_v] = f_att([e_f, e_v])
determining a weighting factor from the attention score:
[w_f, w_v] = softmax([s_f, s_v])
determining identity feature embedding from the weighting coefficients and the pre-processed speech feature embedding and the pre-processed facial feature embedding:
e_p = w_f · ê_f + w_v · ê_v
wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, and ê_f is the preprocessed facial feature embedding.
4. The method of claim 2, wherein the fusing the pre-processed speech feature embedding and pre-processed facial feature embedding to obtain identity feature embedding comprises:
and performing fusion processing on the preprocessed voice feature embedding and the preprocessed face feature embedding in a compact bilinear pooling mode to obtain identity feature embedding.
5. The method of claim 2, wherein the fusing the pre-processed speech feature embedding and pre-processed facial feature embedding to obtain identity feature embedding comprises:
determining a gate vector from the speech feature embedding and the facial feature embedding:
z = σ(f_att([e_f, e_v]))
adopting the gate vector to fuse the preprocessed voice feature embedding and the preprocessed face feature embedding to obtain identity feature embedding:
e_p = z ⊙ ê_f + (1 − z) ⊙ ê_v
wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, ê_f is the preprocessed facial feature embedding, and ⊙ denotes the element-wise product.
6. A speaker authentication system, comprising:
the audio-visual data acquisition module is used for acquiring audio data and facial image data of the speaker;
the feature extraction module is used for extracting voice feature embedding from the audio data and extracting facial feature embedding from the facial image data;
and the identity characteristic embedding determination module is used for determining identity characteristic embedding according to the voice characteristic embedding and the facial characteristic embedding so as to carry out speaker identity verification.
7. The system of claim 6, wherein the identity embedding determination module comprises:
The first embedded feature conversion layer is used for preprocessing the voice feature embedding to obtain preprocessed voice feature embedding;
the second embedded feature conversion layer is used for preprocessing the facial feature embedding to obtain preprocessed facial feature embedding;
and the fusion module is used for carrying out fusion processing on the preprocessed voice feature embedding and the preprocessed face feature embedding so as to obtain identity feature embedding.
8. The system of claim 7, wherein the fusing the pre-processed speech feature embedding and pre-processed facial feature embedding to obtain identity feature embedding comprises:
determining, by an attention layer f_att(·), an attention score from the speech feature embedding and the facial feature embedding:
[s_f, s_v] = f_att([e_f, e_v])
determining a weighting factor from the attention score:
[w_f, w_v] = softmax([s_f, s_v])
determining identity feature embedding from the weighting coefficients and the pre-processed speech feature embedding and the pre-processed facial feature embedding:
e_p = w_f · ê_f + w_v · ê_v
wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, and ê_f is the preprocessed facial feature embedding.
9. The system of claim 7, wherein the fusing the pre-processed speech feature embedding and pre-processed facial feature embedding to obtain identity feature embedding comprises:
And performing fusion processing on the preprocessed voice feature embedding and the preprocessed face feature embedding in a compact bilinear pooling mode to obtain identity feature embedding.
10. The system of claim 7, wherein the fusing of the preprocessed speech feature embedding and the preprocessed facial feature embedding to obtain the identity feature embedding comprises:
determining a gate vector from the speech feature embedding and the facial feature embedding:
z = σ(f_att([e_f, e_v]))
fusing the preprocessed speech feature embedding and the preprocessed facial feature embedding with the gate vector to obtain the identity feature embedding ê:
ê = z ⊙ ê_f + (1 − z) ⊙ ê_v
wherein e_v is the speech feature embedding, e_f is the facial feature embedding, ê_v is the preprocessed speech feature embedding, ê_f is the preprocessed facial feature embedding, and ⊙ denotes the element-wise product.
11. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-5.
12. A storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the method of any one of claims 1 to 5.
CN202010705582.9A 2020-07-21 2020-07-21 Speaker identity verification method and system Active CN111862990B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010705582.9A CN111862990B (en) 2020-07-21 2020-07-21 Speaker identity verification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010705582.9A CN111862990B (en) 2020-07-21 2020-07-21 Speaker identity verification method and system

Publications (2)

Publication Number Publication Date
CN111862990A (en) 2020-10-30
CN111862990B CN111862990B (en) 2022-11-11

Family

ID=73000790

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010705582.9A Active CN111862990B (en) 2020-07-21 2020-07-21 Speaker identity verification method and system

Country Status (1)

Country Link
CN (1) CN111862990B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200065612A1 (en) * 2018-08-27 2020-02-27 TalkMeUp Interactive artificial intelligence analytical system
CN109910818A (en) * 2019-02-15 2019-06-21 东华大学 Vehicle anti-theft system based on fusion recognition of multiple human body features
CN110674483A (en) * 2019-08-14 2020-01-10 广东工业大学 Identity recognition method based on multi-mode information
CN111310648A (en) * 2020-02-13 2020-06-19 中国科学院西安光学精密机械研究所 Cross-modal biometric feature matching method and system based on disentanglement expression learning
CN111246256A (en) * 2020-02-21 2020-06-05 华南理工大学 Video recommendation method based on multi-mode video content and multi-task learning

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114995657A (en) * 2022-07-18 2022-09-02 湖南大学 Multimode fusion natural interaction method, system and medium for intelligent robot
CN116504226A (en) * 2023-02-27 2023-07-28 佛山科学技术学院 Lightweight single-channel voiceprint recognition method and system based on deep learning
CN116504226B (en) * 2023-02-27 2024-01-02 佛山科学技术学院 Lightweight single-channel voiceprint recognition method and system based on deep learning
CN117011924A (en) * 2023-10-07 2023-11-07 之江实验室 Method and system for estimating number of speakers based on voice and image
CN117011924B (en) * 2023-10-07 2024-02-13 之江实验室 Method and system for estimating number of speakers based on voice and image
CN117155583A (en) * 2023-10-24 2023-12-01 清华大学 Multi-mode identity authentication method and system for incomplete information deep fusion
CN117155583B (en) * 2023-10-24 2024-01-23 清华大学 Multi-mode identity authentication method and system for incomplete information deep fusion

Also Published As

Publication number Publication date
CN111862990B (en) 2022-11-11

Similar Documents

Publication Publication Date Title
CN111862990B (en) Speaker identity verification method and system
US20220058426A1 (en) Object recognition method and apparatus, electronic device, and readable storage medium
CN109637546B (en) Knowledge distillation method and apparatus
US11068571B2 (en) Electronic device, method and system of identity verification and computer readable storage medium
US9105119B2 (en) Anonymization of facial expressions
CN110956966B (en) Voiceprint authentication method, voiceprint authentication device, voiceprint authentication medium and electronic equipment
CN111260620B (en) Image anomaly detection method and device and electronic equipment
CN112233698A (en) Character emotion recognition method and device, terminal device and storage medium
WO2020124993A1 (en) Liveness detection method and apparatus, electronic device, and storage medium
CN113537005A (en) Online examination student behavior analysis method based on pose estimation
WO2021203880A1 (en) Speech enhancement method, neural network training method, and related device
CN112149615A (en) Face living body detection method, device, medium and electronic equipment
CN114616565A (en) Living body detection using audio-visual disparity
CN109214616B (en) Information processing device, system and method
CN110232927B (en) Speaker verification anti-spoofing method and device
CN111259759B (en) Cross-database micro-expression recognition method and device based on domain selection migration regression
Yu et al. Cam: Context-aware masking for robust speaker verification
CN111414959A (en) Image recognition method and device, computer readable medium and electronic equipment
CN116312512A (en) Multi-person scene-oriented audiovisual fusion wake-up word recognition method and device
CN111898576B (en) Behavior identification method based on human skeleton space-time relationship
CN113343898B (en) Masked face recognition method, device and equipment based on knowledge distillation network
CN117079336B (en) Training method, device, equipment and storage medium for sample classification model
Zhang et al. Qiniu Submission to ActivityNet Challenge 2018
CN106251366A (en) System for automatically detecting and tracking multiple individuals using multiple cues
Fauzi et al. Development of Active Liveness Detection System Based on Deep Learning ActivenessNet to Overcome Face Spoofing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

GR01 Patent grant