CN114283817A - Speaker verification method and system - Google Patents

Publication number: CN114283817A (status: pending)
Application number: CN202111617782.XA
Applicant/Assignee: Sipic Technology Co Ltd
Inventors: 钱彦旻 (Qian Yanmin), 韩冰 (Han Bing), 陈正阳 (Chen Zhengyang), 刘贝 (Liu Bei)
Original language: Chinese (zh)
Abstract

An embodiment of the invention provides a speaker verification method. The method comprises the following steps: pre-segmenting a feature map of the time and frequency dimensions of the audio to be verified to obtain a plurality of feature segments, wherein the plurality of feature segments comprise: feature segments of time one-dimensional features, and feature segments of time-frequency two-dimensional features; determining global information of the feature segments of the time-frequency two-dimensional features and local information of the feature segments of the time one-dimensional features by means of multi-layer perceptron blocks; and determining, by means of a statistics pooling layer, a speaker embedding that carries the global information and the local information, and performing speaker verification with the speaker embedding. An embodiment of the invention also provides a speaker verification system. The multi-layer-perceptron-based speaker system can model local and global information simultaneously, and is therefore better at capturing both the global and the local characteristics of the audio to be verified, which improves the accuracy of speaker verification.

Description

Speaker verification method and system
Technical Field
The invention relates to the field of intelligent speech, and in particular to a speaker verification method and a speaker verification system.
Background
Speaker verification is the task of verifying a speaker's identity using speech as a biometric. In recent years, end-to-end deep embedding learning methods have been widely applied to speaker verification and have achieved good results. In general, a speaker verification model consists of three deep neural network components: a frame-level feature extractor, an utterance-level representation aggregator, and a speaker classifier.
To further improve speaker verification performance, models with different kinds of network structures have been used, for example the X-vector, the R-vector and the S-vector. The X-vector is a speaker verification system built from stacked one-dimensional convolutions, the R-vector is built from stacked two-dimensional convolutions, and the S-vector is a system whose main framework is a transformer, based chiefly on the self-attention mechanism.
In the course of implementing the invention, the inventors found at least the following problems in the related art:
the X-vector and the R-vector are based mainly on convolutional networks; limited by their receptive fields, they are good at modeling local information but tend to ignore global information. The self-attention-based S-vector, conversely, can model global information but easily misses local information. These models therefore tend to drop either the global or the local information of the speaker characteristics, and are not accurate enough when verifying speech for which the local or the global characteristics matter most.
Disclosure of Invention
Embodiments of the invention aim to at least solve the problem in the prior art that the global information or the local information of the speaker characteristics is easily ignored, so that accuracy suffers when verifying speech of a speaker for whom the local or the global characteristics are more important. In a first aspect, an embodiment of the present invention provides a speaker verification method, including:
pre-segmenting a feature map of the time and frequency dimensions of the audio to be verified to obtain a plurality of feature segments, wherein the plurality of feature segments comprise: feature segments of time one-dimensional features, and feature segments of time-frequency two-dimensional features;
determining global information of the feature segments of the time-frequency two-dimensional features and local information of the feature segments of the time one-dimensional features by means of multi-layer perceptron blocks;
and determining, by means of a statistics pooling layer, a speaker embedding that carries the global information and the local information, and performing speaker verification with the speaker embedding.
In a second aspect, an embodiment of the present invention provides a speaker verification system, including:
a feature segment segmentation program module for pre-segmenting a feature map of the time and frequency dimensions of the audio to be verified to obtain a plurality of feature segments, wherein the plurality of feature segments comprise: feature segments of time one-dimensional features, and feature segments of time-frequency two-dimensional features;
an information determination program module for determining global information of the feature segments of the time-frequency two-dimensional features and local information of the feature segments of the time one-dimensional features by means of multi-layer perceptron blocks;
and a speaker verification program module for determining, by means of a statistics pooling layer, a speaker embedding that carries the global information and the local information, and performing speaker verification with the speaker embedding.
In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the speaker verification method according to any of the embodiments of the present invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium having a computer program stored thereon, where the computer program is executed by a processor to implement the steps of the speaker verification method according to any embodiment of the present invention.
The embodiments of the invention have the following beneficial effect: the multi-layer-perceptron-based speaker system can model local and global information simultaneously, and is therefore better at capturing both the global and the local features in the audio to be verified, which improves the accuracy of speaker verification.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a speaker verification method according to an embodiment of the present invention;
FIG. 2 is a diagram of a multi-layered perceptron-based speaker verification network architecture for a speaker verification method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating the results of different patching methods of a speaker verification method according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating the results of different patch sizes of a speaker verification method according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating the results of different numbers of MLP blocks of a speaker verification method according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating comparison results between a speaker verification method and other speaker verification systems according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating the results of fusing different systems with ECAPA-TDNN for a speaker verification method according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a speaker verification system according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of an embodiment of an electronic device for speaker verification according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a speaker verification method according to an embodiment of the present invention, which includes the following steps:
S11: pre-segmenting a feature map of the time and frequency dimensions of the audio to be verified to obtain a plurality of feature segments, wherein the plurality of feature segments comprise: feature segments of time one-dimensional features, and feature segments of time-frequency two-dimensional features;
S12: determining global information of the feature segments of the time-frequency two-dimensional features and local information of the feature segments of the time one-dimensional features by means of multi-layer perceptron blocks;
S13: determining, by means of a statistics pooling layer, a speaker embedding that carries the global information and the local information, and performing speaker verification with the speaker embedding.
In the present embodiment, the structure of the method is shown in fig. 2, and the method as a whole comprises the following parts: a Pre-Patch pre-segmentation block (or pre-segmentation module), MLP blocks (multi-layer perceptron blocks, or multi-layer perceptron modules), a statistics pooling layer, and a classifier module. The pre-patch part uses a sliding window to aggregate neighboring information and slices the input into equal-length segments. The MLP block is the most important part of the method; it consists of two mixer blocks, one for the time dimension and one for the feature dimension, so that modeling is performed at the time level and at the feature level respectively. Finally, a statistics pooling layer pools the sequence to a fixed length to produce the speaker embedding.
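To make the pooling step concrete, the following is a minimal sketch of a statistics pooling layer of the kind described here: the frame-level sequence is reduced to a fixed-length vector by concatenating its mean and standard deviation over time. The PyTorch framing and the (batch, time, feature) tensor layout are illustrative assumptions, not part of the patent:

```python
import torch

def statistics_pooling(x: torch.Tensor) -> torch.Tensor:
    """Pool a variable-length sequence to a fixed-size vector.

    x: (batch, time, feature) frame-level representations.
    Returns (batch, 2 * feature): per-utterance mean and standard
    deviation concatenated along the feature axis.
    """
    mean = x.mean(dim=1)
    std = x.std(dim=1)
    return torch.cat([mean, std], dim=1)
```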
For step S11, stacking each training frame with its left and right context frames provides a better basis for performance than a single frame. Inspired by this, neighboring information is encoded by the pre-segmentation module of the method. As shown on the left in fig. 2, the input X is a feature map whose dimensions consist of time and frequency. The feature map is then segmented into overlapping feature segments using a sliding window. Finally, all feature segments are flattened and encoded as a series of fixed-dimension embeddings that serve as the input of the multi-layer perceptron blocks.
In this method, two patching methods for introducing local and global features are proposed: time one-dimensional features and time-frequency two-dimensional features. Patch 1D (time one-dimensional features) segments the feature map along the time dimension and stacks each frame with its neighbors. Patch 2D (time-frequency two-dimensional features) treats the feature map as an image and segments it along both the time dimension and the frequency dimension.
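As an illustration of the Patch 1D idea, the sketch below splits a (time, frequency) feature map into overlapping frame stacks with a sliding window and flattens each stack into one embedding. The PyTorch unfold-based implementation and the patch size and stride values are illustrative assumptions:

```python
import torch

def patch_1d(feat: torch.Tensor, patch_size: int, stride: int) -> torch.Tensor:
    """Split a (time, freq) feature map into overlapping 1-D patches.

    Each patch stacks patch_size consecutive frames (a frame plus its
    neighbors) and is flattened, giving (num_patches, patch_size * freq).
    """
    patches = feat.unfold(0, patch_size, stride)  # (num_patches, freq, patch_size)
    return patches.reshape(patches.size(0), -1)   # flatten each patch

# example: 300 frames of 40-dim filterbanks, 3-frame patches, stride 2
x = torch.randn(300, 40)
print(patch_1d(x, patch_size=3, stride=2).shape)  # torch.Size([149, 120])
```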
For step S12, the MLP-SVNet (multi-layer perceptron speaker verification network) consists mainly of multiple MLP blocks of the same size, each composed of two mixer blocks that model the time and frequency information of the signal. The model input is X ∈ R^(T×F), where T and F are the time and frequency dimensions, respectively.
As an embodiment, the first mixer block performs global information modeling on the feature segments of the time-frequency two-dimensional features to obtain the global information, and the second mixer block performs local information modeling on the time one-dimensional features to obtain local information from stacked adjacent frames.
The first mixer block and the second mixer block are each formed by two dense layers, a residual connection, and a Gaussian error linear unit.
In the present embodiment, as shown in fig. 2 (middle), the first mixer block is the time mixer block. It applies a dense transformation to the time dimension of the input. Since the time dimension corresponds to the columns of the feature map, transpose operations are added before and after the time mixer block for ease of implementation. The second is the frequency mixer block, which applies a dense transformation in the frequency dimension to mix the frequency features.
Each mixer block contains two dense layers, a residual connection, and a Gaussian error linear unit, as shown in fig. 2 (right). As described above, an MLP (multi-layer perceptron) block can be written as follows:
Y = Mixer((Mixer(X^T))^T)
where ^T denotes the transpose operation that exchanges the time and frequency dimensions, and the mixer is defined as follows:
Mixer(X) = X + W2 · σ(W1 · LN(X))
where σ is the GELU (Gaussian Error Linear Unit) activation function and LN denotes layer normalization; W1 and W2 are the transformation matrices of the two dense layers. Furthermore, every MLP block has the same input size: a pyramid structure, in which deeper blocks have lower resolution and a higher frequency dimension, was attempted, but the results were not good. In addition, the multi-layer perceptron speaker verification network does not use any positional embedding, because the MLP blocks are already sensitive to the order of the input tokens. The MLP blocks process the feature segments of the time-frequency two-dimensional features and the feature segments of the time one-dimensional features frame by frame to obtain the global information and the local information of the audio to be verified.
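The two formulas above map directly onto code. Below is a minimal PyTorch sketch of one MLP block built from a time mixer and a frequency mixer, with transpose operations around the time mixer as described; the class names and the hidden width are assumptions:

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """Mixer(X) = X + W2 · GELU(W1 · LN(X)), applied along the last dimension."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, hidden)  # dense layer W1
        self.fc2 = nn.Linear(hidden, dim)  # dense layer W2
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.fc2(self.act(self.fc1(self.norm(x))))  # residual connection

class MLPBlock(nn.Module):
    """Y = Mixer_freq((Mixer_time(X^T))^T) for inputs of shape (batch, T, F)."""
    def __init__(self, t_dim: int, f_dim: int, hidden: int = 256):
        super().__init__()
        self.time_mixer = MixerBlock(t_dim, hidden)  # mixes along the time axis
        self.freq_mixer = MixerBlock(f_dim, hidden)  # mixes along the frequency axis

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.time_mixer(x.transpose(1, 2)).transpose(1, 2)  # transpose in and out
        return self.freq_mixer(x)
```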
For step S13, speaker classification is performed using the global information and the local information. To explicitly enhance the similarity of intra-class samples and the diversity of inter-class samples, the speaker embeddings are densely classified with the AAM (Additive Angular Margin) softmax loss:

L = -(1/N) Σ_i log[ exp(s·cos(θ_{y_i,i} + m)) / ( exp(s·cos(θ_{y_i,i} + m)) + Σ_{j≠y_i} exp(s·cos θ_{j,i}) ) ]

where θ_{j,i} is the angle between the column vector W_j and the embedding x_i, with both W_j and x_i normalized; s is a scale factor and m is a hyperparameter controlling the margin. In this way it is determined whether the audio to be verified belongs to the same class as the pre-stored features of the speaker audio, which gives the speaker verification result for the audio to be verified.
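For reference, a compact PyTorch sketch of the AAM softmax loss written out above, adding the margin m to the target-class angle and scaling by s; the class name and the clamping epsilon are illustrative choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    """Additive angular margin softmax: cos(theta_y + m), scaled by s."""
    def __init__(self, embed_dim: int, num_classes: int, s: float = 32.0, m: float = 0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.s, self.m = s, m

    def forward(self, x: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # cosine of the angle between normalized embeddings and class vectors
        cos = F.linear(F.normalize(x), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos)  # margin on target class only
        return F.cross_entropy(self.s * logits, labels)
```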
According to this embodiment, the multi-layer-perceptron-based speaker system can model local and global information simultaneously, and is better at capturing the global and local features in the audio to be verified, thereby improving the accuracy of speaker verification.
In experiments on the method, the performance of the multi-layer perceptron speaker verification network was evaluated on the VoxCeleb dataset. The VoxCeleb2 development set, comprising 1,092,009 utterances from 5,994 speakers extracted from YouTube videos, is used for training. To generate additional training samples and increase data diversity, online data augmentation is performed with the MUSAN and RIR datasets. The noise types in MUSAN include ambient noise, music, television noise, and babble noise; additive-noise data are generated by mixing the noise with the original speech. For reverberation, a convolution operation is performed with 40,000 simulated room impulse responses from the RIR dataset. During training, each sample is augmented with a probability of 0.6.
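A rough sketch of such online augmentation is given below: with probability 0.6 a sample is either mixed with a noise recording or convolved with a room impulse response. The even split between the two branches, the SNR range, and the function names are assumptions beyond what is stated here:

```python
import random
import numpy as np
from scipy.signal import fftconvolve

def augment(speech: np.ndarray, noises: list, rirs: list, p: float = 0.6) -> np.ndarray:
    """With probability p, add noise or convolve with a simulated RIR.

    Assumes each noise clip is at least as long as the speech signal.
    """
    if random.random() > p:
        return speech
    if random.random() < 0.5:                       # additive-noise branch
        noise = random.choice(noises)[: len(speech)]
        snr_db = random.uniform(0, 20)              # illustrative SNR range
        scale = np.sqrt(np.sum(speech ** 2) /
                        (np.sum(noise ** 2) * 10 ** (snr_db / 10) + 1e-8))
        return speech + scale * noise
    rir = random.choice(rirs)                       # reverberation branch
    return fftconvolve(speech, rir, mode="full")[: len(speech)]
```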
A 40-dimensional filter bank with a 25 ms window and a 10 ms shift is used as the acoustic feature. All multi-layer perceptron speaker verification networks are trained on 300-frame chunks of speech features. At test time, each utterance is first divided into several 300-frame chunks, and the utterance embedding is then obtained by averaging the embeddings extracted from these chunks.
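The feature and test-time pipeline can be sketched as follows, assuming 16 kHz audio and torchaudio's Kaldi-compatible filterbank; the file path is illustrative, and model is a hypothetical stand-in for a trained embedding extractor:

```python
import torch
import torchaudio

# 40-dimensional log-mel filterbank, 25 ms window / 10 ms shift
waveform, sr = torchaudio.load("utterance.wav")        # path is illustrative
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=40, frame_length=25.0, frame_shift=10.0,
    sample_frequency=sr)                               # (num_frames, 40)

# split into full 300-frame chunks (remainder discarded for brevity),
# embed each chunk, and average the chunk embeddings
chunks = fbank[: fbank.size(0) // 300 * 300].reshape(-1, 300, 40)
model = torch.nn.Sequential(torch.nn.Flatten(1),
                            torch.nn.Linear(300 * 40, 256))  # stand-in for MLP-SVNet
embeddings = model(chunks)                             # (num_chunks, embed_dim)
utterance_embedding = embeddings.mean(dim=0)
```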
During training, the multi-layer perceptron speaker verification network is optimized with momentum 0.9 and weight decay 1e-4. Furthermore, to achieve better performance, the method uses AAM softmax as the loss function; the scale parameter and the AAM margin are set to 32 and 0.2, respectively. The whole training process lasts 165 epochs, with the learning rate decaying exponentially from 0.1 to 1e-5. Training runs in parallel on 4 GPUs (graphics processing units), with the batch size set to 64.
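A sketch of this training schedule, assuming an SGD-style optimizer (the text specifies only momentum and weight decay) and a per-epoch exponential decay from 0.1 to 1e-5 over 165 epochs:

```python
import torch

model = torch.nn.Linear(40, 256)  # stand-in for the network being trained
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

epochs = 165
gamma = (1e-5 / 0.1) ** (1.0 / epochs)   # per-epoch factor so lr ends at 1e-5
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

for epoch in range(epochs):
    # ... one training pass over the 300-frame chunks ...
    scheduler.step()
```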
The method investigates the different patching methods in MLP-SVNet (the multi-layer perceptron speaker verification network); the results are shown in fig. 3. Among all results, patching 1D achieves the best performance on the text-independent speaker verification task. This shows that stacking a frame with its neighboring content gathers local information better and brings a significant performance gain. The same does not hold for the patching 2D method, which suggests that splitting along the frequency dimension is not conducive to extracting good speaker information.
As described above, patching 1D outperforms the other patching methods with the lowest EER (Equal Error Rate). Based on the patching 1D method, the effect of the patch size is examined, as shown in fig. 4. The results show that it is necessary to introduce some local information through patching, but that too large a patch size hurts performance.
The experiments also analyze the influence of the number of MLP blocks on MLP-SVNet. The EER results for different numbers of blocks are shown in fig. 5. Increasing the number of blocks brings only a small improvement, and MLP-SVNet achieves comparable performance even with only a few blocks. This benefits from the strong ability of MLP-SVNet to model global information: with the time mixer block, global information can be aggregated and mixed well.
As shown in fig. 6, an overview of the performance of the MLP-SVNet system of the method and of other speaker verification systems is given. The results show that, except for the state-of-the-art ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network), which has additional special designs for exploiting multi-scale information, the MLP-SVNet of the method clearly outperforms most traditional systems. The results also show that, compared with other convolution-based or self-attention models, the MLP model has less inductive bias and more trainable parameters, and is superior in capturing long-range dependencies and local features.
Since the MLP-SVNet of the method is entirely MLP-based, its architecture differs greatly from convolution- or self-attention-based models. The results of the different fusion systems are given in fig. 7. The fusion of ECAPA-TDNN and MLP-SVNet provides the most significant performance gain, indicating that the method produces the most complementary speaker embeddings compared with the X-vector, R-vector and S-vector.
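Fusion of this kind is commonly done at the score level. A minimal sketch, assuming equal-weight fusion of z-normalized trial scores (the text does not specify the fusion method):

```python
import numpy as np

def fuse_scores(scores_a: np.ndarray, scores_b: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Weighted fusion of two systems' trial scores after z-normalization."""
    za = (scores_a - scores_a.mean()) / scores_a.std()
    zb = (scores_b - scores_b.mean()) / scores_b.std()
    return w * za + (1 - w) * zb

# e.g. fused = fuse_scores(ecapa_scores, mlp_svnet_scores)  # names illustrative
```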
In summary, the method proposes a new multi-layer-perceptron-based speaker verification network (MLP-SVNet) that uses no convolution or self-attention mechanism. It applies MLPs across time or frequency while modeling local and global information. Experimental results show that MLP-SVNet compares favorably with the X-vector, R-vector and S-vector; compared with other models, it has advantages in capturing long-term dependencies and local features. In addition, thanks to its completely different architecture, fusing MLP-SVNet with other systems can further improve performance.
Fig. 8 is a schematic structural diagram of a speaker verification system according to an embodiment of the present invention, which can execute the speaker verification method according to any of the above embodiments and is configured in a terminal.
The speaker verification system 10 provided in this embodiment includes: a feature segment segmentation program module 11, an information determination program module 12, and a speaker verification program module 13.
The feature segment segmentation program module 11 is configured to pre-segment a feature map of the time and frequency dimensions of the audio to be verified into a plurality of feature segments, the plurality of feature segments comprising: feature segments of time one-dimensional features, and feature segments of time-frequency two-dimensional features. The information determination program module 12 is configured to determine global information of the feature segments of the time-frequency two-dimensional features and local information of the feature segments of the time one-dimensional features by means of the multi-layer perceptron blocks. The speaker verification program module 13 is configured to determine, by means of the statistics pooling layer, a speaker embedding that carries the global information and the local information, and to perform speaker verification with the speaker embedding.
Further, each of the multi-layer perceptron blocks includes a first mixer block and a second mixer block;
the information determination program module is configured to:
perform global information modeling on the feature segments of the time-frequency two-dimensional features with the first mixer block to obtain the global information, and perform local information modeling on the time one-dimensional features with the second mixer block to obtain local information from stacked adjacent frames.
Further, the speaker verification program module is configured to:
densely classify the speaker embedding with a softmax loss function with an additive angular margin, classifying by strengthening the similarity of samples within each class and the diversity of samples between classes.
The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the speaker verification method in any method embodiment;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
pre-segmenting a feature map of the time and frequency dimensions of the audio to be verified to obtain a plurality of feature segments, wherein the plurality of feature segments comprise: feature segments of time one-dimensional features, and feature segments of time-frequency two-dimensional features;
determining global information of the feature segments of the time-frequency two-dimensional features and local information of the feature segments of the time one-dimensional features by means of multi-layer perceptron blocks;
and determining, by means of a statistics pooling layer, a speaker embedding that carries the global information and the local information, and performing speaker verification with the speaker embedding.
The non-volatile computer-readable storage medium may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the speaker verification method in any of the method embodiments described above.
Fig. 9 is a schematic diagram of a hardware structure of an electronic device for a speaker verification method according to another embodiment of the present application, and as shown in fig. 9, the electronic device includes:
one or more processors 910 and a memory 920, one processor 910 being illustrated in fig. 9. The apparatus of the speaker verification method may further include: an input device 930 and an output device 940.
The processor 910, the memory 920, the input device 930, and the output device 940 may be connected by a bus or other means, and fig. 9 illustrates an example of a connection by a bus.
The memory 920 is used as a non-volatile computer readable storage medium for storing non-volatile software programs, non-volatile computer executable programs, and modules, such as program instructions/modules corresponding to the speaker verification method in the embodiment of the present application. The processor 910 executes various functional applications and data processing of the server by executing the nonvolatile software programs, instructions and modules stored in the memory 920, so as to implement the speaker verification method of the above-described method embodiment.
The memory 920 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data and the like. Further, the memory 920 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 920 may optionally include memory located remotely from the processor 910, which may be connected to a mobile device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 930 may receive input numeric or character information. The output device 940 may include a display device such as a display screen.
The one or more modules are stored in the memory 920 and, when executed by the one or more processors 910, perform the speaker verification method in any of the method embodiments described above.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the speaker verification method according to any of the embodiments of the present invention.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones, multimedia phones, functional phones, and low-end phones, among others.
(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions, and generally also feature mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.
(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players, handheld game consoles, e-book readers, smart toys, and portable in-vehicle navigation devices.
(4) Other electronic devices with data processing capabilities.
In this document, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A speaker verification method, comprising:
pre-segmenting a feature map of the time and frequency dimensions of the audio to be verified to obtain a plurality of feature segments, wherein the plurality of feature segments comprise: feature segments of time one-dimensional features, and feature segments of time-frequency two-dimensional features;
determining global information of the feature segments of the time-frequency two-dimensional features and local information of the feature segments of the time one-dimensional features by means of multi-layer perceptron blocks;
and determining, by means of a statistics pooling layer, a speaker embedding that carries the global information and the local information, and performing speaker verification with the speaker embedding.
2. The method of claim 1, wherein each of the multi-layer perceptron blocks comprises a first mixer block and a second mixer block;
and wherein global information modeling is performed on the feature segments of the time-frequency two-dimensional features with the first mixer block to obtain the global information, and local information modeling is performed on the time one-dimensional features with the second mixer block to obtain local information from stacked adjacent frames.
3. The method of claim 1, wherein said performing speaker verification using said speaker embedding comprises:
and densely classifying the speaker embedding with a softmax loss function with an additive angular margin, classifying by strengthening the similarity of samples within each class and the diversity of samples between classes.
4. The method of claim 2, wherein the first mixer block and the second mixer block are each formed by: two dense layers, a residual connection, and a Gaussian error linear unit.
5. A speaker verification system, comprising:
a feature segment segmentation program module for pre-segmenting a feature map of the time and frequency dimensions of the audio to be verified to obtain a plurality of feature segments, wherein the plurality of feature segments comprise: feature segments of time one-dimensional features, and feature segments of time-frequency two-dimensional features;
an information determination program module for determining global information of the feature segments of the time-frequency two-dimensional features and local information of the feature segments of the time one-dimensional features by means of multi-layer perceptron blocks;
and a speaker verification program module for determining, by means of a statistics pooling layer, a speaker embedding that carries the global information and the local information, and performing speaker verification with the speaker embedding.
6. The system of claim 5, wherein each of the multi-layer perceptron blocks comprises a first mixer block and a second mixer block;
the information determination program module being configured to:
perform global information modeling on the feature segments of the time-frequency two-dimensional features with the first mixer block to obtain the global information, and perform local information modeling on the time one-dimensional features with the second mixer block to obtain local information from stacked adjacent frames.
7. The system of claim 5, wherein the speaker verification program module is configured to:
densely classify the speaker embedding with a softmax loss function with an additive angular margin, classifying by strengthening the similarity of samples within each class and the diversity of samples between classes.
8. The system of claim 6, wherein the first hybrid block and the second hybrid block are formed by: two dense layers, a residual connecting layer and a Gaussian error linear unit.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any of claims 1-4.
10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 4.
Application CN202111617782.XA — Speaker verification method and system — filed 2021-12-27 by Sipic Technology Co Ltd; published 2022-04-05 as CN114283817A (status: pending).

Cited by: CN114974258A / CN114974258B (2022) — Speaker separation method, device, equipment and storage medium based on voice processing.

Legal events: PB01 (publication); SE01 (entry into force of request for substantive examination).