CN113345444B - Speaker confirmation method and system - Google Patents

Speaker confirmation method and system

Info

Publication number
CN113345444B
CN113345444B (application CN202110496856.2A)
Authority
CN
China
Prior art keywords
speaker
nested
vector
residual
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110496856.2A
Other languages
Chinese (zh)
Other versions
CN113345444A (en)
Inventor
陈增照
郑秋雨
何秀玲
戴志诚
张婧
孟秉恒
李佳文
吴潇楠
朱胜虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central China Normal University
Original Assignee
Central China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central China Normal University filed Critical Central China Normal University
Priority to CN202110496856.2A priority Critical patent/CN113345444B/en
Publication of CN113345444A publication Critical patent/CN113345444A/en
Application granted granted Critical
Publication of CN113345444B publication Critical patent/CN113345444B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/08 Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Image Analysis (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a speaker confirmation (verification) method and system, comprising the following steps: preprocessing the audio information of a speaker and converting it into data in a preset format; inputting the data in the preset format corresponding to the speaker audio information into a trained deep nested residual neural network based on a spatial attention mechanism to obtain a frame-level speaker vector; generating an utterance-level speaker vector from the frame-level speaker vector, and calculating the cosine similarity between the utterance-level speaker vector and a pre-acquired target speaker vector to determine whether the speaker is the target speaker. The invention provides a deep nested residual neural network based on a spatial attention mechanism that can extract the voiceprint features of a speaker more accurately.

Description

Speaker confirmation method and system
Technical Field
The invention belongs to the field of speaker recognition, and particularly relates to a speaker confirmation method and a speaker confirmation system.
Background
Voiceprint is a general term for the speech features embedded in speech that characterize and identify a speaker, and for the speech models built from those features. Language is a very important medium of interpersonal communication. A person's voice is difficult to change after adulthood: because pronunciation organs and articulation habits differ, the vocal characteristics of each person are unique, and a speaker's most basic pronunciation and vocal tract characteristics are difficult to imitate. Therefore, exploiting the uniqueness and short-time stationarity of the voiceprint, a model can be built from speech and used for identity authentication.
Voiceprint recognition, also called speaker recognition, is the process of determining a target speaker from the voiceprint characteristics of the speech to be recognized. Speaker recognition tasks fall into two basic categories: speaker identification and speaker verification. Speaker identification determines which speaker in a trained corpus a given voiceprint sample belongs to, a one-to-many selection problem; speaker verification (confirmation) determines whether a given voiceprint sample matches the claimed identity of a target speaker, a one-to-one decision problem.
Although many researchers at home and abroad have proposed a series of mature speaker verification methods, such as the GMM-UBM (Gaussian mixture model-universal background model), the LSTM (long short-term memory) network model, and CNN (convolutional neural network) models, many problems and much room for improvement remain in research methods, model performance, and application scenarios.
The defects of the existing speaker verification methods mainly include the following points: 1. Many methods are still based on the traditional GMM-UBM model, and their technical means remain at the most traditional model-matching stage; as technology develops rapidly, such traditional methods are complex and struggle to process large amounts of data. 2. A speaker's voiceprint carries much useful speaker information, yet speaker audio is highly susceptible to many factors, including the environment, the recording device, and the speaker's own state; existing deep learning speaker verification algorithms cannot extract speaker characteristics well enough, and their overall effect needs improvement. 3. Speaker verification across different scenarios and different languages is a current challenge for voiceprint recognition; existing speaker verification methods cannot minimize the influence of cross-language data on recognition performance, and their results in cross-language recognition research are not ideal.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a speaker verification method and system to solve the problems of low recognition accuracy and large network scale in existing speaker recognition neural network systems.
To achieve the above object, in a first aspect, the present invention provides a speaker verification method, including the steps of:
preprocessing audio information of a speaker, and converting the audio information into data in a preset format;
inputting the data in the preset format corresponding to the speaker audio information into a trained deep nested residual neural network based on a spatial attention mechanism to obtain a frame-level speaker vector; the deep nested residual neural network based on the spatial attention mechanism comprises four network layers, each containing two nested residual blocks, and a spatial attention mechanism; the spatial attention mechanism is introduced after the nested residual network, applies average pooling and maximum pooling in an attention module along the spatial dimensions, merges the two pooling results to retain useful information and reduce the parameter scale, and uses a sigmoid function in the activation layer of the attention module to obtain the frame-level speaker vector;
generating a speaker vector of an utterance level based on the speaker vector of the frame level, and calculating cosine similarity of the speaker vector of the utterance level and a target speaker vector to determine whether the speaker is a target speaker; the target speaker vector is pre-acquired.
In an optional example, the preprocessing is performed on the audio information of the speaker, and the audio information is converted into data in a preset format, specifically:
converting WAV format audio files of speakers into flac format files by adopting an audio conversion technology, and preprocessing the flac format files to obtain npy format data containing all information of the speakers.
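For illustration only, this preprocessing pipeline might be sketched in Python as follows. The librosa and soundfile calls, the 16 kHz sample rate, the 25 ms/10 ms framing, and the 64 mel bins are assumptions of the sketch; the description above specifies only the WAV-to-flac conversion and the npy output format.

```python
# Illustrative preprocessing sketch (parameters assumed, not from the patent):
# convert a speaker's WAV file to flac, extract log-mel filterbank (fbank)
# features, and save them in .npy format for the network.
import librosa
import numpy as np
import soundfile as sf

def preprocess(wav_path: str, flac_path: str, npy_path: str, sr: int = 16000) -> np.ndarray:
    audio, _ = librosa.load(wav_path, sr=sr)       # load and resample the WAV audio
    sf.write(flac_path, audio, sr, format="FLAC")  # re-encode as flac, as in the pipeline
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=400, hop_length=160, n_mels=64
    )                                              # 25 ms windows with a 10 ms hop (assumed)
    fbank = librosa.power_to_db(mel).T             # (frames, 64) log-mel fbank features
    np.save(npy_path, fbank)                       # persist as .npy for the network input
    return fbank
```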
In an optional example, each nested residual block includes two sub-residual blocks, each sub-residual block includes two cells, and each cell is a building block; a convolution layer is placed in front of every two nested residual blocks;
the two nested sub-residual blocks implement the stacking function according to the following formulas:
H₁(x) = F₁(x) + x
H₂(x) = F₂(x) + H₁(x)
H(x) = H₂(x) + x
where x represents the input data of the first nested residual block, F₁(x) represents the output of the first sub-residual block in the nested residual block, H₁(x) represents the sum of F₁(x) and x, F₂(x) represents the output of the second sub-residual block in the nested residual block, H₂(x) represents the sum of F₂(x) and H₁(x), and H(x) represents the output of the two nested residual blocks.
In an alternative example, a spatial attention mechanism is introduced after the nested residual block, and a sigmoid function is used in the activation layer of the attention module to obtain the frame-level speaker vector, with the following formulas:
F″ = f{avg_pool(V), max_pool(V)}
F′ = σ(F″)
F = Multiply(V, F′)
where V represents the speaker vector output by the nested residual neural network, avg_pool represents the average pooling operation, max_pool represents the maximum pooling operation, and f{ } represents merging the results of the two pooling operations to obtain a new speaker vector F″; F′ represents the speaker vector obtained by applying the activation function to F″; F represents the frame-level speaker vector.
In an optional example, the calculating a cosine similarity between the speaker vector at the utterance level and the target speaker vector to determine whether the speaker is the target speaker specifically includes:
setting a threshold value for the probability value of the cosine similarity, and judging that the speaker is a target speaker when the probability value of the cosine similarity is greater than the threshold value, otherwise, judging that the speaker is not the target speaker.
In a second aspect, the present invention provides a speaker verification system, comprising:
the speaker audio determining unit is used for preprocessing audio information of a speaker and converting the audio information into data in a preset format;
the frame-level vector determining unit is used for inputting the data in the preset format corresponding to the speaker audio information into a trained deep nested residual neural network based on a spatial attention mechanism to obtain a frame-level speaker vector; the deep nested residual neural network based on the spatial attention mechanism comprises four network layers, each containing two nested residual blocks, and a spatial attention mechanism; the spatial attention mechanism is introduced after the nested residual network, applies average pooling and maximum pooling in an attention module along the spatial dimensions, merges the two pooling results to retain useful information and reduce the parameter scale, and uses a sigmoid function in the activation layer of the attention module to obtain the frame-level speaker vector;
a speaker confirmation unit for generating a speaker vector at an utterance level based on the speaker vector at the frame level, and calculating a cosine similarity between the speaker vector at the utterance level and a target speaker vector to determine whether the speaker is a target speaker; the target speaker vector is pre-acquired.
In an optional example, the speaker audio determining unit converts the WAV format audio file of the speaker into a flac format file by using an audio conversion technology, and pre-processes the flac format file to obtain npy format data containing all information of the speaker.
In an alternative example, each nested residual block comprises two sub-residual blocks, each sub-residual block comprises two cells, and each cell is a building block; a convolution layer is arranged in front of every two nested residual blocks;
the two nested sub-residual blocks implement the stacking function according to the following formulas:
H₁(x) = F₁(x) + x
H₂(x) = F₂(x) + H₁(x)
H(x) = H₂(x) + x
where x represents the input data of the first nested residual block, F₁(x) represents the output of the first sub-residual block in the nested residual block, H₁(x) represents the sum of F₁(x) and x, F₂(x) represents the output of the second sub-residual block in the nested residual block, H₂(x) represents the sum of F₂(x) and H₁(x), and H(x) represents the output of the two nested residual blocks.
In an alternative example, the frame-level vector determining unit introduces a spatial attention mechanism after the nested residual block and uses a sigmoid function in the activation layer of the attention module to obtain the frame-level speaker vector, with the following formulas:
F″ = f{avg_pool(V), max_pool(V)}
F′ = σ(F″)
F = Multiply(V, F′)
where V represents the speaker vector output by the nested residual neural network, avg_pool represents the average pooling operation, max_pool represents the maximum pooling operation, and f{ } represents merging the results of the two pooling operations to obtain a new speaker vector F″; F′ represents the speaker vector obtained by applying the activation function to F″; F represents the frame-level speaker vector.
In an optional example, the speaker determination unit sets a threshold to the probability value of the cosine similarity, and determines that the speaker is the target speaker when the probability value of the cosine similarity is greater than the threshold, otherwise determines that the speaker is not the target speaker.
Generally, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:
the invention provides a speaker identification method and a speaker identification system, and provides a novel nested residual error neural network. After demonstrating the efficiency of ResNet in speaker verification, we conducted a series of explorations on this study. The concept of using the residual of the residuals is presented herein for the first time; the present invention proposes the use of a spatial attention mechanism in the speaker verification task. The invention adopts a method of adding an attention mechanism to obtain more valuable voiceprint information from a voiceprint energy spectrum, which is to use the spatial attention for picture processing for the processing of the voiceprint information for the first time. The invention provides a speaker confirmation network model with better performance, and experiments prove that the method is superior to the prior art, and then an attention mechanism is added on the basis of the network model to further improve the system performance.
The invention provides a speaker confirmation method and a speaker confirmation system, and provides a deep nested residual error neural network based on a space attention mechanism, wherein the deep neural network is used for more accurately extracting the voiceprint characteristics of a speaker. Experimental results show that the method has high learning capacity and can exceed the performance of the traditional neural network method. Compared with other latest methods, the accuracy of the experimental result on the English public data set LibriSpeech is improved by 6%, and the error rate is also low. The invention also proves that the method has good cross-language adaptability through the training performance on the Chinese data set AISHELL.
Drawings
Fig. 1 is a flow chart of a speaker verification method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another speaker verification method provided by an embodiment of the present invention;
FIG. 3 is a diagram of nested residual blocks provided by an embodiment of the present invention;
FIG. 4 is a block diagram of a nested residual neural network based on an attention mechanism provided by an embodiment of the present invention;
FIG. 5 is a block diagram of a speaker spatial attention mechanism provided by an embodiment of the present invention;
FIG. 6 shows the performance of different models provided by embodiments of the present invention on an English data set;
FIG. 7 shows the performance of different models on a Chinese dataset according to an embodiment of the present invention;
FIG. 8 is a comparison of performance on different data sets before and after the attention mechanism is added, as provided by an embodiment of the present invention;
FIG. 9 is a graph comparing the performance of different models on Chinese-English cross-language data sets according to an embodiment of the present invention;
fig. 10 is an architecture diagram of a speaker verification system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention aims to extract the fbank voiceprint features of a speaker with a deep nested residual neural network based on a spatial attention mechanism. During training, the fbank voiceprint features are input in the form of sample pairs; the speaker is then represented by the frame-level speaker vector output by the last layer of the deep neural network, and cosine similarity is used to judge whether the current speaker is the target speaker.
Fig. 1 is a flowchart of a speaker verification method according to an embodiment of the present invention; as shown in fig. 1, the method comprises the following steps:
s101, preprocessing audio information of a speaker, and converting the audio information into data in a preset format;
s102, inputting data in a preset format corresponding to speaker audio information into a trained deep nested residual error neural network based on a space attention mechanism to obtain a speaker vector at a frame level; the deep nested residual error neural network based on the spatial attention mechanism comprises: four layers of nested residual error neural networks each comprising two nested residual error blocks and a spatial attention mechanism; introducing a spatial attention mechanism after nesting a residual neural network, wherein the spatial attention mechanism introduces average pooling and maximum pooling in an attention module based on spatial dimensions, combines two parts of pooling results to retain useful information and reduce parameter scale, and uses a sigmoid function in an activation layer of the attention module to obtain a speaker vector at a frame level;
s103, generating a speaker vector at an utterance level based on the speaker vector at the frame level, and calculating the cosine similarity between the speaker vector at the utterance level and a target speaker vector to judge whether the speaker is a target speaker; the target speaker vector is pre-acquired.
The speaker verification method based on the spatial attention mechanism and the deep nested residual neural network is divided into three parts: audio preprocessing, speaker voiceprint feature extraction, and speaker verification; the overall processing flow is shown in FIG. 2. First, the WAV audio files with speaker labels in the public dataset LibriSpeech are converted into flac files by an audio conversion technique, and the flac files are preprocessed to obtain npy files containing all the speaker information. The npy file is then used as the input to the deep nested residual neural network, and a frame-level vector representation of the speaker is obtained through a series of convolution and pooling operations. Training is carried out in the form of AP and AN sample pairs, where AP denotes anchor-positive pairs and AN denotes anchor-negative pairs. The deep nested neural network improves the model's performance on the speaker verification task by continuously learning the vector similarity within the AP and AN pairs.
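The loss used to learn the similarity within the AP and AN pairs is not specified here; a triplet-style cosine margin loss is one common realization and is sketched below under that assumption (the 0.2 margin is likewise illustrative).

```python
# Hedged sketch of learning from AP/AN sample pairs: a triplet-style cosine
# margin loss (assumed; the patent does not specify the loss function).
import torch
import torch.nn.functional as F

def triplet_cosine_loss(anchor: torch.Tensor,
                        positive: torch.Tensor,
                        negative: torch.Tensor,
                        margin: float = 0.2) -> torch.Tensor:
    # Inputs: (batch, emb_dim) utterance-level speaker vectors.
    sim_ap = F.cosine_similarity(anchor, positive, dim=-1)  # similarity within AP pairs
    sim_an = F.cosine_similarity(anchor, negative, dim=-1)  # similarity within AN pairs
    # Push AP similarity above AN similarity by at least the margin.
    return F.relu(sim_an - sim_ap + margin).mean()
```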
The invention proposes a nested residual block suitable for the speaker verification task. The detailed network structure is shown in FIG. 3, where ReLU denotes the rectified linear unit, a commonly used activation function, and NR-block denotes the structure of the nested residual block proposed in the present invention.
Two nested residual blocks are stacked in the architecture of the invention, and each nested residual block is formed by stacking two sub-residual blocks. Each sub-residual block contains two cells, each cell being a building block. The present invention places one convolutional layer in front of every two nested residual blocks, with the kernel size set to 5 and the stride set to 2, which copes well with the change caused by the increase in the number of channels.
In the nested residual neural network, H(x) is regarded as the underlying mapping to be fitted by the stacked layers, where x represents the input of the first layer in the nested residual block. In the first building block, the invention assumes the residual function of equation (1); equation (2) is then obtained in the second residual block. The invention thus implements the stacking function of the nested residual block, whose output can be used as the input of the next nested residual block. Changing the objective function from H(x) to F(x) greatly reduces the difficulty of network learning.
H₁(x) = F₁(x) + x (1)
H₂(x) = F₂(x) + H₁(x) (2)
H(x) = H₂(x) + x (3)
where subscript 1 denotes the first sub-residual block in the proposed nested residual block and subscript 2 denotes the second.
Specifically, the nested residual neural network is formed by stacking four identical network layers, each containing two nested residual blocks, and each nested residual block contains two sub-residual blocks. In the nested network, the output H(x) of the first nested residual block serves as the input of the second nested residual block, equations (1) to (3) are executed again, and so on, until the output of the last nested residual block is obtained. This output of the whole stack in the nested residual neural network can be expressed as V, the frame-level speaker vector referred to below.
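A minimal PyTorch sketch of one nested residual block (NR-block) implementing equations (1) to (3) is given below. Only the nested skip connections are taken from the description above; the 3x3 kernels, batch normalization, and ReLU placement are assumptions of the sketch.

```python
# Hedged sketch of an NR-block: H1(x) = F1(x) + x, H2(x) = F2(x) + H1(x),
# H(x) = H2(x) + x, where each F_i is a sub-residual block of two conv cells.
import torch
import torch.nn as nn

class SubResidualBody(nn.Module):
    """The residual function F_i: two convolutional cells. The skip connection
    itself is added by the caller, per equations (1)-(3)."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)  # F_i(x)

class NestedResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.f1 = SubResidualBody(channels)
        self.f2 = SubResidualBody(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h1 = self.relu(self.f1(x) + x)    # H1(x) = F1(x) + x
        h2 = self.relu(self.f2(h1) + h1)  # H2(x) = F2(x) + H1(x)
        return self.relu(h2 + x)          # H(x)  = H2(x) + x
```

Four such layers, each placing two of these blocks behind a kernel-5, stride-2 convolution, would then form the frame-level feature extractor described above.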
The speaker vector output by the nested residual neural network acting as the frame-level feature extractor can be represented as V = [v₁ v₂ … v_d], where vₜ ∈ R^D and R^D is the D-dimensional vector space; v denotes the vector corresponding to each frame, and d denotes the input frame size of the layer. The study underlying the present invention found that using a deep neural network alone can cause gradient explosion and vanishing gradients. Moreover, when voiceprint information is transmitted in the form of a voiceprint energy spectrum, trying to use all of the information indiscriminately would undoubtedly increase the workload of the system and may be unreliable.
Therefore, FIG. 4 shows the structure of the nested residual neural network based on the attention mechanism according to an embodiment of the present invention, and FIG. 5 shows the structure of the speaker spatial attention mechanism. The present invention introduces a spatial attention mechanism after the nested residual blocks: average pooling and maximum pooling are applied in the attention module along the spatial dimensions, and the two are merged in equation (4), preserving useful information and reducing the parameter scale. The dimensions of the features are then reduced to a single channel using a two-dimensional convolutional layer. A sigmoid function is used in the activation layer to obtain the spatial attention map of the speaker. Finally, equation (6) multiplies the attention map with the features input to the attention module. At this point, the present invention obtains the frame-level speaker feature vector.
F″ = f{avg_pool(V), max_pool(V)} (4)
F′ = σ(F″) (5)
F = Multiply(V, F′) (6)
where V represents the speaker vector output by the nested residual neural network, avg_pool represents the average pooling operation, max_pool represents the maximum pooling operation, and f{ } represents merging the results of the two pooling operations to obtain a new speaker vector F″; F′ represents the speaker vector obtained by applying the activation function to F″; F represents the frame-level speaker vector that is finally output, and is the frame-level speaker vector used later for the cosine similarity decision.
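A minimal PyTorch sketch of this spatial attention step, following the CBAM-style formulation that equations (4) to (6) describe, is shown below; the 7x7 convolution kernel size is an assumption of the sketch.

```python
# Hedged sketch of the spatial attention module: channel-wise average and max
# pooling are merged by a 2D convolution into a single-channel map (F''), a
# sigmoid gives the attention map (F'), which is multiplied element-wise with
# the input V to give the frame-level features (F).
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (batch, channels, freq, time) output of the nested residual network
        avg = torch.mean(v, dim=1, keepdim=True)     # avg_pool(V) over channels
        mx, _ = torch.max(v, dim=1, keepdim=True)    # max_pool(V) over channels
        f2 = self.conv(torch.cat([avg, mx], dim=1))  # F'' = f{avg_pool(V), max_pool(V)}
        f1 = self.sigmoid(f2)                        # F'  = sigma(F'')
        return v * f1                                # F   = Multiply(V, F')
```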
Specifically, the frame-level speaker vectors are input into a dimension-reduction layer and then averaged over the time dimension to generate the utterance-level speaker vector. That is, the utterance-level speaker vector is generated from the frame-level speaker vectors through dimension reduction, a fully connected layer, averaging, and length normalization.
The fully-connected layer projects the utterance level representation into a 512-dimensional speaker vector. We normalize the features by a length normalization operation and use cosine similarity in the objective function:
cos(X, Y) = XᵀY
where X and Y are two different utterance-level speaker vectors. The probability that X and Y are similar is modeled by the cosine similarity formula above, and whether X and Y belong to the same speaker is judged by setting a threshold on this similarity probability.
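For illustration, this back end might be sketched as follows; the input frame dimension and the 0.5 decision threshold are assumptions of the sketch, while the 512-dimensional projection, time averaging, length normalization, and cosine scoring follow the description above.

```python
# Hedged sketch of the utterance-level back end and the verification decision.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UtteranceHead(nn.Module):
    def __init__(self, frame_dim: int, emb_dim: int = 512):
        super().__init__()
        self.fc = nn.Linear(frame_dim, emb_dim)  # fully connected projection

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, frame_dim) frame-level speaker vectors
        x = frames.mean(dim=0)              # average over the time dimension
        x = self.fc(x)                      # 512-dimensional utterance-level vector
        return F.normalize(x, p=2, dim=-1)  # length normalization

def is_target_speaker(x: torch.Tensor, y: torch.Tensor, threshold: float = 0.5) -> bool:
    # With length-normalized vectors, cos(X, Y) = X^T Y.
    return float(torch.dot(x, y)) > threshold
```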
Four experiments were performed according to the different corpora and whether the attention mechanism was added. The experimental comparison shows that the method has stronger learning ability and can exceed the performance of existing neural network methods.
In fig. 6 to 9, the ordinate ACC represents the accuracy, and the abscissa epoch represents the number of verification rounds.
FIG. 6 shows the experimental results of five different network models on the English public dataset LibriSpeech. One experiment, baseResNet, serves as the control for the present method. On the basis of this comparison experiment, the invention fine-tunes the network structure in different ways to observe the influence of different structures on the speaker verification task. The Proposed-NResNet model in FIG. 6 refers to the nested residual neural network model proposed in the present invention. In the control experiment, the residual block is stacked three times, and ResNet-bd adjusts the positions of the residual blocks in the network. Nesting and stacking two residual blocks in NResNet is one of the innovations of the invention. From the experimental results, baseResNet, ResNet-bd3, and Proposed-NResNet can all achieve reasonable performance, but the Proposed-NResNet model provided by the invention is clearly more stable and more accurate.
To make the study comprehensive, the invention selects the best three sets of experiments in FIG. 7 and then examines the results on the Chinese public dataset AISHELL. The Proposed-NResNet model in FIG. 7 refers to the nested residual neural network model proposed by the present invention. The invention takes the audio of 151 AISHELL speakers as the training set and the audio of 40 speakers as the validation set to verify that the method described here is also feasible for Chinese. The data must first be processed into the format required by the neural network. As the results in FIG. 7 show, the method of the present invention achieves excellent results on both the Chinese and English datasets.
Adding the speaker spatial attention mechanism to the deep nested neural network is the second innovation of this work. The invention experimentally compares the performance of NResNet with the spatial attention mechanism added under different language environments, observing the effect of each group separately; the details can be analyzed in FIG. 8. The Proposed-NResNet+SA model in FIG. 8 refers to the combined model of the nested residual neural network and the speaker attention mechanism proposed by the present invention. Notably, the accuracy of the proposed NResNet+SA reaches the best performance, approaching 0.99, a significant advance over the model without attention, which further illustrates the effectiveness of the attention mechanism proposed here for the speaker verification task.
Furthermore, the invention compares the effect of the method on cross-language datasets in FIG. 9. The Proposed-NResNet model in FIG. 9 refers to the nested residual neural network model proposed by the present invention. In this experiment the invention uses Train-clean-100 from LibriSpeech as the training set, while the audio of 40 speakers from the AISHELL Chinese dataset is used as the validation set. The data in the figure show that a model trained on the English dataset with the method of the invention can obtain satisfactory results on the Chinese dataset.
Table 1 compares the accuracy and equal error rate of the four neural network models used as controls and the two models proposed in the present method on the English dataset LibriSpeech and the Chinese dataset AISHELL. As can be seen from Table 1, compared with existing models, the proposed nested residual neural network Proposed-NResNet improves accuracy by about 3% on both the English and Chinese datasets with a lower equal error rate; the combination of the nested residual neural network and the speaker attention mechanism, NResNet+SA, improves accuracy by a further 3% over the nested residual neural network alone, for an overall improvement of 6% over existing models, while the equal error rate is nearly halved.
TABLE 1 Accuracy and equal error rate of each model
(The table is reproduced as an image in the original publication.)
Fig. 10 is an architecture diagram of a speaker verification system according to an embodiment of the present invention, as shown in fig. 10, including:
a speaker audio determining unit 1010, configured to pre-process audio information of a speaker, and convert the audio information into data in a preset format;
a frame-level vector determining unit 1020, configured to input the data in the preset format corresponding to the speaker audio information into a trained deep nested residual neural network based on a spatial attention mechanism to obtain a frame-level speaker vector; the deep nested residual neural network based on the spatial attention mechanism comprises four network layers, each containing two nested residual blocks, and a spatial attention mechanism; the spatial attention mechanism is introduced after the nested residual network, applies average pooling and maximum pooling in an attention module along the spatial dimensions, merges the two pooling results to retain useful information and reduce the parameter scale, and uses a sigmoid function in the activation layer of the attention module to obtain the frame-level speaker vector;
a speaker determining unit 1030 configured to generate a speaker vector at an utterance level based on the speaker vector at the frame level, and calculate a cosine similarity between the speaker vector at the utterance level and a target speaker vector to determine whether the speaker is a target speaker; the target speaker vector is pre-acquired.
Specifically, the functions of each unit in fig. 10 can be referred to the description in the foregoing method embodiment, and are not described herein again.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A speaker verification method, comprising the steps of:
preprocessing audio information of a speaker, and converting the audio information into data in a preset format;
inputting data in a preset format corresponding to speaker audio information into a trained deep nested residual neural network based on a spatial attention mechanism to obtain a frame-level speaker vector; the deep nested residual neural network based on the spatial attention mechanism comprises four network layers, each containing two nested residual blocks, and a spatial attention mechanism; the spatial attention mechanism is introduced after the nested residual network, applies average pooling and maximum pooling in an attention module along the spatial dimensions, merges the two pooling results to retain useful information and reduce the parameter scale, and uses a sigmoid function in the activation layer of the attention module to obtain the frame-level speaker vector;
generating a speaker vector of an utterance level based on the speaker vector of the frame level, and calculating cosine similarity of the speaker vector of the utterance level and a target speaker vector to determine whether the speaker is a target speaker; the target speaker vector is pre-acquired.
2. The speaker verification method according to claim 1, wherein the audio information of the speaker is preprocessed to convert the audio information into data in a preset format, specifically:
converting the WAV format audio file of the speaker into a flac format file by adopting an audio conversion technology, and preprocessing the flac format file to obtain npy format data containing all information of the speaker.
3. The speaker verification method of claim 1, wherein each nested residual block comprises two sub-residual blocks, each sub-residual block comprises two cells, each cell being a building block; a convolution layer is placed in front of every two nested residual blocks;
the two nested sub-residual blocks implement the stacking function according to the following formulas:
H₁(x) = F₁(x) + x
H₂(x) = F₂(x) + H₁(x)
H(x) = H₂(x) + x
where x represents the input data of the first nested residual block, F₁(x) represents the output of the first sub-residual block in the nested residual block, H₁(x) represents the sum of F₁(x) and x, F₂(x) represents the output of the second sub-residual block in the nested residual block, H₂(x) represents the sum of F₂(x) and H₁(x), and H(x) represents the output of the two nested residual blocks.
4. The speaker verification method according to claim 1, wherein a spatial attention mechanism is introduced after the nested residual block, and a sigmoid function is used in the activation layer of the attention module to obtain the frame-level speaker vector, with the following formulas:
F″ = f{avg_pool(V), max_pool(V)}
F′ = σ(F″)
F = Multiply(V, F′)
where V represents the speaker vector output by the nested residual neural network, avg_pool represents the average pooling operation, max_pool represents the maximum pooling operation, and f{ } represents merging the results of the two pooling operations to obtain a new speaker vector F″; F′ represents the speaker vector obtained by applying the activation function to F″; F represents the frame-level speaker vector, and Multiply represents the element-wise multiplication operation.
5. The speaker verification method according to any one of claims 1 to 4, wherein the calculating a cosine similarity between the utterance-level speaker vector and a target speaker vector to determine whether the speaker is a target speaker comprises:
setting a threshold value for the probability value of the cosine similarity, and judging that the speaker is a target speaker when the probability value of the cosine similarity is greater than the threshold value, otherwise, judging that the speaker is not the target speaker.
6. A speaker verification system, comprising:
the speaker audio determining unit is used for preprocessing audio information of a speaker and converting the audio information into data in a preset format;
the frame-level vector determining unit is used for inputting data in a preset format corresponding to speaker audio information into a trained deep nested residual neural network based on a spatial attention mechanism to obtain a frame-level speaker vector; the deep nested residual neural network based on the spatial attention mechanism comprises four network layers, each containing two nested residual blocks, and a spatial attention mechanism; the spatial attention mechanism is introduced after the nested residual network, applies average pooling and maximum pooling in an attention module along the spatial dimensions, merges the two pooling results to retain useful information and reduce the parameter scale, and uses a sigmoid function in the activation layer of the attention module to obtain the frame-level speaker vector;
a speaker confirmation unit for generating a speaker vector at an utterance level based on the speaker vector at the frame level, and calculating a cosine similarity between the speaker vector at the utterance level and a target speaker vector to determine whether the speaker is a target speaker; the target speaker vector is pre-acquired.
7. The speaker verification system according to claim 6, wherein the speaker audio determination unit converts the WAV format audio file of the speaker into a flac format file by using an audio conversion technique, and preprocesses the flac format file to obtain npy format data containing all the information of the speaker.
8. The speaker verification system of claim 6, wherein each nested residual block comprises two sub-residual blocks, each sub-residual block comprising two cells, each cell being a building block; a convolution layer is placed in front of every two nested residual blocks;
the two nested sub-residual blocks implement the stacking function according to the following formulas:
H₁(x) = F₁(x) + x
H₂(x) = F₂(x) + H₁(x)
H(x) = H₂(x) + x
where x represents the input data of the first nested residual block, F₁(x) represents the output of the first sub-residual block in the nested residual block, H₁(x) represents the sum of F₁(x) and x, F₂(x) represents the output of the second sub-residual block in the nested residual block, H₂(x) represents the sum of F₂(x) and H₁(x), and H(x) represents the output of the two nested residual blocks.
9. The speaker verification system of claim 6, wherein the frame-level vector determining unit introduces a spatial attention mechanism after the nested residual block and uses a sigmoid function in the activation layer of the attention module to obtain the frame-level speaker vector, with the following formulas:
F″ = f{avg_pool(V), max_pool(V)}
F′ = σ(F″)
F = Multiply(V, F′)
where V represents the speaker vector output by the nested residual neural network, avg_pool represents the average pooling operation, max_pool represents the maximum pooling operation, and f{ } represents merging the results of the two pooling operations to obtain a new speaker vector F″; F′ represents the speaker vector obtained by applying the activation function to F″; F represents the frame-level speaker vector, and Multiply represents the element-wise multiplication operation.
10. The speaker verification system according to any one of claims 6 to 9, wherein the speaker confirmation unit sets a threshold value for the probability value of the cosine similarity, and determines that the speaker is the target speaker when the probability value of the cosine similarity is greater than the threshold value, and otherwise determines that the speaker is not the target speaker.
CN202110496856.2A 2021-05-07 2021-05-07 Speaker confirmation method and system Active CN113345444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110496856.2A CN113345444B (en) 2021-05-07 2021-05-07 Speaker confirmation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110496856.2A CN113345444B (en) 2021-05-07 2021-05-07 Speaker confirmation method and system

Publications (2)

Publication Number Publication Date
CN113345444A (en) 2021-09-03
CN113345444B (en) 2022-10-28

Family

ID=77469818

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110496856.2A Active CN113345444B (en) 2021-05-07 2021-05-07 Speaker confirmation method and system

Country Status (1)

Country Link
CN (1) CN113345444B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115050373B (en) * 2022-04-29 2024-09-06 思必驰科技股份有限公司 Dual path embedded learning method, electronic device, and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110570869A (en) * 2019-08-09 2019-12-13 科大讯飞股份有限公司 Voiceprint recognition method, device, equipment and storage medium
WO2020073694A1 (en) * 2018-10-10 2020-04-16 腾讯科技(深圳)有限公司 Voiceprint identification method, model training method and server
CN111179196A (en) * 2019-12-28 2020-05-19 杭州电子科技大学 Multi-resolution depth network image highlight removing method based on divide-and-conquer

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11276410B2 (en) * 2019-09-13 2022-03-15 Microsoft Technology Licensing, Llc Convolutional neural network with phonetic attention for speaker verification

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020073694A1 (en) * 2018-10-10 2020-04-16 腾讯科技(深圳)有限公司 Voiceprint identification method, model training method and server
CN110570869A (en) * 2019-08-09 2019-12-13 科大讯飞股份有限公司 Voiceprint recognition method, device, equipment and storage medium
CN111179196A (en) * 2019-12-28 2020-05-19 杭州电子科技大学 Multi-resolution depth network image highlight removing method based on divide-and-conquer

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on speaker recognition based on the 3A-RCNN network; Li Jianwen; Electronic Technology & Software Engineering; 2020-07-31; full text *

Also Published As

Publication number Publication date
CN113345444A (en) 2021-09-03

Similar Documents

Publication Publication Date Title
CN110600047B (en) Perceptual STARGAN-based multi-to-multi speaker conversion method
JP3412496B2 (en) Speaker adaptation device and speech recognition device
CN109272988B (en) Voice recognition method based on multi-path convolution neural network
CN113488058B (en) Voiceprint recognition method based on short voice
CN109584893B (en) VAE and i-vector based many-to-many voice conversion system under non-parallel text condition
CN105261367B (en) A kind of method for distinguishing speek person
EP3486903B1 (en) Identity vector generating method, computer apparatus and computer readable storage medium
WO2019047343A1 (en) Voiceprint model training method, voice recognition method, device and equipment and medium
CN110047504B (en) Speaker identification method under identity vector x-vector linear transformation
Fang et al. Channel adversarial training for cross-channel text-independent speaker recognition
CN115206293B (en) Multi-task air traffic control voice recognition method and device based on pre-training
CN106601258A (en) Speaker identification method capable of information channel compensation based on improved LSDA algorithm
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
Song et al. Triplet network with attention for speaker diarization
CN113345444B (en) Speaker confirmation method and system
CN110600046A (en) Many-to-many speaker conversion method based on improved STARGAN and x vectors
CN113763965A (en) Speaker identification method with multiple attention characteristics fused
Cumani et al. Speaker recognition using e–vectors
CN112562725A (en) Mixed voice emotion classification method based on spectrogram and capsule network
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
CN104464738B (en) A kind of method for recognizing sound-groove towards Intelligent mobile equipment
CN116312512A (en) Multi-person scene-oriented audiovisual fusion wake-up word recognition method and device
CN110428841A (en) A kind of vocal print dynamic feature extraction method based on random length mean value
Ananthi et al. Speech recognition system and isolated word recognition based on Hidden Markov model (HMM) for Hearing Impaired
CN111462762B (en) Speaker vector regularization method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant