CN116863939A - Voiceprint recognition method based on global attention mechanism and adopting DenseNet-LSTM-ED

Info

Publication number: CN116863939A
Application number: CN202310826924.6A
Authority: CN
Prior art keywords: information, processing, matrix, spectrogram, loss function
Legal status: Pending
Inventor: 王鲁昆 (Wang Lukun)
Assignee (current and original): Jiangsu Wuzheng Information Technology Co., Ltd.
Priority date / Filing date: 2023-07-07
Publication date: 2023-10-10
Other languages: Chinese (zh)

Classifications

    • G10L 17/02 — Speaker identification or verification techniques; preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L 17/18 — Speaker identification or verification techniques; artificial neural networks; connectionist approaches
    • G10L 17/22 — Speaker identification or verification techniques; interactive procedures; man-machine interfaces
    • G10L 25/18 — Speech or voice analysis techniques characterised by the extracted parameters being spectral information of each sub-band
    • G10L 25/21 — Speech or voice analysis techniques characterised by the extracted parameters being power information
    • G10L 25/30 — Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L 25/45 — Speech or voice analysis techniques characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a voiceprint recognition method using DenseNet-LSTM-ED with a global attention mechanism, belonging to the technical field of voiceprint recognition. First, the voice signal is divided into frames, windowed, Fourier-transformed, converted to an energy density spectrum, log-transformed, and color-mapped to obtain its corresponding spectrogram. The spectrogram is then processed in parallel by a DenseNet module, an LSTM unit, and an ED module: the processing results of the DenseNet module and the LSTM unit are fused into spatio-temporal fusion information, while the ED module processes the spectrogram into enhancement information. The spatio-temporal fusion information and the enhancement information are then fused into spatio-temporal enhancement information, and a global attention mechanism assigns it different weights so that key-frame speech contributes more to the recognition result. Finally, speaker classification is achieved through joint supervision of a Softmax loss function and a Center Loss function.

Description

Voiceprint recognition method based on global attention mechanism and adopting DenseNet-LSTM-ED
Technical Field
The application relates to the technical field of voiceprint recognition, and in particular to a DenseNet-LSTM-ED voiceprint recognition method based on a global attention mechanism.
Background
Speaker recognition, also known as voiceprint recognition, is a technique that identifies a speaker by voice. A voiceprint is a set of characteristic parameters extracted from speech that characterizes the identity of the speaker. Like biometric features such as the iris or DNA, voiceprints differ between people: because physiological structures such as the nasal and oral cavities differ from person to person, different speakers have different pronunciation styles and habits, and even deliberate imitation cannot reproduce the essential features that carry a speaker's identity information. Voiceprint recognition can therefore distinguish people of different identities through these features and determine the identity of a speaker. With the rapid development of deep learning, voiceprint recognition based on deep learning has stronger feature-representation capability than traditional methods and can extract higher-dimensional abstract features from speech. However, most current deep-learning speaker recognition algorithms consider only the spatial or only the temporal characteristics of speech, which makes model training difficult and accuracy low.
Disclosure of Invention
Based on this, there is a need to address these technical problems. The present application provides a DenseNet-LSTM-ED voiceprint recognition method based on a global attention mechanism, which includes the following steps:
S100: obtain the spectrogram corresponding to the voice signal by dividing the voice signal into frames, windowing, Fourier transform, energy density spectrum calculation, logarithmic transform, and color mapping;
S200: take the spectrogram obtained in step S100 as input to a DenseNet module for spatial feature extraction, obtaining the spatial information of the voice signal;
S300: copy the information of the spectrogram obtained in step S100 and send it to LSTM units; after passing through t LSTM units, the timing information of the voice signal is fully extracted;
S400: copy the information of the spectrogram obtained in step S100 and send it to an ED module; the ED module performs deconvolution processing of the spectrogram information, trend information processing of the spectrogram information, fusion of the deconvolved information and the trend-processed information, and convolution of the fused result to generate enhancement information;
S500: splice the processing results of the DenseNet module and the LSTM unit to form spatio-temporal fusion information, fuse the spatio-temporal fusion information with the enhancement information ED to form spatio-temporal enhancement information, assign different weights to the spatio-temporal enhancement information with an attention mechanism, combine a Softmax loss function and a Center Loss function into a total loss function, and identify the voiceprint category with the total loss function.
Further, step S200 includes: the DenseNet module comprises one initial convolution, N dense connection modules (Dense Blocks), and a plurality of transmission layers (Transitions). A Dense Block contains the feature maps x_0, x_1, ..., x_{l-1}, x_l, where x_0, x_1, ..., x_{l-1}, x_l are the feature maps of layer 0, layer 1, ..., layer l-1, and layer l. The feature map of layer l is obtained by concatenating the feature maps of all preceding layers, applying the nonlinear transformation H_l(·) to obtain the spliced feature information H_l([x_0, x_1, ..., x_{l-1}]), and passing this spliced feature information through the feature mapping of the activation function γ(·):
x_l = γ(H_l([x_0, x_1, ..., x_{l-1}]))
where γ(·) denotes the activation function, and λ_1 and λ_2 are non-integer multiplier factors.
Further, step S400 includes:
Step S401: deconvolution processing of the spectrogram information:
O_t = s_1·(a_1 − 1) + k_1 − 2·p_1
where a_1 is the matrix of spectrogram pixels; s_1 is the stride by which the convolution kernel is shifted; k_1 is the size of the convolution kernel; p_1 is the first padding matrix when the kernel size does not match the size of the spectrogram matrix a_1, and p_1 is 0 when they match; O_t is the information matrix after deconvolution.
Step S402: trend information processing of the spectrogram information.
In the matrix of spectrogram pixels, the trend information at each pixel is obtained by a numerical calculation over the k periods around its position coordinates (i, g): the trend information x̃_(i,g) at coordinates (i, g) is computed from the original information x_(i,g) at (i, g) and the original information x_(i,g+j) at the neighbouring coordinates (i, g+j), with j ∈ (−k, k) and k a positive integer.
Applying this calculation at every pixel position of the spectrogram yields the trend-processed information matrix D_t = [x̃_(i,g)]_{n×m}, where n and m are positive integers giving the number of rows and columns of the spectrogram matrix.
Step S403: fusing the deconvolved information and the trend-processed information.
The information matrix obtained after deconvolution is O_t and the information matrix obtained by trend information processing is D_t; the two are fused into the information fusion matrix OD:
OD = β·O_t + r·D_t
where β is the balance parameter of the deconvolved information matrix O_t, and r is a trend scaling factor that controls the contribution of the trend-processed information matrix D_t.
Step S404: convolving the fused information matrix OD to generate the enhancement information. The information fusion matrix OD is taken as the input of the convolution and features are extracted from it:
ED = (OD − k_2 + 2·p_2)/s_2 + 1
where ED is the enhancement information; k_2 is the size of the convolution kernel; s_2 is the stride of the convolution kernel; p_2 is the second padding matrix when the kernel size does not match the size of the information fusion matrix OD, and p_2 is 0 when they match.
Further, step S500 includes:
S501: splice the processing results of the DenseNet module and the LSTM unit to form spatio-temporal fusion information, and fuse the spatio-temporal fusion information with the enhancement information to form spatio-temporal enhancement information;
S502: apply the weights assigned by the attention mechanism to the spatio-temporal enhancement information, identify the voiceprint category using the key-frame speech, and predict the voiceprint category;
S503: combine the Softmax loss function with the Center Loss function to form a total loss function, and compute the difference between the true voiceprint category and the predicted voiceprint category with the total loss function to obtain a loss value;
S504: judge whether the loss value equals a preset value; if so, recognition of the voiceprint category is complete; if not, proceed to step S505;
S505: assign new weights to the spatio-temporal enhancement information with the attention mechanism, identify the voiceprint category again using the key-frame speech, and predict the voiceprint category again;
S506: combine the Softmax loss function with the Center Loss function again to form the total loss function, and compute the difference between the true voiceprint category and the newly predicted voiceprint category to obtain a new loss value;
S507: judge whether the new loss value equals the preset value; if so, recognition of the voiceprint category is complete; if not, repeat steps S505 to S507 until the loss value equals the preset value and the voiceprint classification is complete.
Further, the spatio-temporal fusion information and the enhancement information are added to obtain the spatio-temporal enhancement information M, the spatio-temporal enhancement information M being composed of several sub-spatio-temporal enhancement vectors m_1, m_2, ..., m_t, ..., m_s of the same dimension, where t and s are positive integers; a global attention mechanism is provided that assigns different weights to the spatio-temporal enhancement information M, as follows:
compute the similarity α_ti between the sub-spatio-temporal enhancement vector m_t and the sub-spatio-temporal enhancement vector m_i:
α_ti = exp(score(m_t, m_i)) / Σ_{i'=1..s} exp(score(m_t, m_{i'}))
with the corresponding scoring function
score(m_t, m_i) = m_t^T·m_i
and then obtain the vector C_t by weighted averaging:
C_t = Σ_{i=1..s} α_ti·m_i
Further, the sub-spatio-temporal enhancement vector m_t and the vector C_t are concatenated head to tail and multiplied by the weight matrix W_c of the attention mechanism, and the spatio-temporal enhancement information based on the global attention mechanism is computed as
M̃_t = tanh(W_c·[C_t; m_t])
where tanh is the hyperbolic tangent function;
finally, the attention weight values are normalized with a softmax classification layer:
p = softmax(W_t·M̃_t)
where p is the probability; normalization restricts the values to the interval [0, 1], so the numbers in this range can represent probabilities; W_t is the feature weight corresponding to M̃_t.
Further, step S500 includes: the total loss function is as follows:
L = √(L_s + λ·L_c)
where L denotes the total loss function, L_s denotes the Softmax loss function, L_c denotes the Center Loss function, and λ is a factor used to balance the two loss functions.
Further, the Softmax loss function is as follows:
L_s = − Σ_{i=1..m} log( exp(W_{y_i}^T·x_i + b_{y_i}) / Σ_{j=1..n} exp(W_j^T·x_i + b_j) )
where x_i denotes the i-th feature and y_i is the true class label of x_i; W_{y_i} and W_j denote the weight vectors for classifying x_i into class y_i and into class j, respectively; b_{y_i} and b_j denote the bias terms of class y_i and class j; m denotes the size of the mini-batch; n denotes the number of classes.
Further, the Center Loss function is as follows:
L_c = (1/2)·Σ_{i=1..m} ‖x_i − c_{y_i}‖²
where c_{y_i} denotes the class center of the features of class y_i, x_i denotes the i-th feature, and m denotes the size of the mini-batch.
The beneficial technical effects of the application are as follows:
(1) The voice signal is divided into frames, windowed, Fourier-transformed, converted to an energy density spectrum, log-transformed, and color-mapped to obtain its corresponding spectrogram. The spectrogram is then processed by a DenseNet module, an LSTM unit, and an ED module: the processing results of the DenseNet module and the LSTM unit are fused into spatio-temporal fusion information, and the ED module processes the spectrogram into enhancement information. The spatio-temporal fusion information and the enhancement information are then fused into spatio-temporal enhancement information, and a global attention mechanism assigns it different weights so that key-frame speech contributes more to the recognition result. Finally, speaker classification is achieved through joint supervision of the Softmax loss function and the Center Loss function.
(2) The spectrogram is used as the input form of the voice signal; it retains both the spatial characteristic information and the timing characteristic information that changes over time, ensuring the richness of the input. The DenseNet module, the LSTM unit, and the ED module each process the information from the spectrogram: the feature-reuse property and the accumulation of convolution layers fully reflect the original sequential character of the speech, while the LSTM unit controls the flow of information with its gating mechanism and memorizes the sequential relationship between preceding and following units. The information is passed to the three modules in parallel so that each mines feature information with a different emphasis, realizing multi-dimensional feature mining and the extraction and combination of features of different dimensions from the voice signal, which improves recognition accuracy.
(3) An attention mechanism is introduced and the loss function is improved. The local spatial features and the temporal features extracted by the two networks are fused together with the enhancement information; attention is allocated to the fused information in the form of probability-distribution weights so that it focuses on the effective information; and training is performed under joint supervision of the Softmax loss function and the Center Loss function, which improves intra-class compactness and inter-class separability and effectively raises recognition accuracy.
Drawings
Fig. 1 is a diagram of a spectrogram generation process.
Fig. 2 is a block diagram of the DenseNet module.
Fig. 3 is a Block diagram of the Dense Block module.
Fig. 4 is a structural diagram of the LSTM cell.
FIG. 5 is a flow chart of a method for voiceprint recognition based on DenseNet-LSTM-ED by a global attention mechanism.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present disclosure clearer, the embodiments are described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here merely illustrate the disclosure and are not intended to limit it.
The structures, proportions, and sizes shown in the drawings are provided only to illustrate the content of the disclosure and do not limit the scope of the application; any modification of structure, change of proportion, or adjustment of size that does not affect the effect that can be produced or the objective that can be achieved still falls within the scope of the disclosed technical content. The terms recited in this specification are used for convenience of description only and do not limit the practicable scope of the application; changes or adjustments of their relative relationships, without substantial change to the technical content, are likewise considered within the practicable scope of the application. In addition, the embodiments of the present application are not independent of each other and may be combined.
The spectrogram is a two-dimensional image that aggregates the frequency-domain characteristics of the speech signal over the time domain and dynamically displays how the frequency spectrum changes with time. The spectrogram contains spatial characteristic information formed by the time-frequency content and its energy intensity, as well as timing characteristic information that changes over time. Different textures form according to the colour depth, these textures contain a large amount of the speaker's personal characteristic information, and different speakers can be identified from differences in spectrogram texture.
In the spectrogram generation process shown in fig. 1, the voice signal is divided into several frame signals according to its short-time stationarity; each frame signal is windowed and Fourier-transformed to obtain the amplitude-frequency characteristics of the signal, the energy spectral density is then calculated, and the spectrogram corresponding to the voice signal is obtained by applying a logarithmic transform and color mapping to the energy spectral density.
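To make the pipeline of fig. 1 concrete, the following is a minimal sketch of spectrogram generation (framing, windowing, Fourier transform, energy density, logarithmic transform, colour mapping). The sampling rate, frame length, hop size, FFT size, and colour map are illustrative assumptions, not values specified in this application.

```python
import numpy as np
from scipy.signal import stft
import matplotlib.cm as cm

def make_spectrogram(signal, sr=16000, frame_len=400, hop=160, n_fft=512):
    # Framing + Hamming window + FFT per frame (short-time Fourier transform).
    _, _, Z = stft(signal, fs=sr, window="hamming",
                   nperseg=frame_len, noverlap=frame_len - hop, nfft=n_fft)
    power = np.abs(Z) ** 2                       # energy (power) spectral density
    log_power = 10.0 * np.log10(power + 1e-10)   # logarithmic transform
    # Normalise to [0, 1] and apply a colour map to obtain an RGB spectrogram image.
    norm = (log_power - log_power.min()) / (log_power.max() - log_power.min() + 1e-10)
    return cm.viridis(norm)[..., :3]             # shape: (freq_bins, frames, 3)

# Example: a 1.5 s clip at 16 kHz, the clip length used in the experiments below.
spec_img = make_spectrogram(np.random.randn(24000))
```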
Following image processing methods based on deep learning, a spectrogram of the speaker is generated according to the method shown in fig. 1 and, as original data, is input into the DenseNet module for spatial information extraction. As shown in fig. 2, the DenseNet module includes one initial convolution, N dense connection modules (Dense Blocks), and a plurality of transmission layers (Transitions). A DenseNet module of L layers contains L(L+1)/2 connections; each connection outputs one layer of feature maps, so L feature maps of different layers are obtained, and the feature maps of different layers contain different spatial characteristic information.
As shown in fig. 3, a Dense Block contains the feature maps x_0, x_1, ..., x_{l-1}, x_l, where x_0, x_1, ..., x_{l-1}, x_l are the feature maps of layer 0, layer 1, ..., layer l-1, and layer l. The feature map of layer l is obtained by concatenating the feature maps of all preceding layers, applying the nonlinear transformation H_l(·) to obtain the spliced feature information H_l([x_0, x_1, ..., x_{l-1}]), and passing this spliced feature information through the feature mapping of the activation function γ(·):
x_l = γ(H_l([x_0, x_1, ..., x_{l-1}]))
where γ(·) denotes the activation function, and λ_1 and λ_2 are non-integer multiplier factors.
The nonlinear transformation H(x) consists of a 1×1 convolution and a 3×3 convolution. The 1×1 convolution is called the bottleneck layer (Bottleneck Layer); its number of output channels is 4·K1, where K1 is a hyperparameter called the growth rate. Its function is to fuse the features of each channel and, through dimension reduction, reduce the number of feature maps fed into the 3×3 convolution, thereby lowering the amount of computation in the network.
To prevent the feature dimension from growing too fast as the number of network layers increases, the Dense Blocks are connected through transmission layers (Transition). Assuming the number of input channels of a transmission layer is K2, a 1×1 convolution compresses the K2 input channels, so the transmission layer acts as a form of model compression; a 2×2 pooling operation then reduces the size of the feature maps and the number of network parameters.
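As an illustration of the structure just described, the following PyTorch sketch builds one Dense Block (a 1×1 bottleneck with 4·K1 output channels followed by a 3×3 convolution, every layer receiving the concatenation of all earlier feature maps) and one transmission layer (1×1 channel compression plus 2×2 pooling). The growth rate, compression ratio, depth, and input size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    def __init__(self, in_ch, growth_rate):
        super().__init__()
        self.h = nn.Sequential(                        # nonlinear transform H_l(x) + activation
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, 4 * growth_rate, 1, bias=False),   # bottleneck: 4*K1 channels
            nn.BatchNorm2d(4 * growth_rate), nn.ReLU(inplace=True),
            nn.Conv2d(4 * growth_rate, growth_rate, 3, padding=1, bias=False),
        )

    def forward(self, feats):                          # feats = [x0, x1, ..., x_{l-1}]
        return self.h(torch.cat(feats, dim=1))         # x_l = gamma(H_l([x0, ..., x_{l-1}]))

class DenseBlock(nn.Module):
    def __init__(self, in_ch, growth_rate, n_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer(in_ch + i * growth_rate, growth_rate) for i in range(n_layers))

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(feats))                 # each layer sees all previous feature maps
        return torch.cat(feats, dim=1)

class Transition(nn.Module):
    def __init__(self, in_ch, compression=0.5):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, int(in_ch * compression), 1, bias=False)  # channel compression
        self.pool = nn.AvgPool2d(2)                    # 2x2 pooling halves the feature-map size

    def forward(self, x):
        return self.pool(self.reduce(x))

# Example: a 3-channel spectrogram image through one Dense Block and one Transition.
x = torch.randn(1, 3, 64, 64)
block = DenseBlock(3, growth_rate=12, n_layers=4)
out = Transition(3 + 4 * 12)(block(x))
```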
When DenseNet is applied to voiceprint recognition to process the spectrogram, the spectrogram can be regarded as a texture image, and the speaker's vocal personality characteristics are reflected in the spatial geometric and timing relationships between pixels. When DenseNet is used for voiceprint recognition, its feature-reuse property and the features accumulated by the convolution and pooling layers can fully reflect the original sequential character of the speech. However, as the number of convolution and pooling layers grows, the amount of data in the network increases, and when the span between DenseNet's bottleneck layers and transmission layers is long, long-term dependence problems can occur, causing gradients to explode or vanish.
The LSTM unit, also called an LSTM network module or LSTM network, uses the gating concept to control the flow of information through a gating mechanism, which alleviates the vanishing-gradient problem. An LSTM unit generally comprises an input gate, an output gate, and a forget gate: the input gate determines which of the current input information is stored, the output gate determines the information output by the unit, and the forget gate decides which information is forgotten. The structure of the LSTM unit is shown in fig. 4; since the LSTM network belongs to the prior art, it is not described in detail here.
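For reference, a minimal sketch of feeding the spectrogram to LSTM units frame by frame is shown below; the hidden size and the use of the spectrogram columns as time steps are assumptions for illustration.

```python
import torch
import torch.nn as nn

freq_bins, frames = 257, 150                  # spectrogram treated as (time steps, features)
spec = torch.randn(1, frames, freq_bins)      # batch of 1, t = 150 time steps

lstm = nn.LSTM(input_size=freq_bins, hidden_size=256, batch_first=True)
hidden_states, (h_t, c_t) = lstm(spec)        # one hidden state per LSTM unit / time step
print(hidden_states.shape)                    # torch.Size([1, 150, 256]) -> t hidden states
```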
As shown in fig. 5, the spectrogram of fig. 1 is input as original data to the ED module. The ED module performs deconvolution processing of the spectrogram information, trend information processing of the spectrogram information, fusion of the deconvolved information and the trend-processed information, and convolution of the fused result to generate the enhancement information. The deconvolution processing unfolds the spectrogram information through a transposed-convolution view of the matrix to obtain the high-dimensional information and main features behind it, realizing feature learning at different scales; the trend information processing learns the spectrogram information and analyses the trend contained in the surrounding content. In some embodiments, the trend information contained in the spectrogram is calculated by mean square error. The convolution processing then performs comprehensive feature extraction on the deconvolved information and the trend-processed information and generates the enhancement information. Using the convolution in this way avoids the noise introduced when the spectrogram information is processed by deconvolution alone and the information loss that occurs when it is processed by trend information alone.
Because information at different moments of the input speech contributes with different weight to the state at the current moment, information from moments too far in the past has little influence on the current moment, and the model needs to evaluate the importance of the outputs produced at different moments, a global attention mechanism is introduced. The attention mechanism is an automatic weighting mechanism that links different modules in weighted form, forcing the model to learn to concentrate on specific parts of the input sequence: key parts receive more attention, and specific regions are assigned larger weights through the calculation of an attention probability distribution.
Therefore, the application provides a voiceprint recognition method based on a global attention mechanism, which comprises the following steps:
S100: obtain the spectrogram corresponding to the voice signal by dividing the voice signal into frames, windowing, Fourier transform, energy density spectrum calculation, logarithmic transform, and color mapping;
S200: take the spectrogram obtained in step S100 as input to a DenseNet module for spatial feature extraction, obtaining the spatial information of the voice signal;
S300: copy the information of the spectrogram obtained in step S100 and send it to LSTM units; after passing through t LSTM units, the timing information of the voice signal is fully extracted;
S400: copy the information of the spectrogram obtained in step S100 and send it to an ED module; the ED module performs deconvolution processing of the spectrogram information, trend information processing of the spectrogram information, fusion of the deconvolved information and the trend-processed information, and convolution of the fused result to generate the enhancement information.
Specifically, step S400 includes:
Step S401: deconvolution processing of the spectrogram information:
O_t = s_1·(a_1 − 1) + k_1 − 2·p_1
where a_1 is the matrix of spectrogram pixels; s_1 is the stride by which the convolution kernel is shifted; k_1 is the size of the convolution kernel; p_1 is the first padding matrix when the kernel size does not match the size of the spectrogram matrix a_1, and p_1 is 0 when they match; O_t is the information matrix after deconvolution.
Step S402: trend information processing of the spectrogram information.
In the matrix of spectrogram pixels, the trend information at each pixel is obtained by a numerical calculation over the k periods around its position coordinates (i, g): the trend information x̃_(i,g) at coordinates (i, g) is computed from the original information x_(i,g) at (i, g) and the original information x_(i,g+j) at the neighbouring coordinates (i, g+j), with j ∈ (−k, k) and k a positive integer.
Applying this calculation at every pixel position of the spectrogram yields the trend-processed information matrix D_t = [x̃_(i,g)]_{n×m}, where n and m are positive integers giving the number of rows and columns of the spectrogram matrix.
Step S403: fusing the deconvolved information and the trend-processed information.
The information matrix obtained after deconvolution is O_t and the information matrix obtained by trend information processing is D_t; the two are fused into the information fusion matrix OD:
OD = β·O_t + r·D_t
where β is the balance parameter of the deconvolved information matrix O_t, and r is a trend scaling factor that controls the contribution of the trend-processed information matrix D_t.
Step S404: convolving the fused information matrix OD to generate the enhancement information. The information fusion matrix OD is taken as the input of the convolution and features are extracted from it:
ED = (OD − k_2 + 2·p_2)/s_2 + 1
where ED is the enhancement information; k_2 is the size of the convolution kernel; s_2 is the stride of the convolution kernel; p_2 is the second padding matrix when the kernel size does not match the size of the information fusion matrix OD, and p_2 is 0 when they match. A sketch of steps S401 to S404 is given below.
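The following PyTorch sketch illustrates steps S401-S404 end to end. The kernel sizes, the window size k, the balance parameter β, the trend scaling factor r, and the use of a windowed mean-square deviation for the trend information are illustrative assumptions consistent with the description above, not the exact formulas of this application.

```python
import torch
import torch.nn as nn

class EDModule(nn.Module):
    def __init__(self, channels=1, k=2, beta=0.7, r=0.3):
        super().__init__()
        self.k, self.beta, self.r = k, beta, r
        # S401: deconvolution (transposed convolution); output size follows
        # Ot = s1*(a1-1) + k1 - 2*p1, so with s1=1, k1=3, p1=1 the size is preserved.
        self.deconv = nn.ConvTranspose2d(channels, channels, kernel_size=3, stride=1, padding=1)
        # S404: convolution over the fused matrix OD producing the enhancement information ED.
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)

    def trend(self, x):
        # S402: trend information at each pixel from the 2k neighbouring values along the
        # time axis (here a mean-square deviation; torch.roll wraps around at the edges,
        # a simplification of the k-period window).
        neighbours = []
        for j in range(1, self.k + 1):
            neighbours.append(torch.roll(x, shifts=j, dims=-1))
            neighbours.append(torch.roll(x, shifts=-j, dims=-1))
        window = torch.stack(neighbours, dim=0)
        return ((window - x.unsqueeze(0)) ** 2).mean(dim=0)   # Dt, same shape as x

    def forward(self, spec):
        ot = self.deconv(spec)                 # S401: deconvolved information matrix Ot
        dt = self.trend(spec)                  # S402: trend information matrix Dt
        od = self.beta * ot + self.r * dt      # S403: fusion matrix OD = beta*Ot + r*Dt
        return self.conv(od)                   # S404: enhancement information ED

ed = EDModule()
enhancement = ed(torch.randn(1, 1, 64, 64))
```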
S500: splice the processing results of the DenseNet module and the LSTM unit to form spatio-temporal fusion information, fuse the spatio-temporal fusion information with the enhancement information ED to form spatio-temporal enhancement information, assign different weights to the spatio-temporal enhancement information with an attention mechanism, combine a Softmax loss function and a Center Loss function into a total loss function, and identify the voiceprint category with the total loss function.
Specifically, step S500 includes the following steps:
S501: splice the processing results of the DenseNet module and the LSTM unit to form spatio-temporal fusion information, and fuse the spatio-temporal fusion information with the enhancement information to form spatio-temporal enhancement information;
S502: apply the weights assigned by the attention mechanism to the spatio-temporal enhancement information, identify the voiceprint category using the key-frame speech, and predict the voiceprint category;
S503: combine the Softmax loss function with the Center Loss function to form a total loss function, and compute the difference between the true voiceprint category and the predicted voiceprint category with the total loss function to obtain a loss value;
S504: judge whether the loss value equals a preset value; if so, recognition of the voiceprint category is complete; if not, proceed to step S505;
S505: assign new weights to the spatio-temporal enhancement information with the attention mechanism, identify the voiceprint category again using the key-frame speech, and predict the voiceprint category again;
S506: combine the Softmax loss function with the Center Loss function again to form the total loss function, and compute the difference between the true voiceprint category and the newly predicted voiceprint category to obtain a new loss value;
S507: judge whether the new loss value equals the preset value; if so, recognition of the voiceprint category is complete; if not, repeat steps S505 to S507 until the loss value equals the preset value and the voiceprint classification is complete (a minimal sketch of this loop follows).
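The sketch below illustrates the loop in steps S501-S507: the model is run repeatedly (the attention mechanism assigns new weights on each forward pass), the joint loss is computed, and training stops once the loss reaches the preset value. The model, data loader, joint loss, optimizer, and threshold are placeholders, not components defined by this application.

```python
def train_until_threshold(model, loader, joint_loss, optimizer, preset=0.05, max_epochs=100):
    loss = None
    for epoch in range(max_epochs):
        for spectrograms, labels in loader:
            optimizer.zero_grad()
            logits, embeddings = model(spectrograms)     # attention re-weights on each forward pass
            loss = joint_loss(logits, embeddings, labels)
            loss.backward()
            optimizer.step()
        if loss is not None and loss.item() <= preset:   # S504/S507: compare with the preset value
            break                                        # voiceprint classification is complete
    return model
```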
In some embodiments, the spatio-temporal fusion information is a spatio-temporal fusion information matrix and the enhancement information is an enhancement information matrix, and the two are fused to form the spatio-temporal enhancement information. Specifically, the spatio-temporal fusion information matrix and the corresponding dimensions of the enhancement information matrix ED are added to obtain the spatio-temporal enhancement information M, which is composed of several sub-spatio-temporal enhancement vectors m_1, m_2, ..., m_t, ..., m_s of the same dimension, where t and s are positive integers. The application provides a global attention mechanism that assigns different weights to the spatio-temporal enhancement information M, as follows:
compute the similarity α_ti between the sub-spatio-temporal enhancement vector m_t and the sub-spatio-temporal enhancement vector m_i:
α_ti = exp(score(m_t, m_i)) / Σ_{i'=1..s} exp(score(m_t, m_{i'}))
with the corresponding scoring function
score(m_t, m_i) = m_t^T·m_i
and then obtain the vector C_t by weighted averaging:
C_t = Σ_{i=1..s} α_ti·m_i
The sub-spatio-temporal enhancement vector m_t and the vector C_t are concatenated head to tail and multiplied by the weight matrix W_c of the attention mechanism, and the spatio-temporal enhancement information based on the global attention mechanism is computed as
M̃_t = tanh(W_c·[C_t; m_t])
where tanh is the hyperbolic tangent function.
Finally, the attention weight values are normalized with a softmax classification layer:
p = softmax(W_t·M̃_t)
where p is the probability; normalization restricts the values to the interval [0, 1], so the numbers in this range can represent probabilities; W_t is the feature weight corresponding to M̃_t.
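The following PyTorch sketch implements the global attention step described above: similarity scores between sub-vectors, the weighted-average context vector C_t, head-to-tail concatenation with m_t, multiplication by W_c with a tanh activation, and a final softmax layer. Treating the last sub-vector as the query m_t, the dot-product scoring form, and all dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalAttention(nn.Module):
    def __init__(self, dim, n_classes):
        super().__init__()
        self.Wc = nn.Linear(2 * dim, dim, bias=False)    # attention weight matrix Wc
        self.Wt = nn.Linear(dim, n_classes)              # feature weights feeding the softmax layer

    def forward(self, M):                                # M: (batch, s, dim) = [m1, ..., ms]
        m_t = M[:, -1, :]                                # query vector m_t (last sub-vector, assumed)
        scores = torch.bmm(M, m_t.unsqueeze(2)).squeeze(2)           # score(m_t, m_i) = m_t . m_i
        alpha = F.softmax(scores, dim=1)                 # similarity weights alpha_ti
        C_t = torch.bmm(alpha.unsqueeze(1), M).squeeze(1)             # weighted-average context vector
        M_tilde = torch.tanh(self.Wc(torch.cat([C_t, m_t], dim=1)))   # tanh(Wc [C_t; m_t])
        p = F.softmax(self.Wt(M_tilde), dim=1)           # normalised probabilities p
        return p, M_tilde

att = GlobalAttention(dim=256, n_classes=400)            # 400 speakers in the experiments below
probs, features = att(torch.randn(8, 10, 256))
```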
In some embodiments, to improve the characterization capability of the voiceprint features and give them good intra-class compactness and inter-class separability, the application trains the model under joint supervision of two loss functions, the Softmax loss function and the Center Loss function. The specific expression of the total loss function L is:
L = √(L_s + λ·L_c)
where L denotes the total loss function, L_s denotes the Softmax loss function, L_c denotes the Center Loss function, and λ is a factor for balancing the two loss functions, i.e. it controls the proportion of each loss; the value of λ ranges from 0 to 1, and taking the square root of the combined loss helps reduce the error contributed by a miscalculated loss. The Softmax loss function is expressed as follows:
L_s = − Σ_{i=1..m} log( exp(W_{y_i}^T·x_i + b_{y_i}) / Σ_{j=1..n} exp(W_j^T·x_i + b_j) )
where x_i denotes the i-th feature and y_i is the true class label of x_i; W_{y_i} and W_j denote the weight vectors for classifying x_i into class y_i and into class j, i.e. the y_i-th and j-th columns of the weight W of the last fully connected layer; b_{y_i} and b_j denote the bias terms of class y_i and class j; m is the size of the mini-batch and n the number of classes.
The Center Loss function is as follows:
L_c = (1/2)·Σ_{i=1..m} ‖x_i − c_{y_i}‖²
where c_{y_i} denotes the class center of the features of class y_i. The center loss function provides a class center for each class, so that every training sample is drawn towards the center of its own class, producing a clustering effect.
A network trained under the supervision of the Softmax loss function alone can separate different classes, but does not consider the compactness of the features within a class; the center loss function minimizes intra-class distance but does not consider separability between classes. The algorithm therefore combines the two, which improves intra-class compactness and inter-class separability and enables high-precision voiceprint recognition.
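A minimal sketch of the joint supervision described above, combining the Softmax loss and the Center Loss as L = √(L_s + λ·L_c), is shown below. The feature dimension, the value of λ, and the strategy of learning the class centers as ordinary parameters are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointLoss(nn.Module):
    def __init__(self, n_classes, feat_dim, lam=0.01):
        super().__init__()
        self.lam = lam
        self.centers = nn.Parameter(torch.randn(n_classes, feat_dim))  # one center c_{y_i} per class

    def forward(self, logits, features, labels):
        ls = F.cross_entropy(logits, labels)                                   # Softmax loss Ls
        lc = 0.5 * ((features - self.centers[labels]) ** 2).sum(dim=1).mean()  # Center Loss Lc (batch-averaged)
        return torch.sqrt(ls + self.lam * lc)                                  # total loss L

# Example: 400 speaker classes, 256-dimensional embeddings, a mini-batch of 8.
criterion = JointLoss(n_classes=400, feat_dim=256)
loss = criterion(torch.randn(8, 400), torch.randn(8, 256), torch.randint(0, 400, (8,)))
```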
Experimental data set: the speech data used in the experiments come from the Hill Shell (AISHELL) Mandarin Chinese AISHELL-ASR0009-OS1 open-source speech database. 400 speakers from different accent regions of China took part in the recording, which was carried out in a quiet indoor environment. Each speaker recorded more than 300 speech segments, and the utterances of the same speaker are placed in one folder. For the experiment, 10 utterances were randomly selected per speaker and each utterance was cut into 1.5-second clips; the training set contains 41909 spectrograms and the test set contains 10472 spectrograms. Experimental environment: hardware platform: GPU: NVIDIA GTX 1080; RAM: 32 GB; video memory: 16 GB; operating system: Windows. The experiments are based on the PyTorch framework. The specific experimental procedure of the voiceprint recognition method using DenseNet-LSTM-ED based on the global attention mechanism is as follows:
Step S1: the speech signal is divided into frames, windowed, Fourier-transformed, converted to an energy spectral density (energy density spectrum), log-transformed, and color-mapped to obtain its corresponding spectrogram. The spectrogram contains the different time-frequency components and their energy information; its timing feature sequence represents the different speech contents, and its personal characteristic information is used to identify different speakers.
Step S2: the spectrogram obtained in step S1 is taken as input and fed into the DenseNet module for spatial feature extraction. The DenseNet module is a dense connection mechanism in which each layer receives the outputs of all preceding layers as its input. The dense connection mechanism facilitates gradient back-propagation during training, allows a deeper feature-extraction network to be trained, and reuses features through the dimension information of the dense connections, with fewer parameters than a residual (ResNet) network. A DenseNet module of t layers contains t(t+1)/2 connections, so t feature maps of different layers are obtained, and these feature maps contain rich spatial characteristic information.
Step S3: the features of the spectrogram obtained in step S1 are copied and used as the input of the LSTM units. The LSTM unit determines, through the forget gate, which temporal feature information from the previous state needs to be retained; the input gate determines which of the currently input information is important; and the output gate determines the hidden-state information for the next moment. After t LSTM units, the final output containing the temporal feature relationships is obtained, together with t hidden states.
Step S4: the features of the spectrogram obtained in step S1 are copied and used as the input of the ED module. The ED module performs deconvolution processing of the spectrogram information, trend information processing of the spectrogram, and convolution of the fused deconvolved information and trend-processed information to generate the enhancement information ED.
Step S5: first, after step S2, t feature maps of different layers containing spatial information are obtained; after step S3, t hidden-layer states are obtained through the t LSTM units, each containing important temporal feature information. The information of the t hidden states and the t feature maps is spliced to obtain the spatio-temporal information, and the enhancement information ED of the ED module is added to obtain the spatio-temporal enhancement information. The global attention mechanism then processes the spatio-temporal enhancement information, concentrating attention on the effective information through probability-distribution weights and outputting the global attention weight distribution p. A total loss function is then formed by combining the Softmax loss function and the Center Loss function, and the speaker's voiceprint is classified and recognized with the total loss function; the repeated recognition steps are not described again here. Training and classification are performed under joint supervision of the Softmax loss function and the Center Loss function: the Softmax function alone can separate different classes but does not consider the compactness of features within a class, while the center loss minimizes intra-class distance but does not consider separability between classes; combining the two improves intra-class compactness and inter-class separability and achieves high-precision voiceprint recognition.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
Those of skill in the art will appreciate that the various operations, methods, steps in the flow, acts, schemes, and alternatives discussed in the present application may be alternated, altered, combined, or eliminated. Further, other steps, means, or steps in a process having various operations, methods, or procedures discussed herein may be alternated, altered, rearranged, disassembled, combined, or eliminated. Further, steps, measures, schemes in the prior art with various operations, methods, flows disclosed in the present application may also be alternated, altered, rearranged, decomposed, combined, or deleted.
The above examples merely represent a few implementations of the disclosed embodiments, which are described in more detail and are not to be construed as limiting the scope of the disclosed embodiments. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made to the disclosed embodiments without departing from the spirit of the disclosed embodiments. Accordingly, the scope of the disclosed embodiments should be determined from the following claims.

Claims (9)

1. A DenseNet-LSTM-ED voiceprint recognition method based on a global attention mechanism, comprising the following steps:
S100: obtaining the spectrogram corresponding to the voice signal by dividing the voice signal into frames, windowing, Fourier transform, energy density spectrum calculation, logarithmic transform, and color mapping;
S200: taking the spectrogram obtained in step S100 as input to a DenseNet module for spatial feature extraction, obtaining the spatial information of the voice signal;
S300: copying the information of the spectrogram obtained in step S100 and sending it to LSTM units, so that after passing through t LSTM units the timing information of the voice signal is fully extracted;
S400: copying the information of the spectrogram obtained in step S100 and sending it to an ED module, the ED module performing deconvolution processing of the spectrogram information, trend information processing of the spectrogram information, fusion of the deconvolved information and the trend-processed information, and convolution of the fused result to generate enhancement information;
S500: splicing the processing results of the DenseNet module and the LSTM unit to form spatio-temporal fusion information, fusing the spatio-temporal fusion information with the enhancement information ED to form spatio-temporal enhancement information, assigning different weights to the spatio-temporal enhancement information with an attention mechanism, combining a Softmax loss function and a Center Loss function into a total loss function, and identifying the voiceprint category with the total loss function.
2. The method of claim 1, wherein step S200 comprises: the DenseNet module comprises one initial convolution, N dense connection modules (Dense Blocks), and a plurality of transmission layers (Transitions); a Dense Block contains the feature maps x_0, x_1, ..., x_{l-1}, x_l, where x_0, x_1, ..., x_{l-1}, x_l are the feature maps of layer 0, layer 1, ..., layer l-1, and layer l; the feature map of layer l is obtained by concatenating the feature maps of all preceding layers, applying the nonlinear transformation H_l(·) to obtain the spliced feature information H_l([x_0, x_1, ..., x_{l-1}]), and passing this spliced feature information through the feature mapping of the activation function γ(·):
x_l = γ(H_l([x_0, x_1, ..., x_{l-1}]))
where γ(·) denotes the activation function, and λ_1 and λ_2 are non-integer multiplier factors.
3. The method of claim 1, wherein step S400 comprises:
step S401: deconvolution processing of the spectrogram information:
O_t = s_1·(a_1 − 1) + k_1 − 2·p_1
where a_1 is the matrix of spectrogram pixels; s_1 is the stride by which the convolution kernel is shifted; k_1 is the size of the convolution kernel; p_1 is the first padding matrix when the kernel size does not match the size of the spectrogram matrix a_1, and p_1 is 0 when they match; O_t is the information matrix after deconvolution;
step S402: trend information processing of the spectrogram information:
in the matrix of spectrogram pixels, the trend information at each pixel is obtained by a numerical calculation over the k periods around its position coordinates (i, g), the trend information x̃_(i,g) at coordinates (i, g) being computed from the original information x_(i,g) at (i, g) and the original information x_(i,g+j) at the neighbouring coordinates (i, g+j), with j ∈ (−k, k) and k a positive integer;
applying this calculation at every pixel position of the spectrogram yields the trend-processed information matrix D_t = [x̃_(i,g)]_{n×m}, where n and m are positive integers giving the number of rows and columns of the spectrogram matrix;
step S403: fusing the deconvolved information and the trend-processed information:
the information matrix obtained after deconvolution is O_t and the information matrix obtained by trend information processing is D_t; the two are fused into the information fusion matrix OD:
OD = β·O_t + r·D_t
where β is the balance parameter of the deconvolved information matrix O_t, and r is a trend scaling factor that controls the contribution of the trend-processed information matrix D_t;
step S404: convolving the fused information matrix OD to generate the enhancement information, the information fusion matrix OD being taken as the input of the convolution and features being extracted from it:
ED = (OD − k_2 + 2·p_2)/s_2 + 1
where ED is the enhancement information; k_2 is the size of the convolution kernel; s_2 is the stride of the convolution kernel; p_2 is the second padding matrix when the kernel size does not match the size of the information fusion matrix OD, and p_2 is 0 when they match.
4. The method according to claim 3, wherein step S500 comprises:
S501: splicing the processing results of the DenseNet module and the LSTM unit to form spatio-temporal fusion information, and fusing the spatio-temporal fusion information with the enhancement information to form spatio-temporal enhancement information;
S502: applying the weights assigned by the attention mechanism to the spatio-temporal enhancement information, identifying the voiceprint category using the key-frame speech, and predicting the voiceprint category;
S503: combining the Softmax loss function with the Center Loss function to form a total loss function, and computing the difference between the true voiceprint category and the predicted voiceprint category with the total loss function to obtain a loss value;
S504: judging whether the loss value equals a preset value; if so, recognition of the voiceprint category is complete; if not, proceeding to step S505;
S505: assigning new weights to the spatio-temporal enhancement information with the attention mechanism, identifying the voiceprint category again using the key-frame speech, and predicting the voiceprint category again;
S506: combining the Softmax loss function with the Center Loss function again to form the total loss function, and computing the difference between the true voiceprint category and the newly predicted voiceprint category to obtain a new loss value;
S507: judging whether the new loss value equals the preset value; if so, recognition of the voiceprint category is complete; if not, repeating steps S505 to S507 until the loss value equals the preset value and the voiceprint classification is complete.
5. The method according to claim 4, wherein the spatio-temporal fusion information and the enhancement information are added to obtain the spatio-temporal enhancement information M, the spatio-temporal enhancement information M being composed of several sub-spatio-temporal enhancement vectors m_1, m_2, ..., m_t, ..., m_s of the same dimension, where t and s are positive integers, and a global attention mechanism is provided that assigns different weights to the spatio-temporal enhancement information M as follows:
computing the similarity α_ti between the sub-spatio-temporal enhancement vector m_t and the sub-spatio-temporal enhancement vector m_i:
α_ti = exp(score(m_t, m_i)) / Σ_{i'=1..s} exp(score(m_t, m_{i'}))
with the corresponding scoring function
score(m_t, m_i) = m_t^T·m_i
and then obtaining the vector C_t by weighted averaging:
C_t = Σ_{i=1..s} α_ti·m_i
6. The method of claim 5, wherein the sub-spatio-temporal enhancement vector m_t and the vector C_t are concatenated head to tail and multiplied by the weight matrix W_c of the attention mechanism, and the spatio-temporal enhancement information based on the global attention mechanism is computed as
M̃_t = tanh(W_c·[C_t; m_t])
where tanh is the hyperbolic tangent function;
finally, the attention weight values are normalized with a softmax classification layer:
p = softmax(W_t·M̃_t)
where p is the probability; normalization restricts the values to the interval [0, 1], so the numbers in this range can represent probabilities; W_t is the feature weight corresponding to M̃_t.
7. The method of claim 1, wherein step S500 comprises: the total loss function is as follows:
L = √(L_s + λ·L_c)
where L denotes the total loss function, L_s denotes the Softmax loss function, L_c denotes the Center Loss function, and λ is a factor used to balance the two loss functions.
8. The method of claim 7, wherein the Softmax loss function is as follows:
L_s = − Σ_{i=1..m} log( exp(W_{y_i}^T·x_i + b_{y_i}) / Σ_{j=1..n} exp(W_j^T·x_i + b_j) )
where x_i denotes the i-th feature and y_i is the true class label of x_i; W_{y_i} and W_j denote the weight vectors for classifying x_i into class y_i and into class j, respectively; b_{y_i} and b_j denote the bias terms of class y_i and class j; m denotes the size of the mini-batch; n denotes the number of classes.
9. The method of claim 8, wherein the Center Loss function is as follows:
L_c = (1/2)·Σ_{i=1..m} ‖x_i − c_{y_i}‖²
where c_{y_i} denotes the class center of the features of class y_i, x_i denotes the i-th feature, and m denotes the size of the mini-batch.
CN202310826924.6A — filed 2023-07-07 (priority 2023-07-07) — Voiceprint recognition method based on global attention mechanism and adopting DenseNet-LSTM-ED — Pending — published as CN116863939A (en)

Priority Applications (1)

CN202310826924.6A — priority date 2023-07-07 — filing date 2023-07-07 — Voiceprint recognition method based on global attention mechanism and adopting DenseNet-LSTM-ED

Publications (1)

CN116863939A — publication date 2023-10-10

Family

ID=88226306

Country Status (1)

CN (1): CN116863939A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117292209A (en) * 2023-11-27 2023-12-26 之江实验室 Video classification method and device based on space-time enhanced three-dimensional attention re-parameterization
CN117292209B (en) * 2023-11-27 2024-04-05 之江实验室 Video classification method and device based on space-time enhanced three-dimensional attention re-parameterization
CN117598711A (en) * 2024-01-24 2024-02-27 中南大学 QRS complex detection method, device, equipment and medium for electrocardiosignal
CN117598711B (en) * 2024-01-24 2024-04-26 中南大学 QRS complex detection method, device, equipment and medium for electrocardiosignal


Legal Events

Date Code Title Description
PB01 Publication