CN117497003A - Transformer voiceprint recognition method based on channel attention mechanism and AH-Softmax
- Publication number: CN117497003A
- Application number: CN202311622037.3A
- Authority: CN (China)
- Prior art keywords: voiceprint, frame, mel, formula, transformer
- Prior art date: 2023-11-30
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for particular use, for comparison or discrimination
- G10L17/04: Speaker identification or verification techniques; training, enrolment or model building
- G10L17/18: Speaker identification or verification techniques; artificial neural networks; connectionist approaches
- G10L25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks
Abstract
The invention relates to a transformer voiceprint recognition method based on a channel attention mechanism and AH-Softmax, which comprises the following steps: acquiring audio data of any power transformer in real time, and obtaining the binary-format voiceprint data and the sampling frequency of the audio data; preprocessing the binary-format voiceprint data; extracting features from the voiceprint data in the training set; improving the VGG19 model to obtain an improved VGG19 model; training the improved VGG19 model with the voiceprint feature training set, inputting the audio data of the power transformer to be identified into the trained VGG19 model, and identifying the voiceprint type of the power transformer. The improved VGG19 model discriminates the features of each channel more effectively, which strengthens its ability to distinguish transformer voiceprints; it also emphasizes the information content of transformer voiceprint samples and can dynamically distinguish samples carrying different amounts of information, thereby improving the recognition rate and accuracy of the improved VGG19 model for transformer voiceprint signals.
Description
Technical Field
The invention relates to the technical field of power equipment, in particular to a transformer voiceprint recognition method based on a channel attention mechanism and AH-Softmax.
Background
As one of the most critical devices in the power system, the power transformer undertakes demanding tasks such as voltage conversion and electric energy transmission, so ensuring its normal and stable operation is of great significance for the safe operation of the whole power system. When different faults occur in a transformer, the different sounds it emits contain very rich state information and can reflect its operating state to a large extent. With the rapid development of artificial intelligence, fault diagnosis of the mechanical state of transformer equipment using voiceprint recognition technology has become a new research hotspot.
Algorithms such as the BP neural network, the support vector machine (SVM), the CNN model and the LSTM model cannot meet the requirements of a modern power system in terms of speed and accuracy, and under the challenge of multiple factors the shortcomings of these conventional recognition methods are further highlighted. Specifically, transformer voiceprint types are mainly divided into normal voiceprint signals and abnormal voiceprint signals: normal voiceprint signals include normal working sound as well as normal working sound mixed with human voice, bird sound, rain sound and the like, while abnormal voiceprint signals include breakdown, biased-magnetization working conditions, short-circuit impact, partial discharge and the like, so the transformer has a large number of voiceprint types. Conventional recognition methods currently suffer from low speed and low accuracy in transformer voiceprint recognition. On the one hand, conventional recognition models usually apply downsampling when extracting features; while the length and width of the feature map are reduced, the number of channels is greatly increased, and these channels carry many kinds of information: some channels are highly correlated with the transformer voiceprint, some weakly, and some hardly at all. The channels with low or no correlation with the transformer voiceprint reduce the accuracy of the recognition model and prolong the recognition time. On the other hand, the classification function of a conventional recognition model usually assumes pure, noise-free samples, but in practical applications it is difficult to remove external noise during sample collection; such samples negatively affect training and therefore reduce the classification accuracy of the recognition model for transformer voiceprints.
Disclosure of Invention
In order to solve the problems of low recognition speed and low recognition accuracy for transformer voiceprints, the invention aims to provide a transformer voiceprint recognition method based on a channel attention mechanism and AH-Softmax, which can rapidly and accurately recognize different types of transformer voiceprint signals.
In order to achieve the above purpose, the present invention adopts the following technical scheme: a transformer voiceprint recognition method based on a channel attention mechanism and AH-Softmax, the method comprising the sequential steps of:
(1) Utilizing r voiceprint acquisition sensors to acquire audio data of any power transformer in real time, wherein each acquired audio data corresponds to a file address; acquiring voiceprint data in a binary format of audio data and sampling frequency;
(2) Preprocessing voiceprint data in a binary format to obtain preprocessed voiceprint data omega (n), forming a data set, and dividing the data set into a training set and a verification set according to a proportion;
(3) Feature extraction is performed on the voiceprint data in the training set using Mel-frequency cepstral coefficients (MFCC) to obtain voiceprint features and form a voiceprint feature training set;
(4) Improving the VGG19 model to obtain an improved VGG19 model;
(5) Training the improved VGG19 model by utilizing the voiceprint feature training set to obtain a trained VGG19 model, and verifying the trained VGG19 model through the verification set;
(6) The audio data of the power transformer to be identified is input into the trained VGG19 model, and the voiceprint type of the power transformer is identified.
The step (2) specifically comprises the following steps:
(2a) Denoising: removing a mute part and a noise part of voiceprint data in a binary format by using an endpoint detection method based on the audio amplitude to obtain a residual part;
(2b) Pre-emphasis: the loss of the remaining part is compensated by a pre-emphasis method; the input binary-format voiceprint data is denoted S(n), and the audio after the first-order FIR filter is denoted S'(n), as shown in formula (1):
S'(n) = S(n) - a·S(n-1) (1)
in formula (1), a is a constant, taken as 0.96;
(2c) Framing: the audio is divided into fixed-size speech segments by a framing method, each segment being one frame, with a frame length frame of 25 ms; framing uses overlapping segmentation, the overlap between the previous frame and the next frame is the frame shift, and the ratio m of the frame shift to the frame length is 0.5; a speech signal of length N is framed as shown in formula (2):
the data is divided into n frames, and the position of each frame f_n is [m·frame·(n-1), m·frame·(n-1)+frame]; if the end of the last frame, m·frame·(n-1)+frame, exceeds N, the excess part is padded with zeros;
(2d) Windowing: each frame is multiplied by a window function; the Hamming window is selected, and its expression is shown in formula (3):
ω(n) = R_M(n)·[0.54 - 0.46·cos(2πn/(M-1))], 0 ≤ n ≤ M-1 (3)
where R_M(n) is the voiceprint data of the nth frame and ω(n) is the preprocessed voiceprint data.
The step (3) specifically comprises the following steps:
(3a) The preprocessed voiceprint data ω(n) is decomposed with the fast Fourier transform (FFT) into two sub-signals, an even-sample-point signal and an odd-sample-point signal, so that its discrete Fourier transform (DFT) is equivalent to the sum of two terms of length N/2. The computational cost is as follows: the DFT performs N multiplications for each k, N^2 multiplications in total, and N-1 additions for each k, N(N-1) additions in total; the FFT needs only N(log2 N - 1) multiplications and N·log2 N additions in total;
(3b) The following operations are performed with a mel filter bank:
(3b1) Determining that the lowest frequency of the voiceprint data processed in step (3a) is 0 Hz and the highest frequency is f_s, and setting the number M of mel filters to 23;
(3b2) Converting the lowest frequency and the highest frequency into their respective mel scales low_mel and high_mel;
(3b3) Calculating the distance d_mel between the center mel frequencies of two adjacent mel filters, as shown in formula (4):
d_mel = (high_mel - low_mel) / (M + 1) (4)
where high_mel is the mel scale of the highest frequency, low_mel is the mel scale of the lowest frequency, and M is the number of mel filters;
(3c) Using a logarithmic operation, the spectrum X(k) of the speech signal after the mel filter bank is obtained, as shown in formula (6):
X(k) = H(k)·E(k) (6)
where H(k) is the high-frequency spectrum function and E(k) is the low-frequency spectrum function;
considering only the amplitude, as shown in formula (7):
|X(k)| = |H(k)|·|E(k)| (7)
taking the logarithm of both sides, as shown in formula (8):
log|X(k)| = log|H(k)| + log|E(k)| (8)
and then taking the inverse Fourier transform of both sides yields the cepstrum, as shown in formula (9):
c(n) = IDFT(log|X(k)|) = IDFT(log|H(k)|) + IDFT(log|E(k)|) (9)
(3d) The DCT extends the length of the speech signal to twice the original, i.e. 2N; to make the extended signal symmetric about 0, the whole extended signal is shifted right by 0.5 units, and the final DCT transform is expressed as formula (10):
F(u) = c(u) · Σ_{x=0..N-1} f(x)·cos[(2x+1)·u·π / (2N)] (10)
where N is the length of the speech signal, f(x) is the xth spectrum value, c(u) = sqrt(1/N) when u = 0 and c(u) = sqrt(2/N) otherwise, and u is the generalized frequency;
(3e) The dynamic and static characteristics of the spectrum after the DCT are combined to improve the recognition performance of the system; the calculation formula of the spectral differential parameter is:
d_t = [ Σ_{k=1..K} k·(C_{t+k} - C_{t-k}) ] / [ 2·Σ_{k=1..K} k^2 ]
where d_t denotes the tth first-order difference, i.e. the voiceprint feature, C_t denotes the tth cepstral coefficient, Q denotes the order of the cepstral coefficients, and K denotes the time span of the first-order derivative.
The step (4) specifically refers to:
(4a) Adding a channel attention mechanism module into the VGG19 model:
The VGG19 model consists of 2 convolution layers of 64x3x3, 2 convolution layers of 128x3x3, 5 maximum pooling layers of 2x2, 8 convolution layers of 512x3x3, 3 fully connected layers and a Softmax classification function; the channel attention mechanism module consists of a 1x1x512 global pooling layer, a 1x1x64 fully connected layer, a 1x1x64 activation function layer, a 1x1x512 fully connected layer and a 1x1x512 logistic regression layer; the channel attention mechanism module is added between the last 512x3x3 convolution layer and the last 2x2 max pooling layer of the VGG19 model;
(4b) The original classification function Softmax in the VGG19 model is replaced by a new classification function AH-Softmax, which is shown in the formula (16):
where d(·) is the sample weight indication function, p_l is the probability of sample class l, p_j is the probability of sample class j, θ_(j,l) is the angle between w_j and sample x, θ_(l,l) is the angle between w_l and sample x, s = ||w_j||·||x||, f(m, θ_(l,l)) = cos(m, θ_(l,l)), L_j = d(p_j) - 1, w_j is the weight vector to be optimized for sample class j, m is the margin of the boundary loss function, w_l is the weight vector to be optimized for sample class l, and J is the total number of sample classes;
h(t, θ_(j,l), L_j) is a re-weighting function used to emphasize the weights of different power transformer voiceprint samples; it takes the following two forms, as shown in formula (17) and formula (18):
h(t, θ_(j,l), L_j) = exp(s·t·L_j) (17)
h(t, θ_(j,l), L_j) = exp(s·t·(cos(θ_(j,l)) + 1)·L_j) (18)
In the formulas, exp(s·t·L_j) is a fixed weight function and exp(s·t·(cos(θ_(j,l)) + 1)·L_j) is an adaptive weight function.
According to the above technical scheme, the beneficial effects of the invention are as follows. First, a channel attention mechanism module is added to the original VGG19 model to obtain an improved VGG19 model; the improved model discriminates the features of each channel more effectively, learns the relations among channels and their relative importance during training, and thus distinguishes transformer voiceprints better. Second, the classification function Softmax of the original VGG19 model is replaced with a new classification function AH-Softmax, which uses the weight indication function distribution as a cue to estimate the transformer voiceprint sample labels, emphasizes the information content of the transformer voiceprint samples, dynamically distinguishes samples carrying different amounts of information, explicitly emphasizes the informative vectors in the transformer voiceprint samples, and at the same time exploits the discriminability among different transformer voiceprint categories to guide discriminative feature learning. Through these two improvements, the recognition rate and accuracy of the improved VGG19 model for transformer voiceprint signals are improved.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
fig. 2 is a network structure diagram of the improved VGG19 model.
Detailed Description
As shown in fig. 1, a transformer voiceprint recognition method based on a channel attention mechanism and AH-Softmax, the method comprises the following sequential steps:
(1) Utilizing r voiceprint acquisition sensors to acquire audio data of any power transformer in real time, wherein each acquired audio data corresponds to a file address; acquiring voiceprint data in a binary format of audio data and sampling frequency;
(2) Preprocessing voiceprint data in a binary format to obtain preprocessed voiceprint data omega (n), forming a data set, and dividing the data set into a training set and a verification set according to a proportion;
(3) Feature extraction is performed on the voiceprint data in the training set using Mel-frequency cepstral coefficients (MFCC) to obtain voiceprint features and form a voiceprint feature training set;
(4) Improving the VGG19 model to obtain an improved VGG19 model;
(5) Training the improved VGG19 model by utilizing the voiceprint feature training set to obtain a trained VGG19 model, and verifying the trained VGG19 model through the verification set;
(6) The audio data of the power transformer to be identified is input into the trained VGG19 model, and the voiceprint type of the power transformer is identified.
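For steps (1) and (2), the following Python sketch illustrates one possible way to read the binary-format voiceprint data and the sampling frequency from a recorded audio file and to divide the resulting data set into a training set and a verification set by a proportion. The 16-bit PCM assumption, the 80/20 split ratio and the helper names are illustrative assumptions, not part of the patent.

```python
import wave
import numpy as np

def read_voiceprint(path):
    """Read one recorded audio file: return voiceprint samples and the sampling frequency."""
    with wave.open(path, "rb") as wav:
        fs = wav.getframerate()                     # sampling frequency
        raw = wav.readframes(wav.getnframes())      # voiceprint data in binary format
    # assumes 16-bit PCM encoding of the recording
    samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32)
    return samples, fs

def split_dataset(samples_list, labels, ratio=0.8, seed=0):
    """Divide the data set into a training set and a verification set by the given proportion."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples_list))
    cut = int(len(idx) * ratio)
    train = [(samples_list[i], labels[i]) for i in idx[:cut]]
    val = [(samples_list[i], labels[i]) for i in idx[cut:]]
    return train, val
```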
The step (2) specifically comprises the following steps:
(2a) Denoising: removing a mute part and a noise part of voiceprint data in a binary format by using an endpoint detection method based on the audio amplitude to obtain a residual part;
(2b) Pre-emphasis: the loss of the remaining part is compensated by a pre-emphasis method; the input binary-format voiceprint data is denoted S(n), and the audio after the first-order FIR filter is denoted S'(n), as shown in formula (1):
S'(n) = S(n) - a·S(n-1) (1)
in formula (1), a is a constant, taken as 0.96;
Pre-emphasis adds a zero to cancel the high-frequency spectral fall-off caused by the glottal pulse, so that the signal spectrum is flattened and the formant amplitudes become close, leaving mainly the influence of the vocal tract in the speech and making the extracted features conform better to the vocal tract model. A first-order FIR filter is inserted to boost the high frequencies while attenuating the low-frequency part; when the fundamental frequency component is large, pre-emphasis reduces its interference with formant detection and at the same time reduces the dynamic range of the spectrum.
(2c) Framing: the audio is divided into fixed-size speech segments by a framing method, each segment being one frame, with a frame length frame of 25 ms; framing uses overlapping segmentation, the overlap between the previous frame and the next frame is the frame shift, and the ratio m of the frame shift to the frame length is 0.5; a speech signal of length N is framed as shown in formula (2):
the data is divided into n frames, and the position of each frame f_n is [m·frame·(n-1), m·frame·(n-1)+frame]; if the end of the last frame, m·frame·(n-1)+frame, exceeds N, the excess part is padded with zeros;
(2d) Windowing: each frame is multiplied by a window function; the Hamming window is selected, and its expression is shown in formula (3):
ω(n) = R_M(n)·[0.54 - 0.46·cos(2πn/(M-1))], 0 ≤ n ≤ M-1 (3)
where R_M(n) is the voiceprint data of the nth frame and ω(n) is the preprocessed voiceprint data.
Windowing achieves a smooth transition between frames and maintains continuity, that is, it eliminates the signal discontinuities that may arise at the two ends of each frame. Because truncation causes energy leakage in the frequency domain, a window function is needed to reduce the effect of truncation; therefore each frame is multiplied by a window function, and the Hamming window is selected here.
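The preprocessing of step (2) can be sketched in Python as follows, using the parameters given above (a = 0.96, 25 ms frame length, frame-shift ratio m = 0.5, Hamming window). The simple amplitude-threshold endpoint detection and the exact padding behaviour are illustrative assumptions rather than the patent's exact procedure.

```python
import numpy as np

def preprocess(samples, fs, a=0.96, frame_ms=25, m=0.5, silence_thresh=0.01):
    """Denoise (amplitude-based endpoint detection), pre-emphasise, frame and window the signal."""
    # (2a) crude endpoint detection: keep samples whose normalised amplitude exceeds a threshold
    x = samples / (np.max(np.abs(samples)) + 1e-12)
    x = x[np.abs(x) > silence_thresh]

    # (2b) pre-emphasis with a first-order FIR filter: S'(n) = S(n) - a * S(n-1)
    x = np.append(x[0], x[1:] - a * x[:-1])

    # (2c) framing: 25 ms frames, frame shift = m * frame length, zero-pad the last frame
    frame_len = int(fs * frame_ms / 1000)
    shift = int(frame_len * m)
    n_frames = max(1, int(np.ceil((len(x) - frame_len) / shift)) + 1)
    padded = np.append(x, np.zeros(n_frames * shift + frame_len - len(x)))
    frames = np.stack([padded[i * shift: i * shift + frame_len] for i in range(n_frames)])

    # (2d) windowing: multiply every frame by a Hamming window
    return frames * np.hamming(frame_len)
```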
The step (3) specifically comprises the following steps:
(3a) The preprocessed voiceprint data ω(n) is decomposed with the fast Fourier transform (FFT) into two sub-signals, an even-sample-point signal and an odd-sample-point signal, so that its discrete Fourier transform (DFT) is equivalent to the sum of two terms of length N/2. The computational cost is as follows: the DFT performs N multiplications for each k, N^2 multiplications in total, and N-1 additions for each k, N(N-1) additions in total; the FFT needs only N(log2 N - 1) multiplications and N·log2 N additions in total;
(3b) The following operations are performed with a mel filter bank:
(3b1) Determining that the lowest frequency of the voiceprint data processed in step (3a) is 0 Hz and the highest frequency is f_s, and setting the number M of mel filters to 23;
(3b2) Converting the lowest frequency and the highest frequency into their respective mel scales low_mel and high_mel;
(3b3) Calculating the distance d_mel between the center mel frequencies of two adjacent mel filters, as shown in formula (4):
d_mel = (high_mel - low_mel) / (M + 1) (4)
where high_mel is the mel scale of the highest frequency, low_mel is the mel scale of the lowest frequency, and M is the number of mel filters;
(3c) Using a logarithmic operation, the spectrum X(k) of the speech signal after the mel filter bank is obtained, as shown in formula (6):
X(k) = H(k)·E(k) (6)
where H(k) is the high-frequency spectrum function and E(k) is the low-frequency spectrum function;
considering only the amplitude, as shown in formula (7):
|X(k)| = |H(k)|·|E(k)| (7)
taking the logarithm of both sides, as shown in formula (8):
log|X(k)| = log|H(k)| + log|E(k)| (8)
and then taking the inverse Fourier transform of both sides yields the cepstrum, as shown in formula (9):
c(n) = IDFT(log|X(k)|) = IDFT(log|H(k)|) + IDFT(log|E(k)|) (9)
(3d) The DCT extends the length of the speech signal to twice the original, i.e. 2N; to make the extended signal symmetric about 0, the whole extended signal is shifted right by 0.5 units, and the final DCT transform is expressed as formula (10):
F(u) = c(u) · Σ_{x=0..N-1} f(x)·cos[(2x+1)·u·π / (2N)] (10)
where N is the length of the speech signal, f(x) is the xth spectrum value, c(u) = sqrt(1/N) when u = 0 and c(u) = sqrt(2/N) otherwise, and u is the generalized frequency;
(3e) The dynamic and static characteristics of the spectrum after the DCT are combined to improve the recognition performance of the system; the calculation formula of the spectral differential parameter is:
d_t = [ Σ_{k=1..K} k·(C_{t+k} - C_{t-k}) ] / [ 2·Σ_{k=1..K} k^2 ]
where d_t denotes the tth first-order difference, i.e. the voiceprint feature, C_t denotes the tth cepstral coefficient, Q denotes the order of the cepstral coefficients, and K denotes the time span of the first-order derivative.
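A minimal numpy sketch of the MFCC feature extraction of step (3) is given below (FFT, a 23-filter mel filter bank, logarithm, DCT, and the first-order differences of step (3e)). The mel-scale conversion mel(f) = 2595·log10(1 + f/700), the use of fs/2 as the upper frequency limit, the number of kept cepstral coefficients and the use of scipy's DCT-II are common conventions assumed here; they are not spelled out in the patent text.

```python
import numpy as np
from scipy.fftpack import dct

def mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frames, fs, n_filters=23, n_ceps=13, n_fft=512, K=2):
    """Per-frame MFCCs plus first-order differences for windowed frames of shape (T, frame_len)."""
    # (3a) FFT magnitude spectrum of every windowed frame
    spec = np.abs(np.fft.rfft(frames, n_fft))

    # (3b) triangular mel filter bank between 0 Hz and fs/2 (assumed upper limit),
    # with centre spacing d_mel = (high_mel - low_mel) / (M + 1)
    low_mel, high_mel = mel(0.0), mel(fs / 2.0)
    centers = inv_mel(np.linspace(low_mel, high_mel, n_filters + 2))
    bins = np.floor((n_fft + 1) * centers / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        fbank[i - 1, bins[i - 1]:bins[i]] = np.linspace(0, 1, bins[i] - bins[i - 1], endpoint=False)
        fbank[i - 1, bins[i]:bins[i + 1]] = np.linspace(1, 0, bins[i + 1] - bins[i], endpoint=False)

    # (3c) log mel spectrum, (3d) DCT to cepstral coefficients
    log_mel = np.log(spec @ fbank.T + 1e-12)
    ceps = dct(log_mel, type=2, norm="ortho", axis=1)[:, :n_ceps]

    # (3e) first-order difference d_t = sum_k k*(C_{t+k} - C_{t-k}) / (2 * sum_k k^2)
    pad = np.pad(ceps, ((K, K), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, K + 1))
    delta = sum(k * (pad[K + k:len(pad) - K + k] - pad[K - k:len(pad) - K - k])
                for k in range(1, K + 1)) / denom
    return np.hstack([ceps, delta])
```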
The step (4) specifically refers to:
(4a) Adding a channel attention mechanism module into the VGG19 model:
As shown in fig. 2, the VGG19 model consists of 2 convolution layers of 64x3x3, 2 convolution layers of 128x3x3, 5 maximum pooling layers of 2x2, 8 convolution layers of 512x3x3, 3 fully connected layers and a Softmax classification function; the channel attention mechanism module consists of a 1x1x512 global pooling layer, a 1x1x64 fully connected layer, a 1x1x64 activation function layer, a 1x1x512 fully connected layer and a 1x1x512 logistic regression layer; the channel attention mechanism module is added between the last 512x3x3 convolution layer and the last 2x2 max pooling layer of the VGG19 model;
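A possible PyTorch sketch of the channel attention module of step (4a) is shown below: global pooling, a 512-to-64 fully connected layer, an activation layer, a 64-to-512 fully connected layer and a sigmoid (logistic regression) layer, placed between a 512x3x3 convolution layer and a 2x2 max pooling layer. The choice of ReLU for the activation layer and the simplified stand-in for the tail of VGG19 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: 1x1x512 global pooling -> FC 1x1x64 -> activation -> FC 1x1x512 -> sigmoid."""
    def __init__(self, channels=512, reduced=64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # global pooling layer, one value per channel
        self.fc1 = nn.Linear(channels, reduced)  # fully connected layer, 1x1x64
        self.act = nn.ReLU(inplace=True)         # activation function layer
        self.fc2 = nn.Linear(reduced, channels)  # fully connected layer, 1x1x512
        self.gate = nn.Sigmoid()                 # logistic regression (sigmoid) layer

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)                        # squeeze each channel to a scalar
        w = self.gate(self.fc2(self.act(self.fc1(w))))     # per-channel importance weights
        return x * w.view(b, c, 1, 1)                      # re-weight the feature map channels

# Illustrative wiring: insert the module before the last 2x2 max pooling layer of a VGG19-style tail.
tail = nn.Sequential(
    nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),  # last 512x3x3 convolution layer
    ChannelAttention(512),                                      # inserted channel attention module
    nn.MaxPool2d(2),                                            # last 2x2 max pooling layer
)
```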
(4b) The original classification function Softmax in the VGG19 model is replaced by a new classification function AH-Softmax, which is shown in the formula (16):
where d(·) is the sample weight indication function, p_l is the probability of sample class l, p_j is the probability of sample class j, θ_(j,l) is the angle between w_j and sample x, θ_(l,l) is the angle between w_l and sample x, s = ||w_j||·||x||, f(m, θ_(l,l)) = cos(m, θ_(l,l)), L_j = d(p_j) - 1, w_j is the weight vector to be optimized for sample class j, m is the margin of the boundary loss function, w_l is the weight vector to be optimized for sample class l, and J is the total number of sample classes;
h(t, θ_(j,l), L_j) is a re-weighting function used to emphasize the weights of different power transformer voiceprint samples; it takes the following two forms, as shown in formula (17) and formula (18):
h(t, θ_(j,l), L_j) = exp(s·t·L_j) (17)
h(t, θ_(j,l), L_j) = exp(s·t·(cos(θ_(j,l)) + 1)·L_j) (18)
In the formulas, exp(s·t·L_j) is a fixed weight function and exp(s·t·(cos(θ_(j,l)) + 1)·L_j) is an adaptive weight function.
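The two re-weighting functions of formulas (17) and (18) can be sketched as follows. Because the full AH-Softmax expression of formula (16) is not reproduced here, the way L_j is derived from softmax probabilities, the treatment of d(·) as the identity, and the scalar values chosen for s and t are illustrative assumptions only.

```python
import torch

def fixed_weight(s, t, L_j):
    """Formula (17): h(t, theta_jl, L_j) = exp(s * t * L_j), the fixed weight function."""
    return torch.exp(s * t * L_j)

def adaptive_weight(s, t, L_j, cos_theta_jl):
    """Formula (18): h = exp(s * t * (cos(theta_jl) + 1) * L_j), the adaptive weight function."""
    return torch.exp(s * t * (cos_theta_jl + 1.0) * L_j)

# Example use: emphasise voiceprint samples according to L_j = d(p_j) - 1,
# with p_j the predicted probability of class j and d(.) taken as the identity for illustration.
logits = torch.randn(4, 6)                       # 4 samples, 6 voiceprint classes (illustrative)
p = torch.softmax(logits, dim=1)
L = p - 1.0                                      # L_j = d(p_j) - 1 under the identity assumption
cos_theta = torch.nn.functional.cosine_similarity(
    torch.randn(4, 6, 128), torch.randn(1, 6, 128), dim=2)  # stand-in for cos(theta_j,l)
w_fixed = fixed_weight(s=30.0, t=0.2, L_j=L)
w_adaptive = adaptive_weight(s=30.0, t=0.2, L_j=L, cos_theta_jl=cos_theta)
```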
In summary, a channel attention mechanism module is added to the original VGG19 model to obtain an improved VGG19 model; the improved model discriminates the features of each channel more effectively, learns the relations among channels and their relative importance during training, and thus distinguishes transformer voiceprints better. The classification function Softmax of the original VGG19 model is replaced with a new classification function AH-Softmax, which uses the weight indication function distribution as a cue to estimate the transformer voiceprint sample labels, emphasizes the information content of the transformer voiceprint samples, dynamically distinguishes samples carrying different amounts of information, explicitly emphasizes the informative vectors in the transformer voiceprint samples, and at the same time exploits the discriminability among different transformer voiceprint categories to guide discriminative feature learning. Through these two improvements, the recognition rate and accuracy of the improved VGG19 model for transformer voiceprint signals can be improved.
Claims (4)
1. A transformer voiceprint recognition method based on a channel attention mechanism and AH-Softmax, characterized in that the method comprises the following steps in sequence:
(1) Utilizing r voiceprint acquisition sensors to acquire audio data of any power transformer in real time, wherein each acquired audio data corresponds to a file address; acquiring voiceprint data in a binary format of audio data and sampling frequency;
(2) Preprocessing voiceprint data in a binary format to obtain preprocessed voiceprint data omega (n), forming a data set, and dividing the data set into a training set and a verification set according to a proportion;
(3) Feature extraction is performed on the voiceprint data in the training set using Mel-frequency cepstral coefficients (MFCC) to obtain voiceprint features and form a voiceprint feature training set;
(4) Improving the VGG19 model to obtain an improved VGG19 model;
(5) Training the improved VGG19 model by utilizing the voiceprint feature training set to obtain a trained VGG19 model, and verifying the trained VGG19 model through the verification set;
(6) The audio data of the power transformer to be identified is input into the trained VGG19 model, and the voiceprint type of the power transformer is identified.
2. The transformer voiceprint recognition method based on channel attention mechanism and AH-Softmax of claim 1, wherein: the step (2) specifically comprises the following steps:
(2a) Denoising: removing a mute part and a noise part of voiceprint data in a binary format by using an endpoint detection method based on the audio amplitude to obtain a residual part;
(2b) Pre-emphasis: the loss of the remaining part is compensated by a pre-emphasis method; the input binary-format voiceprint data is denoted S(n), and the audio after the first-order FIR filter is denoted S'(n), as shown in formula (1):
S'(n) = S(n) - a·S(n-1) (1)
in formula (1), a is a constant, taken as 0.96;
(2c) Framing: the audio is divided into fixed-size speech segments by a framing method, each segment being one frame, with a frame length frame of 25 ms; framing uses overlapping segmentation, the overlap between the previous frame and the next frame is the frame shift, and the ratio m of the frame shift to the frame length is 0.5; a speech signal of length N is framed as shown in formula (2):
the data is divided into n frames, and the position of each frame f_n is [m·frame·(n-1), m·frame·(n-1)+frame]; if the end of the last frame, m·frame·(n-1)+frame, exceeds N, the excess part is padded with zeros;
(2d) Windowing: each frame is multiplied by a window function; the Hamming window is selected, and its expression is shown in formula (3):
ω(n) = R_M(n)·[0.54 - 0.46·cos(2πn/(M-1))], 0 ≤ n ≤ M-1 (3)
where R_M(n) is the voiceprint data of the nth frame and ω(n) is the preprocessed voiceprint data.
3. The transformer voiceprint recognition method based on channel attention mechanism and AH-Softmax of claim 1, wherein: the step (3) specifically comprises the following steps:
(3a) The preprocessed voiceprint data ω(n) is decomposed with the fast Fourier transform (FFT) into two sub-signals, an even-sample-point signal and an odd-sample-point signal, so that its discrete Fourier transform (DFT) is equivalent to the sum of two terms of length N/2. The computational cost is as follows: the DFT performs N multiplications for each k, N^2 multiplications in total, and N-1 additions for each k, N(N-1) additions in total; the FFT needs only N(log2 N - 1) multiplications and N·log2 N additions in total;
(3b) The following operations are performed with a mel filter bank:
(3b1) Determining that the lowest frequency of the voiceprint data processed in step (3a) is 0 Hz and the highest frequency is f_s, and setting the number M of mel filters to 23;
(3b2) Converting the lowest frequency and the highest frequency into their respective mel scales low_mel and high_mel;
(3b3) Calculating the distance d_mel between the center mel frequencies of two adjacent mel filters, as shown in formula (4):
d_mel = (high_mel - low_mel) / (M + 1) (4)
where high_mel is the mel scale of the highest frequency, low_mel is the mel scale of the lowest frequency, and M is the number of mel filters;
(3c) Using a logarithmic operation, the spectrum X(k) of the speech signal after the mel filter bank is obtained, as shown in formula (6):
X(k) = H(k)·E(k) (6)
where H(k) is the high-frequency spectrum function and E(k) is the low-frequency spectrum function;
considering only the amplitude, as shown in formula (7):
|X(k)| = |H(k)|·|E(k)| (7)
taking the logarithm of both sides, as shown in formula (8):
log|X(k)| = log|H(k)| + log|E(k)| (8)
and then taking the inverse Fourier transform of both sides yields the cepstrum, as shown in formula (9):
c(n) = IDFT(log|X(k)|) = IDFT(log|H(k)|) + IDFT(log|E(k)|) (9)
(3d) The DCT extends the length of the speech signal to twice the original, i.e. 2N; to make the extended signal symmetric about 0, the whole extended signal is shifted right by 0.5 units, and the final DCT transform is expressed as formula (10):
F(u) = c(u) · Σ_{x=0..N-1} f(x)·cos[(2x+1)·u·π / (2N)] (10)
where N is the length of the speech signal, f(x) is the xth spectrum value, c(u) = sqrt(1/N) when u = 0 and c(u) = sqrt(2/N) otherwise, and u is the generalized frequency;
(3e) The dynamic and static characteristics of the spectrum after the DCT are combined to improve the recognition performance of the system; the calculation formula of the spectral differential parameter is:
d_t = [ Σ_{k=1..K} k·(C_{t+k} - C_{t-k}) ] / [ 2·Σ_{k=1..K} k^2 ]
where d_t denotes the tth first-order difference, i.e. the voiceprint feature, C_t denotes the tth cepstral coefficient, Q denotes the order of the cepstral coefficients, and K denotes the time span of the first-order derivative.
4. The transformer voiceprint recognition method based on channel attention mechanism and AH-Softmax of claim 1, wherein: the step (4) specifically refers to:
(4a) Adding a channel attention mechanism module into the VGG19 model:
The VGG19 model consists of 2 convolution layers of 64x3x3, 2 convolution layers of 128x3x3, 5 maximum pooling layers of 2x2, 8 convolution layers of 512x3x3, 3 fully connected layers and a Softmax classification function; the channel attention mechanism module consists of a 1x1x512 global pooling layer, a 1x1x64 fully connected layer, a 1x1x64 activation function layer, a 1x1x512 fully connected layer and a 1x1x512 logistic regression layer; the channel attention mechanism module is added between the last 512x3x3 convolution layer and the last 2x2 max pooling layer of the VGG19 model;
(4b) The original classification function Softmax in the VGG19 model is replaced by a new classification function AH-Softmax, which is shown in the formula (16):
where d(·) is the sample weight indication function, p_l is the probability of sample class l, p_j is the probability of sample class j, θ_(j,l) is the angle between w_j and sample x, θ_(l,l) is the angle between w_l and sample x, s = ||w_j||·||x||, f(m, θ_(l,l)) = cos(m, θ_(l,l)), L_j = d(p_j) - 1, w_j is the weight vector to be optimized for sample class j, m is the margin of the boundary loss function, w_l is the weight vector to be optimized for sample class l, and J is the total number of sample classes;
h(t, θ_(j,l), L_j) is a re-weighting function used to emphasize the weights of different power transformer voiceprint samples; it takes the following two forms, as shown in formula (17) and formula (18):
h(t, θ_(j,l), L_j) = exp(s·t·L_j) (17)
h(t, θ_(j,l), L_j) = exp(s·t·(cos(θ_(j,l)) + 1)·L_j) (18)
In the formulas, exp(s·t·L_j) is a fixed weight function and exp(s·t·(cos(θ_(j,l)) + 1)·L_j) is an adaptive weight function.
Priority Applications (1)
- Application CN202311622037.3A, priority date 2023-11-30, filing date 2023-11-30: Transformer voiceprint recognition method based on channel attention mechanism and AH-Softmax
Publications (1)
- Publication CN117497003A, publication date 2024-02-02
Family ID: 89684949
Family Applications (1)
- CN202311622037.3A, filed 2023-11-30, published as CN117497003A, status: pending
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination