CN117497003A - Transformer voiceprint recognition method based on channel attention mechanism and AH-Softmax - Google Patents

Transformer voiceprint recognition method based on channel attention mechanism and AH-Softmax

Info

Publication number
CN117497003A
Authority
CN
China
Prior art keywords
voiceprint
frame
mel
formula
transformer
Prior art date
Legal status
Pending
Application number
CN202311622037.3A
Other languages
Chinese (zh)
Inventor
汪兆冉
黄文礼
杨建旭
张可
吴国元
韩俊宝
晏雨晴
侯仕杰
程晗
Current Assignee
Anhui Nanrui Jiyuan Power Grid Technology Co., Ltd.
Original Assignee
Anhui Nanrui Jiyuan Power Grid Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Anhui Nanrui Jiyuan Power Grid Technology Co., Ltd.
Priority to CN202311622037.3A
Publication of CN117497003A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L17/04: Speaker identification or verification techniques; training, enrolment or model building
    • G10L17/18: Speaker identification or verification techniques; artificial neural networks; connectionist approaches
    • G10L25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to a transformer voiceprint recognition method based on a channel attention mechanism and AH-Softmax, which comprises the following steps: acquiring audio data of any power transformer in real time, and obtaining the binary-format voiceprint data and the sampling frequency of the audio data; preprocessing the binary-format voiceprint data; extracting features from the voiceprint data in the training set; improving the VGG19 model to obtain an improved VGG19 model; training the improved VGG19 model with the voiceprint feature training set, inputting the audio data of the power transformer to be identified into the trained VGG19 model, and identifying the voiceprint type of the power transformer. The improved VGG19 model discriminates the features of each channel more effectively, which strengthens its ability to distinguish transformer voiceprints; it also emphasizes the information content of transformer voiceprint samples and can dynamically distinguish samples carrying different amounts of information, improving the recognition capability and accuracy of the improved VGG19 model on transformer voiceprint signals.

Description

Transformer voiceprint recognition method based on channel attention mechanism and AH-Softmax
Technical Field
The invention relates to the technical field of power equipment, in particular to a transformer voiceprint recognition method based on a channel attention mechanism and AH-Softmax.
Background
The power transformer is one of the most critical devices in the power system, undertaking demanding tasks such as voltage conversion and electric energy transmission, so ensuring its normal and stable operation is of great significance to the safe operation of the whole power system. When different faults occur inside a transformer, the sound signals it emits contain very rich state information and can reflect its operating condition to a large extent. With the rapid development of artificial intelligence, fault diagnosis of the mechanical state of transformer equipment by means of voiceprint recognition technology has become a new research hotspot.
Algorithms such as the BP neural network, the SVM support vector machine, the CNN model and the LSTM model cannot meet the speed and accuracy requirements of modern power systems, and under the challenge of multiple factors the shortcomings of these traditional recognition methods become even more pronounced. Specifically, transformer voiceprints are mainly divided into normal and abnormal voiceprint signals: the normal signals include the normal operating sound alone and the normal operating sound mixed with human voices, bird calls, rain and similar ambient sounds, while the abnormal signals include breakdown, magnetic-bias operating conditions, short-circuit impact, partial discharge and other conditions; the number of transformer voiceprint types is therefore large. At present the traditional recognition methods mainly suffer from low speed and low accuracy when recognizing transformer voiceprints. On the one hand, conventional recognition models usually downsample while extracting features, so the number of channels increases sharply as the height and width of the image shrink; among these many channels the information carried varies widely, with some channels strongly correlated with the transformer voiceprint, some weakly correlated, and some hardly correlated at all. The weakly correlated and uncorrelated channels reduce the accuracy of transformer voiceprint recognition and lengthen the recognition time. On the other hand, the classification function of a traditional recognition model usually assumes pure, noise-free samples, but in practical applications it is difficult to remove external noise during sample collection; these noisy samples have a negative influence on training and reduce the classification accuracy of the recognition model on transformer voiceprints.
Disclosure of Invention
In order to solve the problems of low recognition speed and low recognition accuracy of the voiceprint of the transformer, the invention aims to provide a method for recognizing the voiceprint of the transformer based on a channel attention mechanism and AH-Softmax, which can rapidly and accurately recognize different types of voiceprint signals of the transformer.
In order to achieve the above purpose, the present invention adopts the following technical scheme: a transformer voiceprint recognition method based on a channel attention mechanism and AH-Softmax, the method comprising the sequential steps of:
(1) Utilizing r voiceprint acquisition sensors to acquire audio data of any power transformer in real time, wherein each acquired audio data corresponds to a file address; acquiring voiceprint data in a binary format of audio data and sampling frequency;
(2) Preprocessing voiceprint data in a binary format to obtain preprocessed voiceprint data omega (n), forming a data set, and dividing the data set into a training set and a verification set according to a proportion;
(3) Characteristic extraction is carried out on voiceprint data in the training set by utilizing a Mel Frequency Cepstrum Coefficient (MFCC) to obtain voiceprint characteristics and form a voiceprint characteristic training set;
(4) Improving the VGG19 model to obtain an improved VGG19 model;
(5) Training the improved VGG19 model by utilizing the voiceprint feature training set to obtain a trained VGG19 model, and verifying the trained VGG19 model through the verification set;
(6) Inputting the audio data of the power transformer to be identified into the trained VGG19 model, and identifying the voiceprint type of the power transformer.
The step (2) specifically comprises the following steps:
(2a) Denoising: removing a mute part and a noise part of voiceprint data in a binary format by using an endpoint detection method based on the audio amplitude to obtain a residual part;
(2b) Pre-emphasis: the loss of the residual part is compensated by pre-emphasis; the input voiceprint data in binary format is denoted S(n), and the audio after passing through the first-order FIR filter is given by formula (1):
In formula (1), a is a constant, taken as 0.96;
(2c) Framing: dividing the audio into N sections of voice signals with fixed sizes by utilizing a framing method, wherein each section of voice signal is a frame, and the frame length frame takes 25ms; the frame division adopts an overlapping segmentation method, the overlapping part of the previous frame and the next frame is frame shift, and the ratio m of the frame shift to the frame length is 0.5; framing a speech signal of length N as shown in equation (2):
the data is divided into n frames, each frame f n The position of (2) is [ m ] frame (n-1), m ] frame (n-1) +frame]If the last frame is at the end (frame + (n-1) +frame)>N, filling the excess part with 0;
(2d) Windowing: each frame is brought into a window function, the window function selects a Hamming window, and the expression of the Hamming window is shown in a formula (3):
wherein R is M (n) is voiceprint data of the nth frame, and ω (n) is voiceprint data after preprocessing.
The step (3) specifically comprises the following steps:
(3a) The preprocessed voiceprint data ω(n) is decomposed into two sub-signals using the fast Fourier transform (FFT): an even-sample-point signal and an odd-sample-point signal, so that the discrete Fourier transform (DFT) of ω(n) is equivalent to the sum of two terms of length N/2, one over the even sample points and one over the odd sample points. The computational cost is as follows: for each k the DFT performs N multiplications, i.e. N² multiplication operations in total; for each k it performs N−1 additions, i.e. N(N−1) addition operations in total; the FFT requires only N(log₂N − 1) multiplication operations and N·log₂N addition operations;
(3b) The following operations are performed with a mel filter bank:
(3b1) The lowest frequency of the voiceprint data processed in step (3a) is determined to be 0 Hz and the highest frequency to be f_s; the number M of Mel filters is 23;
(3b2) Converting the lowest frequency and the highest frequency into respective mel scales low_mel and high_mel respectively;
(3b3) The distance d_mel between the center mel frequencies of two adjacent Mel filters is calculated as shown in formula (4):
where high_mel is the mel scale of the highest frequency, low_mel is the mel scale of the lowest frequency, and M is the number of Mel filters;
(3c) The spectrum of the speech signal after passing through the Mel filter bank is obtained using the logarithm operation, as shown in formula (6):
where H(k) is the high-frequency spectrum function and E(k) is the low-frequency spectrum function;
considering only the amplitude, as shown in formula (7):
taking the logarithm of both sides, as shown in formula (8):
and then taking the inverse Fourier transform of both sides to obtain the cepstrum, as shown in formula (9):
(3d) The length of the speech signal is doubled to 2N using the DCT; to make the enlarged signal symmetric about 0, the whole extended signal is shifted to the right by 0.5 units, and the final DCT transform is expressed as formula (10):
where N is the length of the speech signal, the transform is taken over the xth spectral values, the normalization coefficient takes one value for u = 0 and another value otherwise, as given in formula (10), and u is the generalized frequency;
(3e) The dynamic and static characteristics of the spectrum after the DCT are combined to improve the recognition performance of the system; the calculation formula of the spectral difference parameter is:
where d_t denotes the tth first-order difference, i.e. the voiceprint feature, C_t denotes the tth cepstral coefficient, Q denotes the order of the cepstral coefficients, and K denotes the time difference of the first derivative.
The step (4) specifically refers to:
(4a) Adding a channel attention mechanism module into the VGG19 model:
the VGG19 model consists of 2 convolution layers of 64x3x3, 2 convolution layers of 128x3x3, 5 maximum pooling layers of 2x2, 8 convolution layers of 512x3x3, 3 full connection layers and a classification function Softmax; the channel attention mechanism module consists of a global pooling layer of 1x1x512, a full-connection layer of 1x1x64, an activation function layer of 1x1x64, a full-connection layer of 1x1x512 and a logistic regression layer of 1x1x 512; the channel attention mechanism module is added between the last 512x3x3 convolution layer and the last 2x2 max pooling layer of the VGG19 model;
(4b) The original classification function Softmax in the VGG19 model is replaced by a new classification function AH-Softmax, shown in formula (16):
where d(·) is the sample weight indicator function, p_l is the probability of sample class l, p_j is the probability of sample class j, θ_j,l is the angle between w_j and sample x, θ_l,l is the angle between w_l and sample x, s = ||w_j||·||x||, f(m, θ_l,l) = cos(m, θ_l,l), L_j = d(p_j) − 1, w_j is the optimization angle of sample class j, m is the margin of the boundary loss function, w_l is the optimization angle of sample class l, and J is the total number of sample classes;
h(t, θ_j,l, L_j) ≥ 1 is a re-weighting function for emphasizing the weights of different power transformer voiceprint samples; it takes the following two forms, shown in formula (17) and formula (18):
h(t, θ_j,l, L_j) = exp(s·t·L_j)    (17)
h(t, θ_j,l, L_j) = exp(s·t·(cos(θ_j,l) + 1)·L_j)    (18)
where exp(s·t·L_j) is the fixed weight function and exp(s·t·(cos(θ_j,l) + 1)·L_j) is the adaptive weight function.
According to the above technical scheme, the beneficial effects of the invention are as follows. First, a channel attention mechanism module is added to the original VGG19 model to obtain an improved VGG19 model; the improved model discriminates the features of each channel more effectively, learns the relations among channels and their importance during training, and thus distinguishes transformer voiceprints better. Second, the Softmax classification function of the original VGG19 model is replaced by the new classification function AH-Softmax, which uses the weight indicator function distribution as a clue to estimate transformer voiceprint sample labels, emphasizes the information content of transformer voiceprint samples, dynamically distinguishes samples carrying different amounts of information, explicitly emphasizes the informative vectors in the transformer voiceprint samples, and at the same time exploits the discriminability among different transformer voiceprint categories to guide discriminative feature learning. Through these two improvements, the recognition capability and accuracy of the improved VGG19 model for transformer voiceprint signals are increased.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
fig. 2 is a network structure diagram of the improved VGG19 model.
Detailed Description
As shown in fig. 1, a transformer voiceprint recognition method based on a channel attention mechanism and AH-Softmax, the method comprises the following sequential steps:
(1) Utilizing r voiceprint acquisition sensors to acquire audio data of any power transformer in real time, wherein each acquired audio data corresponds to a file address; acquiring voiceprint data in a binary format of audio data and sampling frequency;
(2) Preprocessing voiceprint data in a binary format to obtain preprocessed voiceprint data omega (n), forming a data set, and dividing the data set into a training set and a verification set according to a proportion;
(3) Characteristic extraction is carried out on voiceprint data in the training set by utilizing a Mel Frequency Cepstrum Coefficient (MFCC) to obtain voiceprint characteristics and form a voiceprint characteristic training set;
(4) Improving the VGG19 model to obtain an improved VGG19 model;
(5) Training the improved VGG19 model by utilizing the voiceprint feature training set to obtain a trained VGG19 model, and verifying the trained VGG19 model through the verification set;
(6) Inputting the audio data of the power transformer to be identified into the trained VGG19 model, and identifying the voiceprint type of the power transformer.
The step (2) specifically comprises the following steps:
(2a) Denoising: removing a mute part and a noise part of voiceprint data in a binary format by using an endpoint detection method based on the audio amplitude to obtain a residual part;
(2b) Pre-emphasis: the loss of the residual part is compensated by pre-emphasis; the input voiceprint data in binary format is denoted S(n), and the audio after passing through the first-order FIR filter is given by formula (1):
In formula (1), a is a constant, taken as 0.96;
the pre-emphasis is used for adding a zero point to counteract the high-end spectrum amplitude drop caused by glottal pulse, so that the signal spectrum is flattened and the resonance amplitudes are close, only the influence in the sound channel is left in the voice, and the extracted characteristics are more in accordance with the model of the meta-sound channel. A first-order FIR filter is inserted, high frequency is improved, meanwhile, a low frequency part is attenuated, when some fundamental frequencies are assigned to be larger, interference of the fundamental frequencies to formant detection is reduced through pre-emphasis, and meanwhile, the dynamic range of a frequency spectrum is reduced.
(2c) Framing: dividing the audio into N sections of voice signals with fixed sizes by utilizing a framing method, wherein each section of voice signal is a frame, and the frame length frame takes 25ms; the frame division adopts an overlapping segmentation method, the overlapping part of the previous frame and the next frame is frame shift, and the ratio m of the frame shift to the frame length is 0.5; framing a speech signal of length N as shown in equation (2):
the data is divided into n frames, each frame f n The position of (2) is [ m ] frame (n-1), m ] frame (n-1) +frame]If the last frame is at the end (frame + (n-1) +frame)>N, filling the excess part with 0;
(2d) Windowing: each frame is brought into a window function, the window function selects a Hamming window, and the expression of the Hamming window is shown in a formula (3):
wherein R is M (n) is voiceprint data of the nth frame, and ω (n) is voiceprint data after preprocessing.
Windowing achieves a smooth transition between frames and maintains continuity, i.e. it removes the signal discontinuities that truncation may cause at both ends of each frame. Because truncation causes energy leakage in the frequency domain, a window function is required to reduce its effects; each frame is therefore multiplied by a window function, and the Hamming window is selected.
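A minimal sketch of the framing and Hamming windowing of steps (2c) and (2d), assuming a 25 ms frame length, a frame-shift-to-frame-length ratio of 0.5 and zero padding of the last frame as described above; the function name and the stand-in signal are illustrative.

    import numpy as np

    def frame_and_window(y: np.ndarray, fs: int, frame_ms: float = 25.0,
                         shift_ratio: float = 0.5) -> np.ndarray:
        """Split the signal into overlapping frames and apply a Hamming window.

        frame_ms    -- frame length in milliseconds (25 ms in the text)
        shift_ratio -- frame shift / frame length (m = 0.5 in the text)
        """
        frame_len = int(round(fs * frame_ms / 1000.0))        # samples per frame
        hop = int(round(frame_len * shift_ratio))             # frame shift in samples
        n_frames = int(np.ceil(max(len(y) - frame_len, 0) / hop)) + 1

        # Zero-pad so the last frame is complete, as described in the text.
        padded = np.zeros((n_frames - 1) * hop + frame_len)
        padded[:len(y)] = y

        window = np.hamming(frame_len)                         # Hamming window
        frames = np.stack([padded[i * hop: i * hop + frame_len] * window
                           for i in range(n_frames)])
        return frames  # shape: (n_frames, frame_len)

    # Usage with one second of stand-in audio at 16 kHz.
    fs = 16000
    y = np.random.randn(fs)
    frames = frame_and_window(y, fs)
    print(frames.shape)  # (79, 400)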
The step (3) specifically comprises the following steps:
(3a) The preprocessed voiceprint data ω(n) is decomposed into two sub-signals using the fast Fourier transform (FFT): an even-sample-point signal and an odd-sample-point signal, so that the discrete Fourier transform (DFT) of ω(n) is equivalent to the sum of two terms of length N/2, one over the even sample points and one over the odd sample points. The computational cost is as follows: for each k the DFT performs N multiplications, i.e. N² multiplication operations in total; for each k it performs N−1 additions, i.e. N(N−1) addition operations in total; the FFT requires only N(log₂N − 1) multiplication operations and N·log₂N addition operations;
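The even/odd decomposition described in step (3a) can be checked numerically: the sketch below computes an N-point DFT from the two N/2-point DFTs of the even- and odd-indexed samples (the decimation-in-time identity underlying the FFT). It uses numpy's FFT for the half-length transforms and is only an illustration of the decomposition, not the patent's own implementation.

    import numpy as np

    def dft_via_even_odd(x: np.ndarray) -> np.ndarray:
        """N-point DFT built from the two N/2-point DFTs of the even- and
        odd-indexed samples (decimation-in-time split used by the FFT)."""
        N = len(x)
        assert N % 2 == 0, "length must be even for a single split"
        E = np.fft.fft(x[0::2])                    # DFT of the even sample points
        O = np.fft.fft(x[1::2])                    # DFT of the odd sample points
        k = np.arange(N // 2)
        twiddle = np.exp(-2j * np.pi * k / N)      # twiddle factors W_N^k
        return np.concatenate([E + twiddle * O,    # X[k],        k = 0..N/2-1
                               E - twiddle * O])   # X[k + N/2]

    x = np.random.randn(512)
    assert np.allclose(dft_via_even_odd(x), np.fft.fft(x))  # identical results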
(3b) The following operations are performed with a mel filter bank:
(3b1) The lowest frequency of the voiceprint data processed in step (3a) is determined to be 0 Hz and the highest frequency to be f_s; the number M of Mel filters is 23;
(3b2) Converting the lowest frequency and the highest frequency into respective mel scales low_mel and high_mel respectively;
(3b3) The distance d_mel between the center mel frequencies of two adjacent Mel filters is calculated as shown in formula (4):
where high_mel is the mel scale of the highest frequency, low_mel is the mel scale of the lowest frequency, and M is the number of Mel filters;
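A short sketch of the mel-scale conversion and filter spacing of step (3b). Since formula (4) is not reproduced above, this assumes the common conventions mel(f) = 2595·log10(1 + f/700) and that M triangular filters need M + 2 equally spaced mel points, so adjacent centers lie (high_mel − low_mel)/(M + 1) apart; both conventions are assumptions, and the names are illustrative.

    import numpy as np

    def hz_to_mel(f_hz: float) -> float:
        """Common (HTK-style) Hz-to-mel conversion: 2595 * log10(1 + f/700)."""
        return 2595.0 * np.log10(1.0 + f_hz / 700.0)

    def mel_filter_centers(f_low: float, f_high: float, n_filters: int = 23) -> np.ndarray:
        """Center mel frequencies of n_filters triangular mel filters.

        With M filters, M + 2 equally spaced mel points are needed (edges plus
        centers), so adjacent centers are (high_mel - low_mel) / (M + 1) apart.
        """
        low_mel, high_mel = hz_to_mel(f_low), hz_to_mel(f_high)
        d_mel = (high_mel - low_mel) / (n_filters + 1)
        return low_mel + d_mel * np.arange(1, n_filters + 1)

    centers = mel_filter_centers(0.0, 8000.0, n_filters=23)
    print(len(centers), centers[:3])  # 23 center mel frequencies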
(3c) The spectrum of the speech signal after passing through the Mel filter bank is obtained using the logarithm operation, as shown in formula (6):
where H(k) is the high-frequency spectrum function and E(k) is the low-frequency spectrum function;
considering only the amplitude, as shown in formula (7):
taking the logarithm of both sides, as shown in formula (8):
and then taking the inverse Fourier transform of both sides to obtain the cepstrum, as shown in formula (9):
(3d) The length of the speech signal is doubled to 2N using the DCT; to make the enlarged signal symmetric about 0, the whole extended signal is shifted to the right by 0.5 units, and the final DCT transform is expressed as formula (10):
where N is the length of the speech signal, the transform is taken over the xth spectral values, the normalization coefficient takes one value for u = 0 and another value otherwise, as given in formula (10), and u is the generalized frequency;
(3e) The dynamic and static characteristics of the spectrum after the DCT are combined to improve the recognition performance of the system; the calculation formula of the spectral difference parameter is:
where d_t denotes the tth first-order difference, i.e. the voiceprint feature, C_t denotes the tth cepstral coefficient, Q denotes the order of the cepstral coefficients, and K denotes the time difference of the first derivative.
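A sketch of the remaining feature-extraction stages of step (3): the magnitude spectrum through a mel filter bank, the logarithm, an orthonormal DCT-II, and the first-order difference of step (3e). The filter-bank matrix, the number of cepstral coefficients and K = 2 are illustrative assumptions, and the stand-in inputs only demonstrate the array shapes.

    import numpy as np
    from scipy.fft import dct

    def log_mel_to_mfcc(frames: np.ndarray, mel_fb: np.ndarray, n_ceps: int = 13) -> np.ndarray:
        """Power spectrum -> mel filter bank -> log -> DCT-II (orthonormal).

        frames : windowed time-domain frames, shape (n_frames, frame_len)
        mel_fb : mel filter bank matrix, shape (n_mels, n_fft // 2 + 1)
        """
        n_fft = frames.shape[1]
        power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2    # amplitude only
        mel_energy = power @ mel_fb.T                                 # apply filter bank
        log_mel = np.log(mel_energy + 1e-10)                          # logarithm
        return dct(log_mel, type=2, norm='ortho', axis=1)[:, :n_ceps] # cepstral coefficients

    def delta(ceps: np.ndarray, K: int = 2) -> np.ndarray:
        """First-order difference d_t = sum_k k*(C_{t+k} - C_{t-k}) / (2*sum_k k^2),
        the dynamic feature combined with the static cepstrum in step (3e)."""
        padded = np.pad(ceps, ((K, K), (0, 0)), mode='edge')
        num = sum(k * (padded[K + k:len(ceps) + K + k] - padded[K - k:len(ceps) + K - k])
                  for k in range(1, K + 1))
        return num / (2 * sum(k * k for k in range(1, K + 1)))

    frames = np.random.randn(79, 400)                    # stand-in windowed frames
    mel_fb = np.abs(np.random.randn(23, 400 // 2 + 1))   # stand-in 23-filter bank
    mfcc = log_mel_to_mfcc(frames, mel_fb)
    d1 = delta(mfcc)
    print(mfcc.shape, d1.shape)                          # (79, 13) (79, 13)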
The step (4) specifically refers to:
(4a) Adding a channel attention mechanism module into the VGG19 model:
As shown in fig. 2, the VGG19 model consists of 2 convolution layers of 64x3x3, 2 convolution layers of 128x3x3, 5 max pooling layers of 2x2, 8 convolution layers of 512x3x3, 3 fully connected layers and a Softmax classification function; the channel attention mechanism module consists of a 1x1x512 global pooling layer, a 1x1x64 fully connected layer, a 1x1x64 activation function layer, a 1x1x512 fully connected layer and a 1x1x512 logistic regression layer; the channel attention mechanism module is inserted between the last 512x3x3 convolution layer and the last 2x2 max pooling layer of the VGG19 model;
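A minimal PyTorch sketch of a channel attention module with the layer sizes listed above (1x1x512 global pooling, 1x1x64 fully connected layer, activation, 1x1x512 fully connected layer, and a sigmoid as the logistic regression layer), in the style of a squeeze-and-excitation block; the class name is illustrative and the surrounding VGG19 layers are not reproduced here.

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        """Channel attention: global pooling (1x1x512) -> FC (1x1x64) -> activation
        -> FC (1x1x512) -> sigmoid, re-weighting the 512 channels.

        Intended to sit between the last 512x3x3 convolution layer and the
        last 2x2 max pooling layer of the VGG19 backbone described above.
        """
        def __init__(self, channels: int = 512, reduced: int = 64):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)             # global pooling -> 1x1xC
            self.fc1 = nn.Linear(channels, reduced)         # 1x1x64 fully connected
            self.act = nn.ReLU(inplace=True)                # activation layer
            self.fc2 = nn.Linear(reduced, channels)         # 1x1x512 fully connected
            self.gate = nn.Sigmoid()                        # logistic regression layer

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, c, _, _ = x.shape
            w = self.pool(x).view(b, c)                     # squeeze: (B, C)
            w = self.gate(self.fc2(self.act(self.fc1(w))))  # excitation: channel weights
            return x * w.view(b, c, 1, 1)                   # re-weight each channel

    feat = torch.randn(2, 512, 14, 14)                      # hypothetical conv feature map
    out = ChannelAttention()(feat)
    print(out.shape)                                        # torch.Size([2, 512, 14, 14])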
(4b) The original classification function Softmax in the VGG19 model is replaced by a new classification function AH-Softmax, shown in formula (16):
where d(·) is the sample weight indicator function, p_l is the probability of sample class l, p_j is the probability of sample class j, θ_j,l is the angle between w_j and sample x, θ_l,l is the angle between w_l and sample x, s = ||w_j||·||x||, f(m, θ_l,l) = cos(m, θ_l,l), L_j = d(p_j) − 1, w_j is the optimization angle of sample class j, m is the margin of the boundary loss function, w_l is the optimization angle of sample class l, and J is the total number of sample classes;
h(t, θ_j,l, L_j) ≥ 1 is a re-weighting function for emphasizing the weights of different power transformer voiceprint samples; it takes the following two forms, shown in formula (17) and formula (18):
h(t, θ_j,l, L_j) = exp(s·t·L_j)    (17)
h(t, θ_j,l, L_j) = exp(s·t·(cos(θ_j,l) + 1)·L_j)    (18)
where exp(s·t·L_j) is the fixed weight function and exp(s·t·(cos(θ_j,l) + 1)·L_j) is the adaptive weight function.
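Formula (16) itself is not reproduced above, so the sketch below only implements the two re-weighting functions of formulas (17) and (18); the meanings of s, t, θ_j,l and L_j follow the definitions given above, and the demo values are purely illustrative.

    import torch

    def fixed_reweight(s: float, t: float, L_j: torch.Tensor) -> torch.Tensor:
        """Fixed re-weighting function of formula (17): h = exp(s * t * L_j)."""
        return torch.exp(s * t * L_j)

    def adaptive_reweight(s: float, t: float, theta_jl: torch.Tensor,
                          L_j: torch.Tensor) -> torch.Tensor:
        """Adaptive re-weighting function of formula (18):
        h = exp(s * t * (cos(theta_jl) + 1) * L_j)."""
        return torch.exp(s * t * (torch.cos(theta_jl) + 1.0) * L_j)

    # Hypothetical values: s plays the role of ||w_j||*||x||, t a modulation
    # parameter, theta_jl the angle between w_j and the sample x, L_j = d(p_j) - 1.
    s, t = 30.0, 0.2
    theta_jl = torch.tensor([0.3, 1.2, 2.0])
    L_j = torch.tensor([0.0, 1.0, 1.0])
    print(fixed_reweight(s, t, L_j))
    print(adaptive_reweight(s, t, theta_jl, L_j))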
In summary, the channel attention mechanism module is added to the original VGG19 model to obtain an improved VGG19 model; the improved model discriminates the features of each channel more effectively, learns the relations among channels and their importance during training, and thus distinguishes transformer voiceprints better. The Softmax classification function of the original VGG19 model is replaced by the new classification function AH-Softmax, which uses the weight indicator function distribution as a clue to estimate transformer voiceprint sample labels, emphasizes the information content of transformer voiceprint samples, dynamically distinguishes samples carrying different amounts of information, explicitly emphasizes the informative vectors in the transformer voiceprint samples, and at the same time exploits the discriminability among different transformer voiceprint categories to guide discriminative feature learning. Through these two improvements, the recognition capability and accuracy of the improved VGG19 model for transformer voiceprint signals can be increased.

Claims (4)

1. A transformer voiceprint recognition method based on a channel attention mechanism and AH-Softmax, characterized in that the method comprises the following steps in sequence:
(1) Utilizing r voiceprint acquisition sensors to acquire audio data of any power transformer in real time, wherein each acquired audio data corresponds to a file address; acquiring voiceprint data in a binary format of audio data and sampling frequency;
(2) Preprocessing voiceprint data in a binary format to obtain preprocessed voiceprint data omega (n), forming a data set, and dividing the data set into a training set and a verification set according to a proportion;
(3) Characteristic extraction is carried out on voiceprint data in the training set by utilizing a Mel Frequency Cepstrum Coefficient (MFCC) to obtain voiceprint characteristics and form a voiceprint characteristic training set;
(4) Improving the VGG19 model to obtain an improved VGG19 model;
(5) Training the improved VGG19 model by utilizing the voiceprint feature training set to obtain a trained VGG19 model, and verifying the trained VGG19 model through the verification set;
(6) Inputting the audio data of the power transformer to be identified into the trained VGG19 model, and identifying the voiceprint type of the power transformer.
2. The transformer voiceprint recognition method based on channel attention mechanism and AH-Softmax of claim 1, wherein: the step (2) specifically comprises the following steps:
(2a) Denoising: removing a mute part and a noise part of voiceprint data in a binary format by using an endpoint detection method based on the audio amplitude to obtain a residual part;
(2b) Pre-emphasis: the loss of the residual part is compensated by pre-emphasis; the input voiceprint data in binary format is denoted S(n), and the audio after passing through the first-order FIR filter is given by formula (1):
In formula (1), a is a constant, taken as 0.96;
(2c) Framing: dividing the audio into N sections of voice signals with fixed sizes by utilizing a framing method, wherein each section of voice signal is a frame, and the frame length frame takes 25ms; the frame division adopts an overlapping segmentation method, the overlapping part of the previous frame and the next frame is frame shift, and the ratio m of the frame shift to the frame length is 0.5; framing a speech signal of length N as shown in equation (2):
the data is divided into n frames, each frame f n The position of (2) is [ m ] frame (n-1), m ] frame (n-1) +frame]If the last frame is at the end (frame + (n-1) +frame)>N, filling the excess part with 0;
(2d) Windowing: each frame is brought into a window function, the window function selects a Hamming window, and the expression of the Hamming window is shown in a formula (3):
wherein R is M (n) is voiceprint data of the nth frame, and ω (n) is voiceprint data after preprocessing.
3. The transformer voiceprint recognition method based on channel attention mechanism and AH-Softmax of claim 1, wherein: the step (3) specifically comprises the following steps:
(3a) The preprocessed voiceprint data ω(n) is decomposed into two sub-signals using the fast Fourier transform (FFT): an even-sample-point signal and an odd-sample-point signal, so that the discrete Fourier transform (DFT) of ω(n) is equivalent to the sum of two terms of length N/2, one over the even sample points and one over the odd sample points. The computational cost is as follows: for each k the DFT performs N multiplications, i.e. N² multiplication operations in total; for each k it performs N−1 additions, i.e. N(N−1) addition operations in total; the FFT requires only N(log₂N − 1) multiplication operations and N·log₂N addition operations;
(3b) The following operations are performed with a mel filter bank:
(3b1) The lowest frequency of the voiceprint data processed in step (3a) is determined to be 0 Hz and the highest frequency to be f_s; the number M of Mel filters is 23;
(3b2) Converting the lowest frequency and the highest frequency into respective mel scales low_mel and high_mel respectively;
(3b3) The distance d_mel between the center mel frequencies of two adjacent Mel filters is calculated as shown in formula (4):
where high_mel is the mel scale of the highest frequency, low_mel is the mel scale of the lowest frequency, and M is the number of Mel filters;
(3c) The spectrum of the speech signal after passing through the Mel filter bank is obtained using the logarithm operation, as shown in formula (6):
where H(k) is the high-frequency spectrum function and E(k) is the low-frequency spectrum function;
considering only the amplitude, as shown in formula (7):
taking the logarithm of both sides, as shown in formula (8):
and then taking the inverse Fourier transform of both sides to obtain the cepstrum, as shown in formula (9):
(3d) The length of the speech signal is doubled to 2N using the DCT; to make the enlarged signal symmetric about 0, the whole extended signal is shifted to the right by 0.5 units, and the final DCT transform is expressed as formula (10):
where N is the length of the speech signal, the transform is taken over the xth spectral values, the normalization coefficient takes one value for u = 0 and another value otherwise, as given in formula (10), and u is the generalized frequency;
(3e) The dynamic and static characteristics of the spectrum after the DCT are combined to improve the recognition performance of the system; the calculation formula of the spectral difference parameter is:
where d_t denotes the tth first-order difference, i.e. the voiceprint feature, C_t denotes the tth cepstral coefficient, Q denotes the order of the cepstral coefficients, and K denotes the time difference of the first derivative.
4. The transformer voiceprint recognition method based on channel attention mechanism and AH-Softmax of claim 1, wherein: the step (4) specifically refers to:
(4a) Adding a channel attention mechanism module into the VGG19 model:
the VGG19 model consists of 2 convolution layers of 64x3x3, 2 convolution layers of 128x3x3, 5 max pooling layers of 2x2, 8 convolution layers of 512x3x3, 3 fully connected layers and a Softmax classification function; the channel attention mechanism module consists of a 1x1x512 global pooling layer, a 1x1x64 fully connected layer, a 1x1x64 activation function layer, a 1x1x512 fully connected layer and a 1x1x512 logistic regression layer; the channel attention mechanism module is inserted between the last 512x3x3 convolution layer and the last 2x2 max pooling layer of the VGG19 model;
(4b) The original classification function Softmax in the VGG19 model is replaced by a new classification function AH-Softmax, shown in formula (16):
where d(·) is the sample weight indicator function, p_l is the probability of sample class l, p_j is the probability of sample class j, θ_j,l is the angle between w_j and sample x, θ_l,l is the angle between w_l and sample x, s = ||w_j||·||x||, f(m, θ_l,l) = cos(m, θ_l,l), L_j = d(p_j) − 1, w_j is the optimization angle of sample class j, m is the margin of the boundary loss function, w_l is the optimization angle of sample class l, and J is the total number of sample classes;
h(t, θ_j,l, L_j) ≥ 1 is a re-weighting function for emphasizing the weights of different power transformer voiceprint samples; it takes the following two forms, shown in formula (17) and formula (18):
h(t, θ_j,l, L_j) = exp(s·t·L_j)    (17)
h(t, θ_j,l, L_j) = exp(s·t·(cos(θ_j,l) + 1)·L_j)    (18)
where exp(s·t·L_j) is the fixed weight function and exp(s·t·(cos(θ_j,l) + 1)·L_j) is the adaptive weight function.
CN202311622037.3A 2023-11-30 2023-11-30 Transformer voiceprint recognition method based on channel attention mechanism and AH-Softmax Pending CN117497003A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311622037.3A CN117497003A (en) 2023-11-30 2023-11-30 Transformer voiceprint recognition method based on channel attention mechanism and AH-Softmax

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311622037.3A CN117497003A (en) 2023-11-30 2023-11-30 Transformer voiceprint recognition method based on channel attention mechanism and AH-Softmax

Publications (1)

Publication Number Publication Date
CN117497003A true CN117497003A (en) 2024-02-02

Family

ID=89684949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311622037.3A Pending CN117497003A (en) 2023-11-30 2023-11-30 Transformer voiceprint recognition method based on channel attention mechanism and AH-Softmax

Country Status (1)

Country Link
CN (1) CN117497003A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination