CN114446308A - Multi-channel voiceprint recognition method, device and equipment based on Transformer framework - Google Patents
- Publication number: CN114446308A
- Application number: CN202111682904.3A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L17/02: Speaker identification or verification techniques; preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/04: Speaker identification or verification techniques; training, enrolment or model building
- G10L17/18: Speaker identification or verification techniques; artificial neural networks; connectionist approaches
- G10L25/18: Speech or voice analysis techniques not restricted to groups G10L15/00-G10L21/00, characterised by the extracted parameters being spectral information of each sub-band
- G10L25/45: Speech or voice analysis techniques not restricted to groups G10L15/00-G10L21/00, characterised by the type of analysis window
Abstract
The invention discloses a multi-channel voiceprint recognition method, device and equipment based on a Transformer framework. The method comprises the following steps: performing sound source decomposition on audio information to be identified and obtaining a three-channel spectrogram I through short-time Fourier transform; simultaneously transposing and then padding or truncating the same group of three-channel spectrograms I to obtain two groups of three-channel spectrograms II; and inputting the two groups of three-channel spectrograms II into an improved LeViT neural network model, which recognizes the voiceprint of the audio information and outputs voiceprint identification information comprising at least the speaker corresponding to the audio information. By processing the audio information into audio signals of a plurality of channels, the method achieves higher audio recognition accuracy; and the improved LeViT neural network model, with its increased model complexity, identifies the audio signal more accurately, greatly improving the accuracy.
Description
Technical Field
The invention relates to the technical field of sound detection, and in particular to a multi-channel voiceprint recognition method, device, equipment and storage medium based on a Transformer framework.
Background
At present, traditional methods for the voiceprint recognition task include the Gaussian Mixture Model-Universal Background Model (GMM-UBM), Support Vector Machine (SVM) classifiers built on GMM-UBM, and i-vector models. These conventional methods recognize poorly on large-scale data, and deep learning has therefore been introduced to process such data.
The networks proposed for this purpose are mainly Convolutional Neural Networks (CNNs), including single-channel and multi-channel CNNs, but the limited network complexity of CNNs also limits the final recognition accuracy.
In view of the above, it is necessary to provide further improvements to the current voiceprint recognition method.
Disclosure of Invention
Therefore, the present invention aims to remedy the deficiencies in the prior art at least to some extent, and accordingly provides a method, an apparatus, a device and a storage medium for multi-channel voiceprint recognition based on a Transformer framework.
In a first aspect, the present invention provides a Transformer framework-based multi-channel voiceprint recognition method, the method comprising:
performing sound source decomposition on audio information to be identified, and obtaining a three-channel spectrogram I through short-time Fourier transform;
simultaneously transposing and then respectively padding or truncating the same group of three-channel spectrograms I to obtain two groups of three-channel spectrograms II;
and inputting the two groups of three-channel spectrograms II into an improved LeViT neural network model, recognizing the voiceprint of the audio information with the improved LeViT neural network model, and outputting voiceprint identification information of the audio information, the voiceprint identification information comprising at least the speaker corresponding to the audio information.
In a second aspect, the present invention provides a multi-channel voiceprint recognition apparatus based on a Transformer framework, comprising:
a transformation module, configured to perform sound source decomposition on audio information to be identified and then obtain a three-channel spectrogram I through short-time Fourier transform;
a processing module, configured to simultaneously transpose and then pad or truncate the same group of three-channel spectrograms I to obtain two groups of three-channel spectrograms II;
an identification module, configured to input the two groups of three-channel spectrograms II into an improved LeViT neural network model, recognize the voiceprint of the audio information with the improved LeViT neural network model, and output voiceprint identification information of the audio information, the voiceprint identification information comprising at least the speaker corresponding to the audio information.
In a third aspect, the present invention further provides a voiceprint recognition terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the Transformer framework-based multi-channel voiceprint recognition method according to the first aspect.
In a fourth aspect, the present invention also provides a storage medium on which a computer program is stored, the computer program, when executed, implementing the steps of the Transformer framework-based multi-channel voiceprint recognition method according to the first aspect.
The invention provides a multi-channel voiceprint recognition method, device and equipment based on a Transformer framework. The method comprises the following steps: performing sound source decomposition on audio information to be identified and obtaining a three-channel spectrogram I through short-time Fourier transform; simultaneously transposing and then padding or truncating the same group of three-channel spectrograms I to obtain two groups of three-channel spectrograms II; and inputting the two groups of three-channel spectrograms II into an improved LeViT neural network model, recognizing the voiceprint of the audio information with the improved LeViT neural network model, and outputting voiceprint identification information comprising at least the speaker corresponding to the audio information. By processing the audio information into audio signals of a plurality of channels, the method achieves higher audio recognition accuracy; and the improved LeViT neural network model, with its increased complexity, identifies the audio signal more accurately, greatly improving the accuracy.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic flow chart of the Transformer framework-based multi-channel voiceprint recognition method according to the present invention;
FIG. 2 is a schematic sub-flow diagram of the Transformer framework-based multi-channel voiceprint recognition method of the present invention;
FIG. 3 is a schematic flow chart of the neural network in the Transformer framework-based multi-channel voiceprint recognition method according to the present invention;
FIG. 4 is a schematic diagram of another sub-flow of the Transformer framework-based multi-channel voiceprint recognition method of the present invention;
FIG. 5 is a schematic flow chart of the self-attention module of the Transformer framework-based multi-channel voiceprint recognition method according to the present invention;
FIG. 6 is a schematic diagram of another sub-flow of the Transformer framework-based multi-channel voiceprint recognition method according to the present invention;
FIG. 7 is a schematic diagram of another sub-flow of the Transformer framework-based multi-channel voiceprint recognition method of the present invention;
FIG. 8 is a schematic flow chart of the contraction self-attention module of the Transformer framework-based multi-channel voiceprint recognition method according to the present invention;
FIG. 9 is a schematic diagram of the program modules of the Transformer framework-based multi-channel voiceprint recognition apparatus according to the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic flow chart of a multi-channel voiceprint recognition method based on a Transformer framework in an embodiment of the present application. In this embodiment, the method includes:
Step 101, performing sound source decomposition on the audio information to be identified, and obtaining a three-channel spectrogram I through short-time Fourier transform.
In this embodiment, harmonic-percussive source separation (HPSS) in libROSA is first performed on the audio information to be identified to separate its harmonic and percussive components; the harmonic and percussive source signals are separated from the original audio information to obtain a three-channel audio signal, on which a short-time Fourier transform is then performed to obtain the three-channel spectrogram I. libROSA is a Python package for music and audio analysis.
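As a hedged illustration of this step, the HPSS decomposition can be sketched without libROSA using the standard median-filtering formulation on the STFT magnitude (harmonics are smooth along time, percussive events along frequency). The kernel size and the synthetic test tone below are illustrative assumptions, not values from the patent:

```python
import numpy as np
from scipy.signal import stft, istft, medfilt2d

def simple_hpss(y, fs=16000, nperseg=320, noverlap=160, kernel=17):
    """Toy harmonic-percussive source separation (median-filtering HPSS).

    A minimal stand-in for libROSA's HPSS: median-filter the STFT
    magnitude along time (harmonic) and along frequency (percussive),
    build soft masks, and invert each masked spectrogram.
    """
    _, _, Z = stft(y, fs=fs, window="hamming", nperseg=nperseg, noverlap=noverlap)
    S = np.abs(Z)
    H = medfilt2d(S, kernel_size=[1, kernel])   # smooth along the time axis
    P = medfilt2d(S, kernel_size=[kernel, 1])   # smooth along the frequency axis
    eps = 1e-10
    _, y_h = istft(Z * (H / (H + P + eps)), fs=fs, window="hamming",
                   nperseg=nperseg, noverlap=noverlap)
    _, y_p = istft(Z * (P / (H + P + eps)), fs=fs, window="hamming",
                   nperseg=nperseg, noverlap=noverlap)
    return y_h[:len(y)], y_p[:len(y)]

# The original waveform plus the two separated signals form the
# three-channel audio signal described above.
fs = 16000
y = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # 1 s test tone (assumption)
y_harm, y_perc = simple_hpss(y, fs=fs)
three_channel = np.stack([y, y_harm, y_perc])     # shape (3, n_samples)
```

For a steady pure tone, nearly all energy lands in the harmonic channel, which is the behaviour the separation is meant to exhibit.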
In this embodiment, when the short-time Fourier transform is performed on the three-channel audio signal, the Hamming window length is set to 320 and the hop length to 160, yielding a 161 × n three-channel spectrogram I, where n depends on the length of the audio. Different window functions, such as a rectangular window or a Hann window, may also be used in the audio processing stage.
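The window arithmetic can be checked with a quick sketch (the one-second noise input is an arbitrary stand-in): a one-sided STFT with a 320-sample Hamming window yields 320/2 + 1 = 161 frequency bins, and the hop of 160 determines the time dimension n:

```python
import numpy as np
from scipy.signal import stft

fs = 16000
y = np.random.default_rng(0).standard_normal(fs)  # 1 s of noise as a stand-in
# Hamming window of length 320; a hop of 160 means noverlap = 320 - 160
f, t, Z = stft(y, fs=fs, window="hamming", nperseg=320, noverlap=160)
spec = np.abs(Z)   # magnitude spectrogram, shape (161, n)
```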
Step 102, simultaneously transposing and then respectively padding or truncating the same group of three-channel spectrograms I to obtain two groups of three-channel spectrograms II.
In this embodiment, after the 161 × n three-channel spectrogram I is obtained, it is transposed into an n × 161 three-channel spectrogram I, which is then padded or truncated to obtain two groups of 300 × 161 three-channel spectrograms II.
In this embodiment, the same group of three-channel spectrograms I is transposed and then padded or truncated twice, simultaneously and respectively, so as to obtain the two groups of three-channel spectrograms II.
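The transpose-then-pad-or-truncate step can be sketched as follows (the helper name is ours; the 300-frame target follows the 300 × 161 shape stated above):

```python
import numpy as np

def to_fixed_frames(spec, target_frames=300):
    """Transpose a (161, n) spectrogram to (n, 161), then zero-pad or
    truncate along the time axis so every input becomes (target_frames, 161)."""
    spec_t = spec.T
    n = spec_t.shape[0]
    if n >= target_frames:
        return spec_t[:target_frames]            # truncate long clips
    return np.pad(spec_t, ((0, target_frames - n), (0, 0)))  # pad short clips

short_spec = to_fixed_frames(np.ones((161, 120)))  # padded up to 300 frames
long_spec = to_fixed_frames(np.ones((161, 450)))   # truncated down to 300 frames
```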
Step 103, inputting the two groups of three-channel spectrograms II into an improved LeViT neural network model, recognizing the voiceprint of the audio information with the improved LeViT neural network model, and outputting voiceprint identification information of the audio information, the voiceprint identification information comprising at least the speaker corresponding to the audio information.
In this embodiment, the two obtained groups of three-channel spectrograms II are treated as images and input into the improved LeViT neural network model, which recognizes the voiceprints in the two groups of three-channel spectrograms II, thereby obtaining the voiceprint identification information of the audio information to be identified and, in turn, the identity of its speaker.
In this embodiment, the improved LeViT neural network model is based on LeViT, a Transformer architecture for fast-inference vision. Although designed for image recognition, it is applied here to audio data: a model pre-trained on a large amount of image data can serve as the starting point of a speech recognition model, which alleviates the shortage of data for large-scale speech recognition tasks.
In this embodiment, because the performance of Transformer models relies on large amounts of training data while high-quality speech data is scarce, an image-based pre-trained model is applied to transfer the knowledge learned in the image domain to voiceprint recognition. First, a model pre-trained on the ImageNet data set is introduced as the starting point of training; high recognition accuracy can then be obtained merely by fine-tuning the model parameters. The voiceprint recognition experiments are based on the VoxCeleb1 data set, and two methods of dividing the training and test sets are adopted. The first selects 21245 audio recordings, divides them into sentences to form the training set, and divides the remaining recordings into sentences to form the test set. The second divides the audio directly into sentences, then randomly selects 85% of the sentences as the training set and uses the rest as the test set.
In this embodiment, the improved LeViT neural network model is trained extensively so that it can recognize large amounts of audio information quickly and accurately; the audio data to be identified, after the audio processing above, is then input into the improved LeViT neural network model for recognition, yielding the identity of the speaker corresponding to the audio information to be identified.
The embodiment of the application provides a method, a device and equipment for multi-channel voiceprint recognition based on a Transformer framework. The method comprises the following steps: performing sound source decomposition on audio information to be identified and obtaining a three-channel spectrogram I through short-time Fourier transform; simultaneously transposing and then padding or truncating the same group of three-channel spectrograms I to obtain two groups of three-channel spectrograms II; and inputting the two groups of three-channel spectrograms II into an improved LeViT neural network model, recognizing the voiceprint of the audio information with the improved LeViT neural network model, and outputting voiceprint identification information comprising at least the speaker corresponding to the audio information. By this method, the audio information is processed into audio signals of a plurality of channels, achieving higher audio recognition accuracy; and the improved LeViT neural network model, with its increased complexity, identifies the audio signal more accurately, greatly improving the accuracy.
Further, referring to fig. 2, fig. 2 is a schematic sub-flow diagram of the Transformer framework-based multi-channel voiceprint recognition method in this embodiment of the present application. In this embodiment, the improved LeViT neural network model comprises a feature extraction module, a self-attention-multilayer perceptron module and a contraction self-attention module, and inputting the two groups of three-channel spectrograms II into the improved LeViT neural network model to recognize the voiceprint of the audio information specifically includes:
In this embodiment, the improved LeViT neural network model comprises a feature extraction module, a self-attention-multilayer perceptron module and a contraction self-attention module. First, feature extraction is performed on the two groups of three-channel spectrograms II by the feature extraction module, which consists of 4 convolutional layers, each with a 3 × 3 kernel and a stride of 2; a convolutional network better extracts the features of the spectrogram and their contextual links. The two groups of three-channel spectrograms II are input into the improved LeViT neural network model, and the feature extraction module produces a 384 × 14 × 14 feature tensor. After the feature extraction module, the two groups of three-channel spectrograms II are processed by the self-attention-multilayer perceptron module and the contraction self-attention module of the improved LeViT neural network model, respectively.
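The 14 × 14 spatial size can be reproduced with the stride arithmetic of the four stride-2, 3 × 3 convolutions (assuming padding 1 and a 224 × 224 input, as in the standard LeViT patch embedding; the patent itself only states the 384 × 14 × 14 output):

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output size of one convolution along a single spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224
for _ in range(4):          # four stacked stride-2 convolutions
    size = conv_out(size)   # 224 -> 112 -> 56 -> 28 -> 14
```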
In this embodiment, please refer to fig. 3, which shows the processing flow of the improved LeViT neural network model. After feature extraction, the two groups of three-channel spectrograms II pass through a first, a second and a third stage in sequence. The first stage comprises, in order, a self-attention-multilayer perceptron module, a contraction self-attention module and a multilayer perceptron module; the second stage likewise comprises a self-attention-multilayer perceptron module, a contraction self-attention module and a multilayer perceptron module; the third stage comprises a self-attention-multilayer perceptron module followed by an average pooling layer. The self-attention module learns a weight for each audio feature, and the multilayer perceptron module maps the audio features to a larger dimension so that more of their information can be represented. The average pooling layer extracts the feature map of the last convolutional layer, outputs the feature points of each feature map, and constructs a feature vector from all the feature points; the feature vector is input into a SoftMax classifier (the supervised classifier in fig. 3) according to the categories of the classification task, thereby obtaining the identity of the speaker corresponding to the audio information to be identified.
In this embodiment, the three-channel spectrograms II processed by the first stage yield a tensor of dimension 384 × 14 × 14; the second stage yields a tensor of dimension 512 × 14 × 14; and the third stage yields a tensor of dimension 768 × 4 × 4; each tensor represents features of the spectrum.
Furthermore, simultaneously transposing and then respectively padding or truncating the same group of three-channel spectrograms I to obtain the two groups of three-channel spectrograms II further comprises:
filtering the three-channel spectrogram I through a mel filter.
In this embodiment, when the short-time Fourier transform is performed on the three-channel audio signal with a Hamming window length of 1024 and a hop length of 512, a 513 × n three-channel spectrogram I is obtained, where n depends on the length of the audio. To reduce the computational overhead, the three-channel spectrogram I is passed through a mel filter to obtain a 64 × n spectrogram, which is then transposed and padded or truncated to obtain the other group of three-channel spectrograms II, of size 300 × 64. Besides the mel filter, transforms such as the Discrete Cosine Transform (DCT) can also be applied to obtain Mel-Frequency Cepstral Coefficients (MFCCs).
In this embodiment, the first mode directly transposes and then pads or truncates the three-channel spectrogram I; the second mode filters the three-channel spectrogram I through the mel filter and then transposes and pads or truncates the filtered result. The two modes process the three-channel spectrogram I simultaneously, thereby yielding the two groups of three-channel spectrograms II.
In this embodiment, the spectrogram filtered by the mel filter has sound frequencies closer to those audible to human beings, and its reduced spectral dimension lowers the computational overhead; however, because partial information is lost, the final recognition accuracy is lower than with the spectrogram not filtered by the mel filter.
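A minimal sketch of the mel filtering that reduces the 513-bin spectrogram to 64 bands (a simplified version of what a library mel filterbank builds; the edge handling is cruder than librosa's, and the random spectrogram is a stand-in):

```python
import numpy as np

def mel_filterbank(n_mels=64, n_fft=1024, sr=16000):
    """Minimal triangular mel filterbank mapping n_fft//2 + 1 linear
    frequency bins to n_mels mel bands."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):            # rising edge of triangle
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling edge of triangle
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

fb = mel_filterbank()                            # shape (64, 513)
spec = np.abs(np.random.default_rng(0).standard_normal((513, 40)))
mel_spec = fb @ spec                             # shape (64, 40): reduced dimension
```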
Further, referring to fig. 4, fig. 4 is another schematic sub-flow diagram of the Transformer framework-based multi-channel voiceprint recognition method in the embodiment of the present application. In this embodiment, the processing of the three-channel spectrograms II by the self-attention module specifically includes:
In this embodiment, the two groups of three-channel spectrograms II after feature extraction are subjected to linear transformation, which is simpler and more direct than a convolutional network. The first self-attention is obtained by calculation; the first tensor dimension is then obtained by transposing the first self-attention, changing its dimensions, applying the hardswish activation function, and finally connecting a first linear layer. Each stage's self-attention-multilayer perceptron module comprises 4 layers, in which the dimensions of the tensor are not changed.
Further, referring to fig. 6, fig. 6 is another schematic sub-flow diagram of the Transformer framework-based multi-channel voiceprint recognition method in an embodiment of the present application. In this embodiment, performing linear transformation on the two groups of feature-extracted three-channel spectrograms II and then calculating the first self-attention specifically includes:
In the present embodiment, please refer to fig. 5 and fig. 6, where fig. 5 shows the self-attention module, C denotes the number of channels, H the height, W the width, D the number of attention heads (appearing in the self-attention module), and N the batch size. The first self-attention parameters Q (query), K (key) and V (value) are obtained by applying linear transformations to the tensor dimensions after feature extraction and are then substituted into formula (1), the standard scaled dot-product attention, through which the first self-attention is obtained:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V    (1)

where d is the dimension of the key vectors.
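A minimal NumPy sketch of the scaled dot-product attention referred to as formula (1); the shapes below are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stabilised
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Formula (1): Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.swapaxes(-1, -2) / np.sqrt(d))
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # 4 queries of dimension 8
K = rng.standard_normal((6, 8))   # 6 keys
V = rng.standard_normal((6, 8))   # 6 values
out = attention(Q, K, V)          # shape (4, 8)
```

Each output row is a convex combination of the value rows, with weights given by the softmaxed query-key similarities.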
Further, referring to fig. 7, fig. 7 is another schematic sub-flow diagram of the Transformer framework-based multi-channel voiceprint recognition method in the embodiment of the present application. In this embodiment, the processing of the three-channel spectrograms II by the contraction self-attention module specifically includes:
In this embodiment, please refer to fig. 7 and fig. 8, where fig. 8 shows the contraction self-attention module, C denotes the number of channels, H the height, W the width, D the number of attention heads, and N the batch size. The difference between the contraction self-attention module and the self-attention module is that the length and width of the parameter Q are halved in the contraction self-attention module, so the length and width of the finally output vector are half of the original; for the same reason, the residual-connection structure is not used. The contraction self-attention module thus reduces the length and width dimensions of the audio features while learning the weights.
In this embodiment, after the two groups of three-channel spectrograms II are processed by the self-attention-multilayer perceptron module, they are processed by the contraction self-attention module, which first subsamples the two groups of three-channel spectrograms II and then applies linear transformations to obtain the second self-attention parameters Q, K and V, with the length and width of the parameter Q halved. The second self-attention is then calculated by formula (1), after which the obtained second self-attention undergoes transposition, dimension transformation, the activation function and a second linear layer to yield the second tensor dimension.
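The contraction variant can be sketched by computing Q from a stride-2 subsampled input, so the attended output has half the token count; the toy dimensions and projection matrices below are our assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 32))             # 16 tokens, feature dim 32
Wq, Wk, Wv = (rng.standard_normal((32, 32)) for _ in range(3))
Q = X[::2] @ Wq                               # stride-2 subsampling: only 8 queries
K, V = X @ Wk, X @ Wv                         # keys and values keep all 16 tokens
out = softmax(Q @ K.T / np.sqrt(32)) @ V      # shape (8, 32): sequence length halved
```

Because the output has fewer tokens than the input, a residual connection cannot be added directly, matching the note above about omitting the residual structure.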
Further, after the two sets of three-channel spectrograms ii after feature extraction are processed sequentially through the first stage, the second stage and the third stage, the method further includes:
and inputting the two groups of three-channel spectrograms II processed by the third stage into a softmax classifier to obtain a speaker corresponding to the audio information.
In this embodiment, the softmax classifier classifies the obtained tensors: it determines a probability for each class, and the class with the highest probability is the one output, thereby yielding the identity of the speaker corresponding to the audio information.
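A sketch of this final classification step; the logits are made-up numbers standing in for the pooled feature vector's per-speaker scores:

```python
import numpy as np

logits = np.array([1.2, 0.3, 2.5, -0.7])  # one score per enrolled speaker
probs = np.exp(logits - logits.max())     # stabilised softmax, numerator
probs /= probs.sum()                      # probabilities sum to 1
speaker = int(np.argmax(probs))           # index of the most probable speaker
```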
Further, in this embodiment, a voiceprint recognition experiment is performed on the VoxCeleb1 data set, with two ways of partitioning: the first partitions the training and test sets at the video-slice level, and the second partitions them at the sentence level. The first way is more challenging because video slices in the test set are completely absent from the training set. GMM-UBM (Gaussian Mixture Model-Universal Background Model) and i-vectors + SVM (Support Vector Machine) are traditional voiceprint recognition methods, and their performance on this data is not ideal. Under the second partitioning strategy, and using only the STFT, the proposed Transformer-based single-channel model reaches a top-1 accuracy of 92.97%, while the proposed multi-channel model reaches top-1 accuracies of 94.94% and 98.94%, an improvement of about 2% over the single-channel model's top-1. In general, the second partitioning strategy performs better than the first, because under the first strategy the audio to which the test-set sentences belong never appears in the training set. In addition, the spectrogram using only the STFT gives higher accuracy than the spectrogram using the mel filter, because the mel filter reduces the complexity of the data to save computation cost.
Further, an embodiment of the present invention also provides a multi-channel voiceprint recognition apparatus 600 based on the transform framework. Fig. 9 is a schematic diagram of the program modules of the multi-channel voiceprint recognition apparatus based on the transform framework in this embodiment. In this embodiment, the multi-channel voiceprint recognition apparatus 600 based on the transform framework includes:
the transformation module 601: configured to perform sound source decomposition on the audio information to be identified and obtain a three-channel spectrogram I through short-time Fourier transform;
the processing module 602: configured to respectively transpose, pad, or truncate the same set of three-channel spectrograms I simultaneously to obtain two sets of three-channel spectrograms II;
the recognition module 603: configured to input the two sets of three-channel spectrograms II into an improved LeViT neural network model, recognize the voiceprint of the audio information using the improved LeViT neural network model, and output voiceprint recognition information of the audio information, wherein the voiceprint recognition information at least comprises the speaker corresponding to the audio information.
The embodiment of the application provides a multi-channel voiceprint recognition apparatus 600 based on the transform framework that can: perform sound source decomposition on the audio information to be identified and obtain a three-channel spectrogram I through short-time Fourier transform; respectively transpose, pad, or truncate the same set of three-channel spectrograms I simultaneously to obtain two sets of three-channel spectrograms II; and input the two sets of three-channel spectrograms II into an improved LeViT neural network model, which recognizes the voiceprint of the audio information and outputs voiceprint recognition information that at least comprises the speaker corresponding to the audio information. By the method provided by the invention, processing the audio information into audio signals of multiple channels yields higher recognition accuracy; and the improved LeViT neural network model identifies the audio signal more accurately, greatly improving accuracy at the cost of increased model complexity.
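A minimal sketch of the preprocessing path described above, using numpy only. Sound-source decomposition is elided (three pre-decomposed signals are assumed as input), and the FFT size, hop size, and fixed target frame count are illustrative assumptions, not values specified by the patent:

```python
import numpy as np

def stft_magnitude(signal, n_fft=256, hop=128):
    """Magnitude spectrogram via a simple Hann-windowed STFT."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    # rfft gives n_fft // 2 + 1 frequency bins per frame.
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=1))
    return spec.T  # shape: (freq_bins, time_frames)

def pad_or_truncate(spec, target_frames):
    """Fix the time axis to target_frames by zero-padding or truncating."""
    freq, frames = spec.shape
    if frames >= target_frames:
        return spec[:, :target_frames]
    out = np.zeros((freq, target_frames))
    out[:, :frames] = spec
    return out

def make_inputs(sources, target_frames=64):
    """Stack three decomposed sources into a 3-channel spectrogram I,
    then produce two variants of spectrogram II: one padded/truncated,
    and one additionally transposed on its (freq, time) axes."""
    chans = np.stack([pad_or_truncate(stft_magnitude(s), target_frames)
                      for s in sources])           # (3, freq, time)
    transposed = np.transpose(chans, (0, 2, 1))    # (3, time, freq)
    return chans, transposed

rng = np.random.default_rng(0)
sources = [rng.standard_normal(16000) for _ in range(3)]  # 3 decomposed signals
spec_a, spec_b = make_inputs(sources)
```

The two resulting arrays play the role of the two sets of three-channel spectrograms II that are fed to the network; a real implementation would use the decomposed sources of actual audio rather than random signals.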
Further, the present application also provides a voiceprint recognition terminal, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the above multi-channel voiceprint recognition method based on the transform framework.
Further, the present application also provides a storage medium having a computer program stored thereon, where the computer program, when executed by a processor, implements the steps in the transform framework-based multi-channel voiceprint recognition method as described above.
Each functional module in the embodiments of the present invention may be integrated into one processing module, each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as a separate product, it may be stored in a computer-readable storage medium.
Based on such understanding, the technical solution of the present invention in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that, for simplicity, the above method embodiments are described as a series of acts, but those skilled in the art will understand that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and the acts and modules involved are not necessarily all required by the invention. The descriptions of the respective embodiments above each have their own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
For those skilled in the art, there may be variations in specific implementation and application scope according to the idea of the embodiments of the present application; in summary, the content of this specification should not be construed as limiting the present invention.
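The contraction (shrink) self-attention used in the improved LeViT model, in which the length and width of the query Q are halved so that the stage downsamples its output, can be sketched roughly as follows. The subsampling scheme, matrix sizes, and all parameter names here are illustrative assumptions, not the patent's exact construction:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shrink_attention(x, wq, wk, wv):
    """Self-attention where Q is computed on a 2x-subsampled spatial grid,
    halving the output's length and width while K and V use the full grid."""
    h, w, c = x.shape
    q_in = x[::2, ::2, :]                 # halve length and width for Q
    q = q_in.reshape(-1, c) @ wq          # (h/2 * w/2, d)
    k = x.reshape(-1, c) @ wk             # (h * w, d)
    v = x.reshape(-1, c) @ wv             # (h * w, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]))  # scaled dot-product
    out = attn @ v                        # (h/2 * w/2, d)
    return out.reshape(h // 2, w // 2, -1)

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8, 16))       # (length, width, channels)
wq = rng.standard_normal((16, 32))
wk = rng.standard_normal((16, 32))
wv = rng.standard_normal((16, 32))
y = shrink_attention(x, wq, wk, wv)       # spatial size drops from 8x8 to 4x4
```

Because only Q is subsampled, every output position still attends over the full input grid, which is what lets the stage reduce resolution without discarding context.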
Claims (10)
1. A multichannel voiceprint recognition method based on a transform framework is characterized by comprising the following steps:
carrying out sound source decomposition on audio information to be identified, and obtaining a three-channel spectrogram I through short-time Fourier transform;
transposing, padding, or truncating the same set of three-channel spectrograms I respectively and simultaneously to obtain two sets of three-channel spectrograms II;
inputting the two sets of three-channel spectrograms II into an improved LeViT neural network model, and performing recognition processing on the voiceprint of the audio information by using the improved LeViT neural network model to output voiceprint recognition information of the audio information, wherein the voiceprint recognition information at least comprises the speaker corresponding to the audio information.
2. The method according to claim 1, wherein the improved LeViT neural network model comprises a feature extraction module, a self-attention-multilayer perceptron module and a contraction self-attention module, and the inputting the two sets of three-channel spectrograms II into the improved LeViT neural network model and using the improved LeViT neural network model to perform recognition processing on the voiceprint of the audio information specifically comprises:
extracting features of the two sets of three-channel spectrograms II through the feature extraction module;
sequentially processing the two sets of three-channel spectrograms II after feature extraction through a first stage, a second stage and a third stage; wherein the first stage and the second stage each sequentially comprise the self-attention-multilayer perceptron module, the contraction self-attention module and a multilayer perceptron module, and the third stage sequentially comprises the self-attention-multilayer perceptron module and an average pooling layer module.
3. The method of claim 1, wherein the transposing, padding, or truncating of the same set of three-channel spectrograms I simultaneously to obtain two sets of three-channel spectrograms II further comprises:
filtering the three-channel spectrogram I through a Mel filter.
4. The method of claim 2, wherein the processing of the three-channel spectrogram II by the self-attention module specifically comprises:
performing linear transformation on the two sets of three-channel spectrograms II after feature extraction, and then calculating a first self-attention;
transposing the first self-attention and changing its dimensionality, inputting the result into an activation function for calculation, and obtaining a first tensor dimension through a first linear layer.
5. The method according to claim 4, wherein performing the linear transformation on the two sets of three-channel spectrograms II after feature extraction and then calculating the first self-attention specifically comprises:
performing linear transformation on the two sets of three-channel spectrograms II after feature extraction to obtain parameters of the first self-attention, the parameters comprising at least Q, K and V, wherein Q is the query parameter, K is the key parameter, and V is the queried value parameter;
calculating the first self-attention from the parameters Q, K and V.
6. The method of claim 3, wherein the processing of the three-channel spectrogram II by the contraction self-attention module specifically comprises:
sampling the two sets of three-channel spectrograms II processed by the self-attention-multilayer perceptron module to obtain second self-attention parameters Q, K and V;
halving the length and the width of the parameter Q, and then calculating the second self-attention;
transposing the second self-attention and changing its dimensionality, inputting the result into an activation function for calculation, and obtaining a second tensor dimension through a second linear layer.
7. The method according to claim 2, wherein after the two sets of three-channel spectrograms II after feature extraction are processed sequentially through the first stage, the second stage and the third stage, the method further comprises:
inputting the two sets of three-channel spectrograms II processed by the third stage into a softmax classifier to obtain the speaker corresponding to the audio information.
8. A multi-channel voiceprint recognition device based on a transform framework is characterized by comprising:
a transformation module: configured to perform sound source decomposition on the audio information to be identified and obtain a three-channel spectrogram I through short-time Fourier transform;
a processing module: configured to respectively transpose, pad, or truncate the same set of three-channel spectrograms I simultaneously to obtain two sets of three-channel spectrograms II;
an identification module: configured to input the two sets of three-channel spectrograms II into an improved LeViT neural network model, recognize the voiceprint of the audio information using the improved LeViT neural network model, and output voiceprint recognition information of the audio information, wherein the voiceprint recognition information at least comprises the speaker corresponding to the audio information.
9. A voiceprint recognition terminal comprising a memory, a processor and a computer program stored in said memory and executable on said processor, wherein said processor, when executing said computer program, implements the steps of the transform framework based multi-channel voiceprint recognition method according to any one of claims 1 to 7.
10. A storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, performs the steps of the transform framework based multi-channel voiceprint recognition method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111682904.3A CN114446308A (en) | 2021-12-31 | 2021-12-31 | Multi-channel voiceprint recognition method, device and equipment based on transform framework |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111682904.3A CN114446308A (en) | 2021-12-31 | 2021-12-31 | Multi-channel voiceprint recognition method, device and equipment based on transform framework |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114446308A true CN114446308A (en) | 2022-05-06 |
Family
ID=81365402
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111682904.3A Pending CN114446308A (en) | 2021-12-31 | 2021-12-31 | Multi-channel voiceprint recognition method, device and equipment based on transform framework |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114446308A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116612783A (en) * | 2023-07-17 | 2023-08-18 | 联想新视界(北京)科技有限公司 | Voice recognition method and device, electronic equipment and storage medium |
CN116612783B (en) * | 2023-07-17 | 2023-10-27 | 联想新视界(北京)科技有限公司 | Voice recognition method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||