CN114446308A - Multi-channel voiceprint recognition method, device and equipment based on Transformer framework - Google Patents
- Publication number: CN114446308A
- Application number: CN202111682904.3A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L17/02: Speaker identification or verification techniques; preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/04: Speaker identification or verification techniques; training, enrolment or model building
- G10L17/18: Speaker identification or verification techniques; artificial neural networks; connectionist approaches
- G10L25/18: Speech or voice analysis techniques not restricted to groups G10L15/00-G10L21/00, characterised by the extracted parameters being spectral information of each sub-band
- G10L25/45: Speech or voice analysis techniques not restricted to groups G10L15/00-G10L21/00, characterised by the type of analysis window
Abstract
The invention discloses a multi-channel voiceprint recognition method, device and equipment based on a Transformer framework. The method comprises the following steps: performing sound source decomposition on audio information to be identified and obtaining a three-channel spectrogram I through short-time Fourier transform; simultaneously transposing and then padding or truncating the same group of three-channel spectrograms I to obtain two groups of three-channel spectrograms II; and inputting the two groups of three-channel spectrograms II into an improved LeViT neural network model, which recognizes the voiceprint of the audio information and outputs voiceprint identification information comprising at least the speaker corresponding to the audio information. By processing the audio information into audio signals of a plurality of channels, the method achieves higher audio recognition accuracy; and the improved LeViT neural network model, with its increased model complexity, identifies the audio signal more accurately, greatly improving the accuracy.
Description
Technical Field
The invention relates to the technical field of sound detection, and in particular to a multi-channel voiceprint recognition method, device, equipment and storage medium based on a Transformer framework.
Background
At present, traditional methods for the voiceprint recognition task include the Gaussian Mixture Model-Universal Background Model (GMM-UBM), Support Vector Machine (SVM) classifiers built on GMM-UBM, and i-vector models. These conventional methods recognize poorly on large-scale data, and deep learning has therefore been introduced to process such data.
The networks proposed for this purpose are mainly Convolutional Neural Networks (CNNs), including single-channel and multi-channel CNNs, but the limited network complexity of CNNs also limits the final recognition accuracy.
In view of the above, it is necessary to provide further improvements to the current voiceprint recognition method.
Disclosure of Invention
Therefore, the present invention aims to remedy the deficiencies in the prior art at least to some extent, and accordingly provides a method, an apparatus, a device and a storage medium for multi-channel voiceprint recognition based on a Transformer framework.
In a first aspect, the present invention provides a Transformer framework-based multi-channel voiceprint recognition method, the method comprising:
performing sound source decomposition on audio information to be identified, and obtaining a three-channel spectrogram I through short-time Fourier transform;
simultaneously transposing and then respectively padding or truncating the same group of three-channel spectrograms I to obtain two groups of three-channel spectrograms II;
and inputting the two groups of three-channel spectrograms II into an improved LeViT neural network model, recognizing the voiceprint of the audio information with the improved LeViT neural network model, and outputting voiceprint identification information of the audio information, the voiceprint identification information comprising at least the speaker corresponding to the audio information.
In a second aspect, the present invention provides a multi-channel voiceprint recognition apparatus based on a Transformer framework, comprising:
a transformation module, configured to perform sound source decomposition on audio information to be identified and then obtain a three-channel spectrogram I through short-time Fourier transform;
a processing module, configured to simultaneously transpose and then pad or truncate the same group of three-channel spectrograms I to obtain two groups of three-channel spectrograms II;
an identification module, configured to input the two groups of three-channel spectrograms II into an improved LeViT neural network model, recognize the voiceprint of the audio information with the improved LeViT neural network model, and output voiceprint identification information of the audio information, the voiceprint identification information comprising at least the speaker corresponding to the audio information.
In a third aspect, the present invention further provides a voiceprint recognition terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the Transformer framework-based multi-channel voiceprint recognition method according to the first aspect.
In a fourth aspect, the present invention also provides a storage medium on which a computer program is stored, the computer program, when executed, implementing the steps of the Transformer framework-based multi-channel voiceprint recognition method according to the first aspect.
The invention provides a multi-channel voiceprint recognition method, device and equipment based on a Transformer framework. The method comprises the following steps: performing sound source decomposition on audio information to be identified and obtaining a three-channel spectrogram I through short-time Fourier transform; simultaneously transposing and then padding or truncating the same group of three-channel spectrograms I to obtain two groups of three-channel spectrograms II; and inputting the two groups of three-channel spectrograms II into an improved LeViT neural network model, recognizing the voiceprint of the audio information with the improved LeViT neural network model, and outputting voiceprint identification information comprising at least the speaker corresponding to the audio information. By processing the audio information into audio signals of a plurality of channels, the method achieves higher audio recognition accuracy; and the improved LeViT neural network model, with its increased complexity, identifies the audio signal more accurately, greatly improving the accuracy.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic flow chart of the Transformer framework-based multi-channel voiceprint recognition method according to the present invention;
FIG. 2 is a schematic sub-flow diagram of the Transformer framework-based multi-channel voiceprint recognition method of the present invention;
FIG. 3 is a schematic flow chart of the neural network in the Transformer framework-based multi-channel voiceprint recognition method according to the present invention;
FIG. 4 is a schematic diagram of another sub-flow of the Transformer framework-based multi-channel voiceprint recognition method of the present invention;
FIG. 5 is a schematic flow chart of the self-attention module of the Transformer framework-based multi-channel voiceprint recognition method according to the present invention;
FIG. 6 is a schematic diagram of another sub-flow of the Transformer framework-based multi-channel voiceprint recognition method according to the present invention;
FIG. 7 is a schematic diagram of another sub-flow of the Transformer framework-based multi-channel voiceprint recognition method of the present invention;
FIG. 8 is a schematic flow chart of the contraction self-attention module of the Transformer framework-based multi-channel voiceprint recognition method according to the present invention;
FIG. 9 is a schematic diagram of the program modules of the Transformer framework-based multi-channel voiceprint recognition apparatus according to the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic flow chart of a multi-channel voiceprint recognition method based on a Transformer framework in an embodiment of the present application. In this embodiment, the method includes:
Step 101, performing sound source decomposition on the audio information to be identified, and obtaining a three-channel spectrogram I through short-time Fourier transform.
In this embodiment, harmonic-percussive source separation (HPSS) in libROSA is first performed on the audio information to be identified to separate its harmonic and percussive components; the harmonic and percussive source signals are separated from the original audio information to obtain a three-channel audio signal, on which a short-time Fourier transform is then performed to obtain the three-channel spectrogram I. libROSA is a Python package for music and audio analysis.
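As a hedged illustration of this step, the HPSS decomposition can be sketched without libROSA using the standard median-filtering formulation on the STFT magnitude (harmonics are smooth along time, percussive events along frequency). The kernel size and the synthetic test tone below are illustrative assumptions, not values from the patent:

```python
import numpy as np
from scipy.signal import stft, istft, medfilt2d

def simple_hpss(y, fs=16000, nperseg=320, noverlap=160, kernel=17):
    """Toy harmonic-percussive source separation (median-filtering HPSS).

    A minimal stand-in for libROSA's HPSS: median-filter the STFT
    magnitude along time (harmonic) and along frequency (percussive),
    build soft masks, and invert each masked spectrogram.
    """
    _, _, Z = stft(y, fs=fs, window="hamming", nperseg=nperseg, noverlap=noverlap)
    S = np.abs(Z)
    H = medfilt2d(S, kernel_size=[1, kernel])   # smooth along the time axis
    P = medfilt2d(S, kernel_size=[kernel, 1])   # smooth along the frequency axis
    eps = 1e-10
    _, y_h = istft(Z * (H / (H + P + eps)), fs=fs, window="hamming",
                   nperseg=nperseg, noverlap=noverlap)
    _, y_p = istft(Z * (P / (H + P + eps)), fs=fs, window="hamming",
                   nperseg=nperseg, noverlap=noverlap)
    return y_h[:len(y)], y_p[:len(y)]

# The original waveform plus the two separated signals form the
# three-channel audio signal described above.
fs = 16000
y = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # 1 s test tone (assumption)
y_harm, y_perc = simple_hpss(y, fs=fs)
three_channel = np.stack([y, y_harm, y_perc])     # shape (3, n_samples)
```

For a steady pure tone, nearly all energy lands in the harmonic channel, which is the behaviour the separation is meant to exhibit.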
In this embodiment, when the short-time Fourier transform is performed on the three-channel audio signal, the Hamming window length is set to 320 and the hop length to 160, yielding a 161 × n three-channel spectrogram I, where n depends on the length of the audio. Different window functions, such as a rectangular window or a Hann window, may also be used in the audio processing stage.
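The window arithmetic can be checked with a quick sketch (the one-second noise input is an arbitrary stand-in): a one-sided STFT with a 320-sample Hamming window yields 320/2 + 1 = 161 frequency bins, and the hop of 160 determines the time dimension n:

```python
import numpy as np
from scipy.signal import stft

fs = 16000
y = np.random.default_rng(0).standard_normal(fs)  # 1 s of noise as a stand-in
# Hamming window of length 320; a hop of 160 means noverlap = 320 - 160
f, t, Z = stft(y, fs=fs, window="hamming", nperseg=320, noverlap=160)
spec = np.abs(Z)   # magnitude spectrogram, shape (161, n)
```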
Step 102, simultaneously transposing and then respectively padding or truncating the same group of three-channel spectrograms I to obtain two groups of three-channel spectrograms II.
In this embodiment, after the 161 × n three-channel spectrogram I is obtained, it is transposed into an n × 161 three-channel spectrogram I, which is then padded or truncated to obtain two groups of 300 × 161 three-channel spectrograms II.
In this embodiment, the same group of three-channel spectrograms I is transposed and then padded or truncated twice, simultaneously and respectively, so as to obtain the two groups of three-channel spectrograms II.
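The transpose-then-pad-or-truncate step can be sketched as follows (the helper name is ours; the 300-frame target follows the 300 × 161 shape stated above):

```python
import numpy as np

def to_fixed_frames(spec, target_frames=300):
    """Transpose a (161, n) spectrogram to (n, 161), then zero-pad or
    truncate along the time axis so every input becomes (target_frames, 161)."""
    spec_t = spec.T
    n = spec_t.shape[0]
    if n >= target_frames:
        return spec_t[:target_frames]            # truncate long clips
    return np.pad(spec_t, ((0, target_frames - n), (0, 0)))  # pad short clips

short_spec = to_fixed_frames(np.ones((161, 120)))  # padded up to 300 frames
long_spec = to_fixed_frames(np.ones((161, 450)))   # truncated down to 300 frames
```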
Step 103, inputting the two groups of three-channel spectrograms II into an improved LeViT neural network model, recognizing the voiceprint of the audio information with the improved LeViT neural network model, and outputting voiceprint identification information of the audio information, the voiceprint identification information comprising at least the speaker corresponding to the audio information.
In this embodiment, the two obtained groups of three-channel spectrograms II are treated as images and input into the improved LeViT neural network model, which recognizes the voiceprints in the two groups of three-channel spectrograms II, thereby obtaining the voiceprint identification information of the audio information to be identified and, in turn, the identity of its speaker.
In this embodiment, the improved LeViT neural network model is based on LeViT, a Transformer architecture for fast-inference vision. Although designed for image recognition, it is applied here to audio data: a model pre-trained on a large amount of image data can serve as the starting point of a speech recognition model, which alleviates the shortage of data for large-scale speech recognition tasks.
In this embodiment, because the performance of Transformer models relies on large amounts of training data while high-quality speech data is scarce, an image-based pre-trained model is applied to transfer the knowledge learned in the image domain to voiceprint recognition. First, a model pre-trained on the ImageNet data set is introduced as the starting point of training; high recognition accuracy can then be obtained merely by fine-tuning the model parameters. The voiceprint recognition experiments are based on the VoxCeleb1 data set, and two methods of dividing the training and test sets are adopted. The first selects 21245 audio recordings, divides them into sentences to form the training set, and divides the remaining recordings into sentences to form the test set. The second divides the audio directly into sentences, then randomly selects 85% of the sentences as the training set and uses the rest as the test set.
In this embodiment, the improved LeViT neural network model is trained extensively so that it can recognize large amounts of audio information quickly and accurately; the audio data to be identified, after the audio processing above, is then input into the improved LeViT neural network model for recognition, yielding the identity of the speaker corresponding to the audio information to be identified.
The embodiment of the application provides a method, a device and equipment for multi-channel voiceprint recognition based on a Transformer framework. The method comprises the following steps: performing sound source decomposition on audio information to be identified and obtaining a three-channel spectrogram I through short-time Fourier transform; simultaneously transposing and then padding or truncating the same group of three-channel spectrograms I to obtain two groups of three-channel spectrograms II; and inputting the two groups of three-channel spectrograms II into an improved LeViT neural network model, recognizing the voiceprint of the audio information with the improved LeViT neural network model, and outputting voiceprint identification information comprising at least the speaker corresponding to the audio information. By this method, the audio information is processed into audio signals of a plurality of channels, achieving higher audio recognition accuracy; and the improved LeViT neural network model, with its increased complexity, identifies the audio signal more accurately, greatly improving the accuracy.
Further, referring to fig. 2, fig. 2 is a schematic sub-flow diagram of the Transformer framework-based multi-channel voiceprint recognition method in this embodiment of the present application. In this embodiment, the improved LeViT neural network model comprises a feature extraction module, a self-attention-multilayer perceptron module and a contraction self-attention module, and inputting the two groups of three-channel spectrograms II into the improved LeViT neural network model to recognize the voiceprint of the audio information specifically includes:
In this embodiment, the improved LeViT neural network model comprises a feature extraction module, a self-attention-multilayer perceptron module and a contraction self-attention module. First, feature extraction is performed on the two groups of three-channel spectrograms II by the feature extraction module, which consists of 4 convolutional layers, each with a 3 × 3 kernel and a stride of 2; a convolutional network better extracts the features of the spectrogram and their contextual links. The two groups of three-channel spectrograms II are input into the improved LeViT neural network model, and the feature extraction module produces a 384 × 14 × 14 feature tensor. After the feature extraction module, the two groups of three-channel spectrograms II are processed by the self-attention-multilayer perceptron module and the contraction self-attention module of the improved LeViT neural network model, respectively.
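The 14 × 14 spatial size can be reproduced with the stride arithmetic of the four stride-2, 3 × 3 convolutions (assuming padding 1 and a 224 × 224 input, as in the standard LeViT patch embedding; the patent itself only states the 384 × 14 × 14 output):

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output size of one convolution along a single spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224
for _ in range(4):          # four stacked stride-2 convolutions
    size = conv_out(size)   # 224 -> 112 -> 56 -> 28 -> 14
```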
In this embodiment, please refer to fig. 3, which shows the processing flow of the improved LeViT neural network model. After feature extraction, the two groups of three-channel spectrograms II pass through a first, a second and a third stage in sequence. The first stage comprises, in order, a self-attention-multilayer perceptron module, a contraction self-attention module and a multilayer perceptron module; the second stage likewise comprises a self-attention-multilayer perceptron module, a contraction self-attention module and a multilayer perceptron module; the third stage comprises a self-attention-multilayer perceptron module followed by an average pooling layer. The self-attention module learns a weight for each audio feature, and the multilayer perceptron module maps the audio features to a larger dimension so that more of their information can be represented. The average pooling layer extracts the feature map of the last convolutional layer, outputs the feature points of each feature map, and constructs a feature vector from all the feature points; the feature vector is input into a SoftMax classifier (the supervised classifier in fig. 3) according to the categories of the classification task, thereby obtaining the identity of the speaker corresponding to the audio information to be identified.
In this embodiment, the three-channel spectrograms II processed by the first stage yield a tensor of dimension 384 × 14 × 14; the second stage yields a tensor of dimension 512 × 14 × 14; and the third stage yields a tensor of dimension 768 × 4 × 4; each tensor represents features of the spectrum.
Furthermore, simultaneously transposing and then respectively padding or truncating the same group of three-channel spectrograms I to obtain the two groups of three-channel spectrograms II further comprises:
filtering the three-channel spectrogram I through a mel filter.
In this embodiment, when the short-time Fourier transform is performed on the three-channel audio signal with a Hamming window length of 1024 and a hop length of 512, a 513 × n three-channel spectrogram I is obtained, where n depends on the length of the audio. To reduce the computational overhead, the three-channel spectrogram I is passed through a mel filter to obtain a 64 × n spectrogram, which is then transposed and padded or truncated to obtain the other group of three-channel spectrograms II, of size 300 × 64. Besides the mel filter, transforms such as the Discrete Cosine Transform (DCT) can also be applied to obtain Mel-Frequency Cepstral Coefficients (MFCCs).
In this embodiment, the first mode directly transposes and then pads or truncates the three-channel spectrogram I; the second mode filters the three-channel spectrogram I through the mel filter and then transposes and pads or truncates the filtered result. The two modes process the three-channel spectrogram I simultaneously, thereby yielding the two groups of three-channel spectrograms II.
In this embodiment, the spectrogram filtered by the mel filter has sound frequencies closer to those audible to human beings, and its reduced spectral dimension lowers the computational overhead; however, because partial information is lost, the final recognition accuracy is lower than with the spectrogram not filtered by the mel filter.
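A minimal sketch of the mel filtering that reduces the 513-bin spectrogram to 64 bands (a simplified version of what a library mel filterbank builds; the edge handling is cruder than librosa's, and the random spectrogram is a stand-in):

```python
import numpy as np

def mel_filterbank(n_mels=64, n_fft=1024, sr=16000):
    """Minimal triangular mel filterbank mapping n_fft//2 + 1 linear
    frequency bins to n_mels mel bands."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):            # rising edge of triangle
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling edge of triangle
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

fb = mel_filterbank()                            # shape (64, 513)
spec = np.abs(np.random.default_rng(0).standard_normal((513, 40)))
mel_spec = fb @ spec                             # shape (64, 40): reduced dimension
```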
Further, referring to fig. 4, fig. 4 is another schematic sub-flow diagram of the Transformer framework-based multi-channel voiceprint recognition method in the embodiment of the present application. In this embodiment, the processing of the three-channel spectrograms II by the self-attention module specifically includes:
In this embodiment, the two groups of three-channel spectrograms II after feature extraction are subjected to linear transformation, which is simpler and more direct than a convolutional network. The first self-attention is obtained by calculation; the first tensor dimension is then obtained by transposing the first self-attention, changing its dimensions, applying the hardswish activation function, and finally connecting a first linear layer. Each stage's self-attention-multilayer perceptron module comprises 4 layers, in which the dimensions of the tensor are not changed.
Further, referring to fig. 6, fig. 6 is another schematic sub-flow diagram of the Transformer framework-based multi-channel voiceprint recognition method in an embodiment of the present application. In this embodiment, performing linear transformation on the two groups of feature-extracted three-channel spectrograms II and then calculating the first self-attention specifically includes:
In the present embodiment, please refer to fig. 5 and fig. 6, where fig. 5 shows the self-attention module, C denotes the number of channels, H the height, W the width, D the number of attention heads (appearing in the self-attention module), and N the batch size. The first self-attention parameters Q (query), K (key) and V (value) are obtained by applying linear transformations to the tensor dimensions after feature extraction and are then substituted into formula (1), the standard scaled dot-product attention, through which the first self-attention is obtained:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V    (1)

where d is the dimension of the key vectors.
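A minimal NumPy sketch of the scaled dot-product attention referred to as formula (1); the shapes below are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stabilised
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Formula (1): Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.swapaxes(-1, -2) / np.sqrt(d))
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # 4 queries of dimension 8
K = rng.standard_normal((6, 8))   # 6 keys
V = rng.standard_normal((6, 8))   # 6 values
out = attention(Q, K, V)          # shape (4, 8)
```

Each output row is a convex combination of the value rows, with weights given by the softmaxed query-key similarities.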
Further, referring to fig. 7, fig. 7 is another schematic sub-flow diagram of the Transformer framework-based multi-channel voiceprint recognition method in the embodiment of the present application. In this embodiment, the processing of the three-channel spectrograms II by the contraction self-attention module specifically includes:
In this embodiment, please refer to fig. 7 and fig. 8, where fig. 8 shows the contraction self-attention module, C denotes the number of channels, H the height, W the width, D the number of attention heads, and N the batch size. The difference between the contraction self-attention module and the self-attention module is that the length and width of the parameter Q are halved in the contraction self-attention module, so the length and width of the finally output vector are half of the original; for the same reason, the residual-connection structure is not used. The contraction self-attention module thus reduces the length and width dimensions of the audio features while learning the weights.
In this embodiment, after the two groups of three-channel spectrograms II are processed by the self-attention-multilayer perceptron module, they are processed by the contraction self-attention module, which first subsamples the two groups of three-channel spectrograms II and then applies linear transformations to obtain the second self-attention parameters Q, K and V, with the length and width of the parameter Q halved. The second self-attention is then calculated by formula (1), after which the obtained second self-attention undergoes transposition, dimension transformation, the activation function and a second linear layer to yield the second tensor dimension.
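The contraction variant can be sketched by computing Q from a stride-2 subsampled input, so the attended output has half the token count; the toy dimensions and projection matrices below are our assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 32))             # 16 tokens, feature dim 32
Wq, Wk, Wv = (rng.standard_normal((32, 32)) for _ in range(3))
Q = X[::2] @ Wq                               # stride-2 subsampling: only 8 queries
K, V = X @ Wk, X @ Wv                         # keys and values keep all 16 tokens
out = softmax(Q @ K.T / np.sqrt(32)) @ V      # shape (8, 32): sequence length halved
```

Because the output has fewer tokens than the input, a residual connection cannot be added directly, matching the note above about omitting the residual structure.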
Further, after the two sets of three-channel spectrograms ii after feature extraction are processed sequentially through the first stage, the second stage and the third stage, the method further includes:
and inputting the two groups of three-channel spectrograms II processed by the third stage into a softmax classifier to obtain a speaker corresponding to the audio information.
In this embodiment, the softmax classifier classifies the obtained tensors: it determines a probability for each class, and the class with the highest probability is the one output, thereby yielding the identity of the speaker corresponding to the audio information.
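A sketch of this final classification step; the logits are made-up numbers standing in for the pooled feature vector's per-speaker scores:

```python
import numpy as np

logits = np.array([1.2, 0.3, 2.5, -0.7])  # one score per enrolled speaker
probs = np.exp(logits - logits.max())     # stabilised softmax, numerator
probs /= probs.sum()                      # probabilities sum to 1
speaker = int(np.argmax(probs))           # index of the most probable speaker
```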
Further, in this embodiment, a voiceprint recognition experiment is performed on the VoxCeleb1 data set, with two ways of partitioning: the first partitions the training and test sets at the video-slice level, and the second partitions them at the sentence level. The first way is more challenging because video slices in the test set are completely absent from the training set. GMM-UBM (Gaussian Mixture Model-Universal Background Model) and i-vectors + SVM (Support Vector Machine) are traditional voiceprint recognition methods, and their performance on this data is not ideal. Under the second partitioning strategy, and using only the STFT, the proposed Transformer-based single-channel model reaches a top-1 accuracy of 92.97%, while the proposed multi-channel model reaches top-1 accuracies of 94.94% and 98.94%, an improvement of about 2% over the single-channel model's top-1. In general, the second partitioning strategy performs better than the first, because under the first strategy the audio to which the test-set sentences belong never appears in the training set. In addition, the spectrogram using only the STFT gives higher accuracy than the spectrogram using the mel filter, because the mel filter reduces the complexity of the data to save computation cost.
Further, an embodiment of the present invention also provides a multi-channel voiceprint recognition apparatus 600 based on the transform framework. Fig. 9 is a schematic diagram of the program modules of the multi-channel voiceprint recognition apparatus based on the transform framework in this embodiment. In this embodiment, the multi-channel voiceprint recognition apparatus 600 based on the transform framework includes:
the transformation module 601: configured to perform sound source decomposition on the audio information to be identified and obtain a three-channel spectrogram I through short-time Fourier transform;
the processing module 602: configured to respectively transpose, pad, or truncate the same set of three-channel spectrograms I simultaneously to obtain two sets of three-channel spectrograms II;
the recognition module 603: configured to input the two sets of three-channel spectrograms II into an improved LeViT neural network model, recognize the voiceprint of the audio information using the improved LeViT neural network model, and output voiceprint recognition information of the audio information, wherein the voiceprint recognition information at least comprises the speaker corresponding to the audio information.
The embodiment of the application provides a multi-channel voiceprint recognition apparatus 600 based on the transform framework that can: perform sound source decomposition on the audio information to be identified and obtain a three-channel spectrogram I through short-time Fourier transform; respectively transpose, pad, or truncate the same set of three-channel spectrograms I simultaneously to obtain two sets of three-channel spectrograms II; and input the two sets of three-channel spectrograms II into an improved LeViT neural network model, which recognizes the voiceprint of the audio information and outputs voiceprint recognition information that at least comprises the speaker corresponding to the audio information. By the method provided by the invention, processing the audio information into audio signals of multiple channels yields higher recognition accuracy; and the improved LeViT neural network model identifies the audio signal more accurately, greatly improving accuracy at the cost of increased model complexity.
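A minimal sketch of the preprocessing path described above, using numpy only. Sound-source decomposition is elided (three pre-decomposed signals are assumed as input), and the FFT size, hop size, and fixed target frame count are illustrative assumptions, not values specified by the patent:

```python
import numpy as np

def stft_magnitude(signal, n_fft=256, hop=128):
    """Magnitude spectrogram via a simple Hann-windowed STFT."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    # rfft gives n_fft // 2 + 1 frequency bins per frame.
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=1))
    return spec.T  # shape: (freq_bins, time_frames)

def pad_or_truncate(spec, target_frames):
    """Fix the time axis to target_frames by zero-padding or truncating."""
    freq, frames = spec.shape
    if frames >= target_frames:
        return spec[:, :target_frames]
    out = np.zeros((freq, target_frames))
    out[:, :frames] = spec
    return out

def make_inputs(sources, target_frames=64):
    """Stack three decomposed sources into a 3-channel spectrogram I,
    then produce two variants of spectrogram II: one padded/truncated,
    and one additionally transposed on its (freq, time) axes."""
    chans = np.stack([pad_or_truncate(stft_magnitude(s), target_frames)
                      for s in sources])           # (3, freq, time)
    transposed = np.transpose(chans, (0, 2, 1))    # (3, time, freq)
    return chans, transposed

rng = np.random.default_rng(0)
sources = [rng.standard_normal(16000) for _ in range(3)]  # 3 decomposed signals
spec_a, spec_b = make_inputs(sources)
```

The two resulting arrays play the role of the two sets of three-channel spectrograms II that are fed to the network; a real implementation would use the decomposed sources of actual audio rather than random signals.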
Further, the present application also provides a voiceprint recognition terminal, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the above multi-channel voiceprint recognition method based on the transform framework.
Further, the present application also provides a storage medium having a computer program stored thereon, where the computer program, when executed by a processor, implements the steps in the transform framework-based multi-channel voiceprint recognition method as described above.
Each functional module in the embodiments of the present invention may be integrated into one processing module, each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as a separate product, it may be stored in a computer-readable storage medium.
Based on such understanding, the technical solution of the present invention in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that, for simplicity, the above method embodiments are described as a series of acts, but those skilled in the art will understand that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and the acts and modules involved are not necessarily all required by the invention. The descriptions of the respective embodiments above each have their own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
For those skilled in the art, there may be variations in specific implementation and application scope according to the idea of the embodiments of the present application; in summary, the content of this specification should not be construed as limiting the present invention.
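The contraction (shrink) self-attention used in the improved LeViT model, in which the length and width of the query Q are halved so that the stage downsamples its output, can be sketched roughly as follows. The subsampling scheme, matrix sizes, and all parameter names here are illustrative assumptions, not the patent's exact construction:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shrink_attention(x, wq, wk, wv):
    """Self-attention where Q is computed on a 2x-subsampled spatial grid,
    halving the output's length and width while K and V use the full grid."""
    h, w, c = x.shape
    q_in = x[::2, ::2, :]                 # halve length and width for Q
    q = q_in.reshape(-1, c) @ wq          # (h/2 * w/2, d)
    k = x.reshape(-1, c) @ wk             # (h * w, d)
    v = x.reshape(-1, c) @ wv             # (h * w, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]))  # scaled dot-product
    out = attn @ v                        # (h/2 * w/2, d)
    return out.reshape(h // 2, w // 2, -1)

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8, 16))       # (length, width, channels)
wq = rng.standard_normal((16, 32))
wk = rng.standard_normal((16, 32))
wv = rng.standard_normal((16, 32))
y = shrink_attention(x, wq, wk, wv)       # spatial size drops from 8x8 to 4x4
```

Because only Q is subsampled, every output position still attends over the full input grid, which is what lets the stage reduce resolution without discarding context.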
Claims (10)
1. A multichannel voiceprint recognition method based on a transform framework is characterized by comprising the following steps:
carrying out sound source decomposition on audio information to be identified, and obtaining a three-channel spectrogram I through short-time Fourier transform;
transposing, padding, or truncating the same set of three-channel spectrograms I respectively and simultaneously to obtain two sets of three-channel spectrograms II;
inputting the two sets of three-channel spectrograms II into an improved LeViT neural network model, and performing recognition processing on the voiceprint of the audio information by using the improved LeViT neural network model to output voiceprint recognition information of the audio information, wherein the voiceprint recognition information at least comprises the speaker corresponding to the audio information.
2. The method according to claim 1, wherein the improved LeViT neural network model comprises a feature extraction module, a self-attention-multilayer perceptron module and a contraction self-attention module, and the inputting the two sets of three-channel spectrograms II into the improved LeViT neural network model and using the improved LeViT neural network model to perform recognition processing on the voiceprint of the audio information specifically comprises:
extracting features of the two sets of three-channel spectrograms II through the feature extraction module;
sequentially processing the two sets of three-channel spectrograms II after feature extraction through a first stage, a second stage and a third stage; wherein the first stage and the second stage each sequentially comprise the self-attention-multilayer perceptron module, the contraction self-attention module and a multilayer perceptron module, and the third stage sequentially comprises the self-attention-multilayer perceptron module and an average pooling layer module.
3. The method of claim 1, wherein the transposing, padding, or truncating of the same set of three-channel spectrograms I simultaneously to obtain two sets of three-channel spectrograms II further comprises:
filtering the three-channel spectrogram I through a Mel filter.
4. The method of claim 2, wherein the processing of the three-channel spectrogram II by the self-attention module specifically comprises:
performing linear transformation on the two sets of three-channel spectrograms II after feature extraction, and then calculating a first self-attention;
transposing the first self-attention and changing its dimensionality, inputting the result into an activation function for calculation, and obtaining a first tensor dimension through a first linear layer.
5. The method according to claim 4, wherein performing the linear transformation on the two sets of three-channel spectrograms II after feature extraction and then calculating the first self-attention specifically comprises:
performing linear transformation on the two sets of three-channel spectrograms II after feature extraction to obtain parameters of the first self-attention, the parameters comprising at least Q, K and V, wherein Q is the query parameter, K is the key parameter, and V is the queried value parameter;
calculating the first self-attention from the parameters Q, K and V.
6. The method of claim 3, wherein the processing of the three-channel spectrogram II by the contraction self-attention module specifically comprises:
sampling the two sets of three-channel spectrograms II processed by the self-attention-multilayer perceptron module to obtain second self-attention parameters Q, K and V;
halving the length and the width of the parameter Q, and then calculating the second self-attention;
transposing the second self-attention and changing its dimensionality, inputting the result into an activation function for calculation, and obtaining a second tensor dimension through a second linear layer.
7. The method according to claim 2, wherein after the two sets of three-channel spectrograms II after feature extraction are processed sequentially through the first stage, the second stage and the third stage, the method further comprises:
inputting the two sets of three-channel spectrograms II processed by the third stage into a softmax classifier to obtain the speaker corresponding to the audio information.
8. A multi-channel voiceprint recognition device based on a transform framework is characterized by comprising:
a transformation module: configured to perform sound source decomposition on the audio information to be identified and obtain a three-channel spectrogram I through short-time Fourier transform;
a processing module: configured to respectively transpose, pad, or truncate the same set of three-channel spectrograms I simultaneously to obtain two sets of three-channel spectrograms II;
an identification module: configured to input the two sets of three-channel spectrograms II into an improved LeViT neural network model, recognize the voiceprint of the audio information using the improved LeViT neural network model, and output voiceprint recognition information of the audio information, wherein the voiceprint recognition information at least comprises the speaker corresponding to the audio information.
9. A voiceprint recognition terminal comprising a memory, a processor and a computer program stored in said memory and executable on said processor, wherein said processor, when executing said computer program, implements the steps of the transform framework based multi-channel voiceprint recognition method according to any one of claims 1 to 7.
10. A storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, performs the steps of the transform framework based multi-channel voiceprint recognition method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111682904.3A CN114446308A (en) | 2021-12-31 | 2021-12-31 | Multi-channel voiceprint recognition method, device and equipment based on transform framework |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111682904.3A CN114446308A (en) | 2021-12-31 | 2021-12-31 | Multi-channel voiceprint recognition method, device and equipment based on transform framework |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114446308A true CN114446308A (en) | 2022-05-06 |
Family
ID=81365402
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111682904.3A Pending CN114446308A (en) | 2021-12-31 | 2021-12-31 | Multi-channel voiceprint recognition method, device and equipment based on transform framework |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114446308A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116612783A (en) * | 2023-07-17 | 2023-08-18 | 联想新视界(北京)科技有限公司 | Voice recognition method and device, electronic equipment and storage medium |
CN116612783B (en) * | 2023-07-17 | 2023-10-27 | 联想新视界(北京)科技有限公司 | Voice recognition method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||