CN115035901A - Voiceprint recognition method based on neural network and related device - Google Patents
- Publication number
- CN115035901A CN115035901A CN202210635522.3A CN202210635522A CN115035901A CN 115035901 A CN115035901 A CN 115035901A CN 202210635522 A CN202210635522 A CN 202210635522A CN 115035901 A CN115035901 A CN 115035901A
- Authority
- CN
- China
- Prior art keywords
- voiceprint
- semi
- orthogonal
- network model
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L17/00 — Speaker identification or verification techniques
- G10L17/02 — Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/04 — Training, enrolment or model building
- G10L17/18 — Artificial neural networks; connectionist approaches
- G10L17/20 — Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
Abstract
The application discloses a voiceprint recognition method based on a neural network and a related device. The method includes: constructing a semi-orthogonal decomposition neural network model from a plurality of semi-orthogonal convolution blocks, where each semi-orthogonal convolution block contains a plurality of semi-orthogonal one-dimensional convolution layers connected in series and linked by inner-hop and outer-hop connection structures; performing voiceprint recognition training on the semi-orthogonal decomposition neural network model according to a preset MFCC training set corresponding to training voiceprint information to obtain a target recognition network model; and recognizing a target voiceprint with the target recognition network model to obtain a voiceprint recognition result. The semi-orthogonal one-dimensional convolution layers decompose the original parameter matrices in the network, compressing the redundant parameter representation space and reducing the time-delay span while filtering noise interference. The method and device address the technical problems that existing voiceprint recognition technology has poor noise immunity, limited time-delay modeling capability, and recognition results lacking accuracy and reliability.
Description
Technical Field
The present application relates to the field of voiceprint recognition technologies, and in particular, to a voiceprint recognition method and a related apparatus based on a neural network.
Background
A voiceprint recognition system automatically identifies a speaker from the characteristics of the human voice. Voiceprint recognition is one of the biometric verification technologies: it verifies a speaker's identity through speech. The technology offers good convenience, stability, measurability and security, and is widely used in banking, social security, public security, smart homes, mobile payment and other fields.
Existing voiceprint recognition methods are limited by the noise contained in voiceprint information, so their recognition results lack accuracy and reliability; moreover, the time-delay modeling capability of neural-network-based voiceprint recognition is limited, so the practical recognition effect is poor and cannot meet high-standard application requirements.
Disclosure of Invention
The application provides a voiceprint recognition method based on a neural network and a related device, which are used for solving the technical problems that the anti-noise capability of the existing voiceprint recognition technology is poor, the time delay modeling capability is limited, and the recognition result lacks accuracy and reliability.
In view of the above, a first aspect of the present application provides a voiceprint recognition method based on a neural network, including:
constructing a semi-orthogonal decomposition neural network model based on a plurality of semi-orthogonal convolution blocks, wherein each semi-orthogonal convolution block comprises a plurality of semi-orthogonal one-dimensional convolution layers, and the semi-orthogonal one-dimensional convolution layers are connected in series, and are connected with each other through an inner hop connecting structure and an outer hop connecting structure;
performing voiceprint recognition training on the semi-orthogonal decomposition neural network model according to a preset MFCC training set corresponding to training voiceprint information to obtain a target recognition network model;
and identifying the target voiceprint by adopting the target identification network model to obtain a voiceprint identification result.
Preferably, before performing voiceprint recognition training on the semi-orthogonal decomposition neural network model according to the preset MFCC training set corresponding to the training voiceprint information to obtain the target recognition network model, the method further includes:
preprocessing the training voiceprint information to obtain an audio frame to be processed, wherein the preprocessing operation comprises weighting, framing and windowing;
based on a Fourier transform algorithm, calculating the audio frame by adopting a Mel filter to obtain MFCC characteristics;
and constructing a preset MFCC training set according to the MFCC characteristics.
Preferably, the performing voiceprint recognition training on the semi-orthogonal decomposition neural network model according to a preset MFCC training set corresponding to training voiceprint information to obtain a target recognition network model further includes:
constructing a semi-orthogonal decomposition feature extractor based on a plurality of semi-orthogonal convolution blocks;
performing voiceprint feature extraction training on the semi-orthogonal decomposition feature extractor according to a preset MFCC training set corresponding to training voiceprint information to obtain a target voiceprint feature extractor;
and in the voiceprint information registration process, performing feature extraction on the newly added voiceprint through the target voiceprint feature extractor, and storing the extracted voiceprint features in a database.
Preferably, after performing voiceprint recognition training on the semi-orthogonal decomposition neural network model according to the preset MFCC training set corresponding to the training voiceprint information to obtain the target recognition network model, the method further includes:
performing voiceprint recognition test on the target recognition network model by adopting a preset MFCC test set corresponding to the test voiceprint information to obtain a test result;
screening the target recognition network model according to the test result to obtain an optimized recognition network model;
correspondingly, the identifying the target voiceprint by using the target identification network model to obtain the voiceprint identification result includes:
and identifying the target voiceprint by adopting the optimized identification network model to obtain a voiceprint identification result.
A second aspect of the present application provides a voiceprint recognition apparatus based on a neural network, including:
the model building module is used for building a semi-orthogonal decomposition neural network model based on a plurality of semi-orthogonal convolution blocks, each semi-orthogonal convolution block comprises a plurality of semi-orthogonal one-dimensional convolution layers, and the semi-orthogonal one-dimensional convolution layers are connected in series, and are connected through an inner jump connection structure and an outer jump connection structure;
the model training module is used for carrying out voiceprint recognition training on the semi-orthogonal decomposition neural network model according to a preset MFCC training set corresponding to training voiceprint information to obtain a target recognition network model;
and the voiceprint recognition module is used for recognizing the target voiceprint by adopting the target recognition network model to obtain a voiceprint recognition result.
Preferably, the apparatus further includes:
the preprocessing module is used for preprocessing the training voiceprint information to obtain an audio frame to be processed, and the preprocessing operation comprises weighting, framing and windowing;
the feature extraction module is used for calculating the audio frame by adopting a Mel filter based on a Fourier transform algorithm to obtain MFCC features;
and the training set constructing module is used for constructing a preset MFCC training set according to the MFCC characteristics.
Preferably, the apparatus further includes:
an extractor construction module for constructing a semi-orthogonal decomposition feature extractor based on a plurality of semi-orthogonal convolution blocks;
the extractor training module is used for carrying out voiceprint feature extraction training on the semi-orthogonal decomposition feature extractor according to a preset MFCC training set corresponding to training voiceprint information to obtain a target voiceprint feature extractor;
and the extractor using module is used for extracting the characteristics of the newly added voiceprints through the target voiceprint characteristic extractor in the voiceprint information registration process and storing the extracted voiceprint characteristics in a database.
Preferably, the apparatus further includes:
the test module is used for carrying out voiceprint recognition test on the target recognition network model by adopting a preset MFCC test set corresponding to the test voiceprint information to obtain a test result;
the optimization module is used for screening the target recognition network model according to the test result to obtain an optimized recognition network model;
correspondingly, the voiceprint recognition module is specifically configured to:
and identifying the target voiceprint by adopting the optimized identification network model to obtain a voiceprint identification result.
A third aspect of the present application provides a voiceprint recognition device based on a neural network, the device comprising a processor and a memory;
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the neural network based voiceprint recognition method of the first aspect according to instructions in the program code.
A fourth aspect of the present application provides a computer-readable storage medium for storing program code for executing the neural network-based voiceprint recognition method of the first aspect.
According to the technical scheme, the embodiment of the application has the following advantages:
In the present application, a voiceprint recognition method based on a neural network is provided, including: constructing a semi-orthogonal decomposition neural network model from a plurality of semi-orthogonal convolution blocks, where each semi-orthogonal convolution block contains a plurality of semi-orthogonal one-dimensional convolution layers connected in series and linked by inner-hop and outer-hop connection structures; performing voiceprint recognition training on the semi-orthogonal decomposition neural network model according to a preset MFCC training set corresponding to training voiceprint information to obtain a target recognition network model; and recognizing the target voiceprint with the target recognition network model to obtain a voiceprint recognition result.
In the neural-network-based voiceprint recognition method of the present application, the convolution layers are connected by skip-connection structures while the semi-orthogonal decomposition neural network model is constructed, and shallow-layer voiceprint feature information is transmitted directly to the deep convolution layers, so that the deep network obtains richer voiceprint information and the noise immunity of the network is improved. The semi-orthogonal one-dimensional convolution layers decompose the original parameter matrices in the network, compressing the redundant parameter representation space and reducing the time-delay span while filtering noise interference, thereby achieving long-time-delay learning. The present application therefore solves the technical problems that existing voiceprint recognition technology has poor noise immunity, limited time-delay modeling capability, and recognition results lacking accuracy and reliability.
Drawings
Fig. 1 is a schematic flowchart of a voiceprint recognition method based on a neural network according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a voiceprint recognition apparatus based on a neural network according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a semi-orthogonal decomposition neural network model provided in an embodiment of the present application;
fig. 4 is a schematic network structure diagram of a semi-orthogonal convolution block according to an embodiment of the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
For easy understanding, please refer to fig. 1; the present application provides an embodiment of a voiceprint recognition method based on a neural network, including:
Step 101: constructing a semi-orthogonal decomposition neural network model based on a plurality of semi-orthogonal convolution blocks.
It should be noted that the current mainstream deep neural network (DNN) approach to voiceprint recognition is based on the time-delay neural network (TDNN), whose penultimate hidden layer is taken as the voiceprint feature output, known as the x-vector. The TDNN is built mainly from stacked one-dimensional convolutional neural network (CNN) components. Although the one-dimensional convolution component can describe image or speech characteristics at multiple scales, outperforming ordinary fully-connected layers, the TDNN's recognition accuracy degrades in strong-noise environments and its noise immunity is insufficient. In addition, the TDNN's time-delay modeling capability is limited: it can learn effectively only within a short, steady time range.
Therefore, in this embodiment, the weight matrices of the neural network are decomposed using semi-orthogonal convolution layers, greatly reducing the parameter count of the original one-dimensional convolution weight layers. Under supervised learning of speaker labels, the semi-orthogonal decomposition neural network extracts the important speaker voiceprint information from each factorization using a finite number of parameters and filters out irrelevant noise information, providing noise immunity. Note that the time-delay modeling range of a semi-orthogonal convolution layer is limited: if too large a span is set, sampling leakage occurs and information filtering deteriorates.
Referring to fig. 3, the semi-orthogonal decomposition neural network model in this example includes a plurality of distinct semi-orthogonal convolution blocks, with adjacent blocks connected in series; in addition, from the second semi-orthogonal convolution block onward, each block splices in the outputs of one, two or more outer-hop or inner-hop connection structures. It should be noted that the outer-hop connection structure itself contains network components such as a semi-orthogonal one-dimensional convolution layer and an activation function, and its output is transmitted to the second and deeper semi-orthogonal convolution blocks of the network model. The inner-hop connection structure starts from the second semi-orthogonal convolution block and transmits shallow-layer feature information to the deeper semi-orthogonal convolution blocks. As the example in fig. 3 shows, the inputs of the second semi-orthogonal convolution block include a serial input and an outer-hop input; the inputs of the third semi-orthogonal convolution block include a serial input, an outer-hop connection input and an inner-hop connection input, and so on. The number of semi-orthogonal convolution blocks in the network model may be chosen according to actual needs and is not limited here, as long as it follows the block-connection scheme of this embodiment.
Each semi-orthogonal convolution block includes at least two semi-orthogonal one-dimensional convolution layers, making it a multi-segment block. Referring to fig. 4, the arc arrows in fig. 4 show the data flow of the inner-hop and outer-hop connection structures. Besides the semi-orthogonal one-dimensional convolution layers, the block contains an outer-hop splicing layer, an activation function, a normalization layer and an output layer; the output layer fuses the input information of the inner-hop connection structure with the output information of the normalization layer. The outer-hop splicing layer can receive the outputs of outer-hop connection structures fed in from other semi-orthogonal convolution blocks and combine them, by splicing, with the other information received by the layer. It will be appreciated that the outer-hop splicing layer may receive the outputs of several distinct outer-hop connection structures.
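The block-level wiring described above can be sketched in a few lines. This is an illustrative toy, not the patent's implementation: `block`, `outer_hop` and `fuse` are hypothetical stand-ins for the real convolution block, the outer-hop branch (convolution plus activation) and the splicing operation.

```python
def block(x):
    """Stand-in for a semi-orthogonal convolution block (placeholder transform)."""
    return [v + 1.0 for v in x]

def outer_hop(x):
    """Stand-in for the outer-hop branch (semi-orthogonal conv + activation)."""
    return [v * 0.5 for v in x]

def fuse(*inputs):
    """Combine the serial input with skip inputs (element-wise sum here)."""
    return [sum(vals) for vals in zip(*inputs)]

x0 = [1.0, 2.0]                          # model input features
b1 = block(x0)                           # first block: serial input only
b2 = block(fuse(b1, outer_hop(x0)))      # second block: serial + outer-hop
b3 = block(fuse(b2, outer_hop(x0), b1))  # third block: serial + outer-hop + inner-hop (shallow b1)
```

The point of the sketch is only the connectivity: each deeper block sees the serial output of its predecessor plus skip inputs carrying shallow-layer information.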
It should be noted that the semi-orthogonal one-dimensional convolution layer decomposes the input parameter matrix A[a, a], constraining the decomposed parameter matrix M[a, b] to be semi-orthogonal so that the effective voiceprint information of the output matrix B[b, b] is retained, where a and b are the matrix dimensions. The constrained decomposition is:
A = MB, subject to M^T M = αI
wherein α is a floating-point coefficient, 1 by default, and I is an identity matrix.
When the constraint converges, there will be:
M^T A = M^T MB ≈ αB
i.e. the output matrix B is approximately recovered as M^T A (for the default α = 1).
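The decomposition and its convergence property can be checked numerically. Below is a minimal pure-Python sketch assuming α = 1 and a toy column-orthonormal M; the shapes and values are illustrative only, not taken from the patent.

```python
def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

M = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0],
     [0.0, 0.0]]   # M[a=4, b=2], columns orthonormal: M^T M = I (α = 1)

B = [[2.0, 3.0],
     [4.0, 5.0]]   # B[b, b]

A = matmul(M, B)                 # A = MB
B_rec = matmul(transpose(M), A)  # B ≈ M^T A once the constraint holds
```

With the semi-orthogonal constraint satisfied exactly, `B_rec` equals `B`; during training the constraint only holds approximately, hence the ≈ in the text.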
A convolution operation is then performed on the decomposed matrix B (or on the spliced matrix) with the one-dimensional convolution parameter matrix N[b, a], and multiple convolution kernels learn information at different scales to generate the matrix P[a, a]. The matrix P differs from the output R[a, a] of an ordinary one-dimensional convolution TDNN; assuming the parameter matrix of the ordinary one-dimensional convolution TDNN is W[a, a], the two differ as follows:
ordinary one-dimensional convolution: A · W → R
semi-orthogonal decomposition one-dimensional convolution: A · MN → P
When the bottleneck dimension b of the parameter matrix M is a/4 or less, the total parameter count of M and N is at most half the parameter count of W. The multiple audio recordings of each training speaker differ in their noise factors, and under supervised learning of speaker labels the neural network learns the speaker commonalities across the various noises. By decomposing the original matrix, the semi-orthogonal one-dimensional convolution compresses the redundant parameter representation space, refining the speaker information while filtering noise interference; the modeled noise information is:
ε = W − MN.
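The parameter-count claim above (bottleneck b ≤ a/4 gives at most half the parameters of W) can be verified with simple arithmetic; the layer width a = 512 below is an assumed example value, not from the patent.

```python
def params_ordinary(a):
    """Parameter count of the ordinary one-dimensional convolution weight W[a, a]."""
    return a * a

def params_factored(a, b):
    """Parameter count of the factored pair M[a, b] and N[b, a], i.e. 2ab."""
    return a * b + b * a

a = 512       # assumed layer width, for illustration
b = a // 4    # bottleneck dimension at the a/4 limit
```

At b = a/4, 2ab = a²/2, exactly half of W's a² parameters; any smaller b reduces the count further.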
The activation function and normalization layer apply nonlinear activation and re-normalization to the matrix P output by the semi-orthogonal one-dimensional convolution layer, yielding the hidden-layer information Q. Finally, the output layer combines the activated and normalized matrix Q with the skip-connection input matrix A by an integration operation such as addition or splicing; this embodiment uses weighted addition with a default weight of 0.66, producing the output matrix O.
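The output-layer fusion can be sketched as follows. The patent states only a weighted addition with a default weight of 0.66; the exact split O = w·Q + (1 − w)·A below is an assumption made for illustration.

```python
W_DEFAULT = 0.66  # default fusion weight stated in the embodiment

def output_layer(Q, A, w=W_DEFAULT):
    """Weighted addition of activated/normalized features Q and skip input A.
    The (1 - w) share given to A is an assumption, not stated in the patent."""
    return [w * q + (1.0 - w) * a for q, a in zip(Q, A)]

O = output_layer([1.0, 0.0], [0.0, 1.0])
```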
Step 102: performing voiceprint recognition training on the semi-orthogonal decomposition neural network model according to a preset MFCC training set corresponding to the training voiceprint information to obtain a target recognition network model.
The training data of the semi-orthogonal decomposition neural network model are acoustic features, namely Mel-frequency cepstral coefficients (MFCC). For a labelled training set, the cross-entropy loss between the model output and the labels can be computed to optimize training and obtain the target recognition network model.
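The cross-entropy loss mentioned above can be written out in a short stdlib sketch; the logits and the speaker label are illustrative values, not patent data.

```python
import math

def softmax(logits):
    m = max(logits)                           # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the true speaker label."""
    return -math.log(softmax(logits)[label])

loss = cross_entropy([2.0, 0.5, 0.1], label=0)  # toy logits for 3 speakers
```

Minimizing this loss over the labelled MFCC training set is what drives the factorized layers to keep speaker-discriminative information.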
Further, step 102, before, further comprising:
preprocessing the training voiceprint information to obtain an audio frame to be processed, wherein the preprocessing operation comprises weighting, framing and windowing;
based on a Fourier transform algorithm, an audio frame is calculated by adopting a Mel filter to obtain MFCC characteristics;
and constructing a preset MFCC training set according to the MFCC characteristics.
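The three preprocessing steps listed above (weighting, framing, windowing) can be sketched as follows. The pre-emphasis coefficient, frame length and hop size are typical values assumed for illustration, not taken from the patent.

```python
import math

def pre_emphasis(signal, coeff=0.97):
    """Weighting step: boost high frequencies (coefficient is an assumed typical value)."""
    return [signal[0]] + [signal[i] - coeff * signal[i - 1]
                          for i in range(1, len(signal))]

def frame_signal(signal, frame_len=400, hop=160):
    """Framing step: 25 ms frames with 10 ms hop at 16 kHz (assumed sizes)."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]

def hamming(n):
    """Windowing step: Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

signal = [math.sin(0.01 * i) for i in range(1600)]  # 0.1 s of toy audio at 16 kHz
frames = frame_signal(pre_emphasis(signal))
windowed = [[s * w for s, w in zip(f, hamming(len(f)))] for f in frames]
```

The windowed frames would then be passed through the Fourier transform and Mel filter bank described in the text to produce the MFCC features.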
Further, step 102, further includes:
constructing a semi-orthogonal decomposition feature extractor based on a plurality of semi-orthogonal convolution blocks;
performing voiceprint feature extraction training on the semi-orthogonal decomposition feature extractor according to a preset MFCC training set corresponding to training voiceprint information to obtain a target voiceprint feature extractor;
and in the voiceprint information registration process, extracting the characteristics of the newly added voiceprints through a target voiceprint characteristic extractor, and storing the extracted voiceprint characteristics in a database.
The feature extractor has the same network structure as the semi-orthogonal decomposition neural network model; the essential difference is that voiceprint features are output at the first layer after the final pooling layer of the model, instead of the recognition result of the final fully-connected layer. The training process of the target voiceprint feature extractor is the same as that of the recognition model, and the feature extractor can be used both in the initial voice information registration process and in the acquisition of a verification set.
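The relationship between the recognition model and the feature extractor can be sketched with placeholder layers: the extractor runs the same stack but stops before the final fully-connected layer. All layer bodies below are hypothetical stand-ins, not the patent's layers.

```python
def conv_blocks(x):
    """Stand-in for the stacked semi-orthogonal convolution blocks."""
    return [v + 1.0 for v in x]

def pooling(x):
    """Stand-in for the final (statistics) pooling layer."""
    return [sum(x) / len(x)]

def fully_connected(x):
    """Stand-in for the final fully-connected classification layer."""
    return [v * 2.0 for v in x]

def recognize(x):
    """Full recognition model: blocks -> pooling -> classification output."""
    return fully_connected(pooling(conv_blocks(x)))

def extract_embedding(x):
    """Feature extractor: same stack, output taken before the final FC layer."""
    return pooling(conv_blocks(x))

emb = extract_embedding([1.0, 2.0, 3.0])  # this vector is what gets stored at registration
```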
Further, step 102, thereafter, further includes:
performing voiceprint recognition test on the target recognition network model by adopting a preset MFCC test set corresponding to the test voiceprint information to obtain a test result;
screening the target recognition network model according to the test result to obtain an optimized recognition network model;
correspondingly, step 103 includes:
and identifying the target voiceprint by adopting an optimized identification network model to obtain a voiceprint identification result.
Step 103: identifying the target voiceprint by adopting the target recognition network model to obtain a voiceprint recognition result.
In the neural-network-based voiceprint recognition method above, the convolution layers are connected by skip-connection structures while the semi-orthogonal decomposition neural network model is constructed, and shallow-layer voiceprint feature information is transmitted directly to the deep convolution layers, so that the deep network obtains richer voiceprint information and the noise immunity of the network is improved. The semi-orthogonal one-dimensional convolution layers decompose the original parameter matrices in the network, compressing the redundant parameter representation space and reducing the time-delay span while filtering noise interference, thereby achieving long-time-delay learning. The method can therefore solve the technical problems that existing voiceprint recognition technology has poor noise immunity, limited time-delay modeling capability, and recognition results lacking accuracy and reliability.
For ease of understanding, referring to fig. 2, the present application provides an embodiment of a neural network-based voiceprint recognition apparatus, comprising:
the model building module 201 is used for building a semi-orthogonal decomposition neural network model based on a plurality of semi-orthogonal convolution blocks, each semi-orthogonal convolution block comprises a plurality of semi-orthogonal one-dimensional convolution layers, and the semi-orthogonal one-dimensional convolution layers are connected in a series mode through an inner jump connection structure and an outer jump connection structure;
the model training module 202 is used for performing voiceprint recognition training on the semi-orthogonal decomposition neural network model according to a preset MFCC training set corresponding to training voiceprint information to obtain a target recognition network model;
and the voiceprint recognition module 203 is configured to recognize the target voiceprint by using the target recognition network model to obtain a voiceprint recognition result.
Further, still include:
the preprocessing module 204 is configured to perform preprocessing operations on the training voiceprint information to obtain an audio frame to be processed, where the preprocessing operations include weighting, framing, and windowing;
a feature extraction module 205, configured to calculate an audio frame by using a mel filter based on a fourier transform algorithm to obtain MFCC features;
a training set constructing module 206, configured to construct a preset MFCC training set according to the MFCC characteristics.
Further, still include:
an extractor construction module 207 for constructing a semi-orthogonal decomposition feature extractor based on the plurality of semi-orthogonal convolution blocks;
an extractor training module 208, configured to perform voiceprint feature extraction training on the semi-orthogonal decomposition feature extractor according to a preset MFCC training set corresponding to training voiceprint information, to obtain a target voiceprint feature extractor;
and the extractor using module 209 is used for performing feature extraction on the newly added voiceprint through the target voiceprint feature extractor in the process of registering the voiceprint information, and storing the extracted voiceprint features in the database.
Optionally, the apparatus further includes:
the test module 210 is configured to perform a voiceprint recognition test on the target recognition network model by using a preset MFCC test set corresponding to the test voiceprint information to obtain a test result;
the optimization module 211 is configured to screen the target recognition network model according to the test result to obtain an optimized recognition network model;
correspondingly, the voiceprint recognition module 203 is specifically configured to:
recognize the target voiceprint using the optimized recognition network model to obtain the voiceprint recognition result.
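The test-and-screen loop of modules 210 and 211 amounts to evaluating each trained candidate on a held-out MFCC test set and keeping the best performer. A minimal, dependency-free sketch follows; the screening criterion here is plain accuracy, which is an assumption, since the patent does not fix the metric.

```python
def screen_models(candidates, test_set):
    """Pick the best candidate by held-out accuracy.

    candidates: {name: predict_fn}; test_set: list of (features, label).
    """
    def accuracy(predict):
        return sum(predict(x) == y for x, y in test_set) / len(test_set)
    scores = {name: accuracy(fn) for name, fn in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores

# toy stand-ins for two trained recognition models
test_set = [(1, 1), (2, 0), (3, 1), (4, 0)]
candidates = {
    "always_zero": lambda x: 0,   # 0.5 accuracy on this set
    "parity": lambda x: x % 2,    # matches every label here
}
best, scores = screen_models(candidates, test_set)
print(best)  # parity
```

In practice the candidates would be checkpoints or hyperparameter variants of the target recognition network, and the winner becomes the "optimized recognition network model" used by module 203.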
The application also provides a voiceprint recognition device based on the neural network, and the device comprises a processor and a memory;
the memory is used for storing the program codes and transmitting the program codes to the processor;
the processor is used for executing the neural network-based voiceprint recognition method in the above method embodiment according to the instructions in the program code.
The present application also provides a computer-readable storage medium for storing program code for executing the neural network-based voiceprint recognition method in the above-described method embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
Claims (10)
1. A voiceprint recognition method based on a neural network is characterized by comprising the following steps:
constructing a semi-orthogonal decomposition neural network model based on a plurality of semi-orthogonal convolution blocks, wherein each semi-orthogonal convolution block comprises a plurality of semi-orthogonal one-dimensional convolution layers, and the semi-orthogonal one-dimensional convolution layers are connected in series through inner and outer skip-connection structures;
performing voiceprint recognition training on the semi-orthogonal decomposition neural network model according to a preset MFCC training set corresponding to training voiceprint information to obtain a target recognition network model;
and identifying the target voiceprint by adopting the target identification network model to obtain a voiceprint identification result.
2. The neural network-based voiceprint recognition method according to claim 1, wherein before the performing of voiceprint recognition training on the semi-orthogonal decomposition neural network model according to the preset MFCC training set corresponding to the training voiceprint information to obtain the target recognition network model, the method further comprises:
preprocessing the training voiceprint information to obtain audio frames to be processed, wherein the preprocessing operations comprise pre-emphasis, framing, and windowing;
based on a Fourier transform algorithm, calculating the audio frame by adopting a Mel filter to obtain MFCC characteristics;
and constructing a preset MFCC training set according to the MFCC characteristics.
3. The neural network-based voiceprint recognition method according to claim 1, wherein the performing of voiceprint recognition training on the semi-orthogonal decomposition neural network model according to the preset MFCC training set corresponding to the training voiceprint information to obtain the target recognition network model further comprises:
constructing a semi-orthogonal decomposition feature extractor based on a plurality of semi-orthogonal convolution blocks;
performing voiceprint feature extraction training on the semi-orthogonal decomposition feature extractor according to a preset MFCC training set corresponding to training voiceprint information to obtain a target voiceprint feature extractor;
and in the voiceprint information registration process, performing feature extraction on the newly added voiceprint through the target voiceprint feature extractor, and storing the extracted voiceprint features in a database.
4. The neural network-based voiceprint recognition method according to claim 1, wherein after the performing of voiceprint recognition training on the semi-orthogonal decomposition neural network model according to the preset MFCC training set corresponding to the training voiceprint information to obtain the target recognition network model, the method further comprises:
performing voiceprint recognition test on the target recognition network model by adopting a preset MFCC test set corresponding to the test voiceprint information to obtain a test result;
screening the target recognition network model according to the test result to obtain an optimized recognition network model;
correspondingly, the identifying the target voiceprint by using the target identification network model to obtain the voiceprint identification result includes:
and identifying the target voiceprint by adopting the optimized identification network model to obtain a voiceprint identification result.
5. A voiceprint recognition apparatus based on a neural network, comprising:
the model building module is used for building a semi-orthogonal decomposition neural network model based on a plurality of semi-orthogonal convolution blocks, wherein each semi-orthogonal convolution block comprises a plurality of semi-orthogonal one-dimensional convolution layers connected in series through inner and outer skip-connection structures;
the model training module is used for carrying out voiceprint recognition training on the semi-orthogonal decomposition neural network model according to a preset MFCC training set corresponding to training voiceprint information to obtain a target recognition network model;
and the voiceprint recognition module is used for recognizing the target voiceprint by adopting the target recognition network model to obtain a voiceprint recognition result.
6. The neural network-based voiceprint recognition apparatus according to claim 5, further comprising:
the preprocessing module is used for preprocessing the training voiceprint information to obtain audio frames to be processed, wherein the preprocessing operations comprise pre-emphasis, framing, and windowing;
the feature extraction module is used for calculating the audio frame by adopting a Mel filter based on a Fourier transform algorithm to obtain MFCC features;
and the training set constructing module is used for constructing a preset MFCC training set according to the MFCC characteristics.
7. The neural network-based voiceprint recognition apparatus according to claim 5, further comprising:
an extractor construction module for constructing a semi-orthogonal decomposition feature extractor based on the plurality of semi-orthogonal convolution blocks;
the extractor training module is used for carrying out voiceprint feature extraction training on the semi-orthogonal decomposition feature extractor according to a preset MFCC training set corresponding to training voiceprint information to obtain a target voiceprint feature extractor;
and the extractor using module is used for extracting the characteristics of the newly added voiceprints through the target voiceprint characteristic extractor in the voiceprint information registration process and storing the extracted voiceprint characteristics in a database.
8. The neural network-based voiceprint recognition apparatus according to claim 5, further comprising:
the test module is used for carrying out voiceprint recognition test on the target recognition network model by adopting a preset MFCC test set corresponding to the test voiceprint information to obtain a test result;
the optimization module is used for screening the target recognition network model according to the test result to obtain an optimized recognition network model;
correspondingly, the voiceprint recognition module is specifically configured to:
and identifying the target voiceprint by adopting the optimized identification network model to obtain a voiceprint identification result.
9. A neural network-based voiceprint recognition apparatus, the apparatus comprising a processor and a memory;
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the neural network-based voiceprint recognition method of any one of claims 1-4 according to instructions in the program code.
10. A computer-readable storage medium for storing program code for performing the neural network-based voiceprint recognition method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210635522.3A CN115035901A (en) | 2022-06-07 | 2022-06-07 | Voiceprint recognition method based on neural network and related device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115035901A true CN115035901A (en) | 2022-09-09 |
Family
ID=83122551
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210635522.3A Pending CN115035901A (en) | 2022-06-07 | 2022-06-07 | Voiceprint recognition method based on neural network and related device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115035901A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115831127A (en) * | 2023-01-09 | 2023-03-21 | 浙江大学 | Voiceprint reconstruction model construction method and device based on voice conversion and storage medium |
CN115831127B (en) * | 2023-01-09 | 2023-05-05 | 浙江大学 | Voiceprint reconstruction model construction method and device based on voice conversion and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102159217B1 (en) | Electronic device, identification method, system and computer-readable storage medium | |
CN109326299B (en) | Speech enhancement method, device and storage medium based on full convolution neural network | |
CN108597505B (en) | Voice recognition method and device and terminal equipment | |
CN105976812A (en) | Voice identification method and equipment thereof | |
CN110428842A (en) | Speech model training method, device, equipment and computer readable storage medium | |
CN106683680A (en) | Speaker recognition method and device and computer equipment and computer readable media | |
CN110956957A (en) | Training method and system of speech enhancement model | |
CN106898355B (en) | Speaker identification method based on secondary modeling | |
CN112687263A (en) | Voice recognition neural network model, training method thereof and voice recognition method | |
CN108986798B (en) | Processing method, device and the equipment of voice data | |
CN111932296B (en) | Product recommendation method and device, server and storage medium | |
CN111009238B (en) | Method, device and equipment for recognizing spliced voice | |
CN111508524B (en) | Method and system for identifying voice source equipment | |
CN111986679A (en) | Speaker confirmation method, system and storage medium for responding to complex acoustic environment | |
CN109658943B (en) | Audio noise detection method and device, storage medium and mobile terminal | |
CN112530410A (en) | Command word recognition method and device | |
CN115035901A (en) | Voiceprint recognition method based on neural network and related device | |
CN112183582A (en) | Multi-feature fusion underwater target identification method | |
CN113763966B (en) | End-to-end text irrelevant voiceprint recognition method and system | |
CN113571095B (en) | Speech emotion recognition method and system based on nested deep neural network | |
WO2021179198A1 (en) | Image feature visualization method, image feature visualization apparatus, and electronic device | |
CN112329819A (en) | Underwater target identification method based on multi-network fusion | |
CN116542783A (en) | Risk assessment method, device, equipment and storage medium based on artificial intelligence | |
CN116343798A (en) | Verification method and device for speaker identity in far-field scene and electronic equipment | |
CN114141256A (en) | Voiceprint feature extraction model construction method and system based on wavelet neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||