CN115035901A - Voiceprint recognition method based on neural network and related device - Google Patents
- Publication number
- CN115035901A CN115035901A CN202210635522.3A CN202210635522A CN115035901A CN 115035901 A CN115035901 A CN 115035901A CN 202210635522 A CN202210635522 A CN 202210635522A CN 115035901 A CN115035901 A CN 115035901A
- Authority
- CN
- China
- Prior art keywords
- voiceprint
- semi
- orthogonal
- network model
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L17/00 — Speaker identification or verification techniques
- G10L17/02 — Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/04 — Training, enrolment or model building
- G10L17/18 — Artificial neural networks; connectionist approaches
- G10L17/20 — Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
Abstract
The application discloses a voiceprint recognition method based on a neural network and a related device. The method includes: constructing a semi-orthogonal decomposition neural network model from a plurality of semi-orthogonal convolution blocks, where each semi-orthogonal convolution block contains a plurality of semi-orthogonal one-dimensional convolution layers connected in series and linked by inner-hop and outer-hop connection structures; performing voiceprint recognition training on the semi-orthogonal decomposition neural network model according to a preset MFCC training set corresponding to training voiceprint information to obtain a target recognition network model; and recognizing a target voiceprint with the target recognition network model to obtain a voiceprint recognition result. The semi-orthogonal one-dimensional convolution layers decompose the original parameter matrices in the network, compressing the redundant parameter representation space and reducing the time-delay span while filtering noise interference. The method and device address the technical problems that existing voiceprint recognition technology has poor noise immunity, limited time-delay modeling capability, and recognition results lacking accuracy and reliability.
Description
Technical Field
The present application relates to the field of voiceprint recognition technologies, and in particular, to a voiceprint recognition method and a related apparatus based on a neural network.
Background
A voiceprint recognition system automatically identifies a speaker from the characteristics of the human voice. Voiceprint recognition is one of the biometric verification technologies: it verifies a speaker's identity through speech. The technology offers good convenience, stability, measurability and security, and is widely used in banking, social security, public security, smart homes, mobile payment and other fields.
Existing voiceprint recognition methods are limited by the noise contained in voiceprint information, so their recognition results lack accuracy and reliability; moreover, the time-delay modeling capability of neural-network-based voiceprint recognition is limited, so the practical recognition effect is poor and cannot meet high-standard application requirements.
Disclosure of Invention
The application provides a voiceprint recognition method based on a neural network and a related device, which are used for solving the technical problems that the anti-noise capability of the existing voiceprint recognition technology is poor, the time delay modeling capability is limited, and the recognition result lacks accuracy and reliability.
In view of the above, a first aspect of the present application provides a voiceprint recognition method based on a neural network, including:
constructing a semi-orthogonal decomposition neural network model based on a plurality of semi-orthogonal convolution blocks, wherein each semi-orthogonal convolution block comprises a plurality of semi-orthogonal one-dimensional convolution layers, and the semi-orthogonal one-dimensional convolution layers are connected in series, and are connected with each other through an inner hop connecting structure and an outer hop connecting structure;
performing voiceprint recognition training on the semi-orthogonal decomposition neural network model according to a preset MFCC training set corresponding to training voiceprint information to obtain a target recognition network model;
and identifying the target voiceprint by adopting the target identification network model to obtain a voiceprint identification result.
Preferably, before performing voiceprint recognition training on the semi-orthogonal decomposition neural network model according to the preset MFCC training set corresponding to the training voiceprint information to obtain the target recognition network model, the method further includes:
preprocessing the training voiceprint information to obtain an audio frame to be processed, wherein the preprocessing operation comprises weighting, framing and windowing;
based on a Fourier transform algorithm, calculating the audio frame by adopting a Mel filter to obtain MFCC characteristics;
and constructing a preset MFCC training set according to the MFCC characteristics.
Preferably, the performing voiceprint recognition training on the semi-orthogonal decomposition neural network model according to a preset MFCC training set corresponding to training voiceprint information to obtain a target recognition network model further includes:
constructing a semi-orthogonal decomposition feature extractor based on a plurality of semi-orthogonal convolution blocks;
performing voiceprint feature extraction training on the semi-orthogonal decomposition feature extractor according to a preset MFCC training set corresponding to training voiceprint information to obtain a target voiceprint feature extractor;
and in the voiceprint information registration process, performing feature extraction on the newly added voiceprint through the target voiceprint feature extractor, and storing the extracted voiceprint features in a database.
Preferably, after performing voiceprint recognition training on the semi-orthogonal decomposition neural network model according to the preset MFCC training set corresponding to the training voiceprint information to obtain the target recognition network model, the method further includes:
performing voiceprint recognition test on the target recognition network model by adopting a preset MFCC test set corresponding to the test voiceprint information to obtain a test result;
screening the target recognition network model according to the test result to obtain an optimized recognition network model;
correspondingly, the identifying the target voiceprint by using the target identification network model to obtain the voiceprint identification result includes:
and identifying the target voiceprint by adopting the optimized identification network model to obtain a voiceprint identification result.
A second aspect of the present application provides a voiceprint recognition apparatus based on a neural network, including:
the model building module is used for building a semi-orthogonal decomposition neural network model based on a plurality of semi-orthogonal convolution blocks, each semi-orthogonal convolution block comprises a plurality of semi-orthogonal one-dimensional convolution layers, and the semi-orthogonal one-dimensional convolution layers are connected in series, and are connected through an inner jump connection structure and an outer jump connection structure;
the model training module is used for carrying out voiceprint recognition training on the semi-orthogonal decomposition neural network model according to a preset MFCC training set corresponding to training voiceprint information to obtain a target recognition network model;
and the voiceprint recognition module is used for recognizing the target voiceprint by adopting the target recognition network model to obtain a voiceprint recognition result.
Preferably, the apparatus further includes:
the preprocessing module is used for preprocessing the training voiceprint information to obtain an audio frame to be processed, and the preprocessing operation comprises weighting, framing and windowing;
the feature extraction module is used for calculating the audio frame by adopting a Mel filter based on a Fourier transform algorithm to obtain MFCC features;
and the training set constructing module is used for constructing a preset MFCC training set according to the MFCC characteristics.
Preferably, the apparatus further includes:
an extractor construction module for constructing a semi-orthogonal decomposition feature extractor based on a plurality of semi-orthogonal convolution blocks;
the extractor training module is used for carrying out voiceprint feature extraction training on the semi-orthogonal decomposition feature extractor according to a preset MFCC training set corresponding to training voiceprint information to obtain a target voiceprint feature extractor;
and the extractor using module is used for extracting the characteristics of the newly added voiceprints through the target voiceprint characteristic extractor in the voiceprint information registration process and storing the extracted voiceprint characteristics in a database.
Preferably, the apparatus further includes:
the test module is used for carrying out voiceprint recognition test on the target recognition network model by adopting a preset MFCC test set corresponding to the test voiceprint information to obtain a test result;
the optimization module is used for screening the target recognition network model according to the test result to obtain an optimized recognition network model;
correspondingly, the voiceprint recognition module is specifically configured to:
and identifying the target voiceprint by adopting the optimized identification network model to obtain a voiceprint identification result.
A third aspect of the present application provides a voiceprint recognition device based on a neural network, the device comprising a processor and a memory;
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the neural network based voiceprint recognition method of the first aspect according to instructions in the program code.
A fourth aspect of the present application provides a computer-readable storage medium for storing program code for executing the neural network-based voiceprint recognition method of the first aspect.
According to the technical scheme, the embodiment of the application has the following advantages:
In the present application, a voiceprint recognition method based on a neural network is provided, including: constructing a semi-orthogonal decomposition neural network model from a plurality of semi-orthogonal convolution blocks, where each semi-orthogonal convolution block contains a plurality of semi-orthogonal one-dimensional convolution layers connected in series and linked by inner-hop and outer-hop connection structures; performing voiceprint recognition training on the semi-orthogonal decomposition neural network model according to a preset MFCC training set corresponding to training voiceprint information to obtain a target recognition network model; and recognizing the target voiceprint with the target recognition network model to obtain a voiceprint recognition result.
In the neural-network-based voiceprint recognition method of the present application, the convolution layers are connected by skip-connection structures while the semi-orthogonal decomposition neural network model is constructed, and shallow-layer voiceprint feature information is transmitted directly to the deep convolution layers, so that the deep network obtains richer voiceprint information and the noise immunity of the network is improved. The semi-orthogonal one-dimensional convolution layers decompose the original parameter matrices in the network, compressing the redundant parameter representation space and reducing the time-delay span while filtering noise interference, thereby achieving long-time-delay learning. The present application therefore solves the technical problems that existing voiceprint recognition technology has poor noise immunity, limited time-delay modeling capability, and recognition results lacking accuracy and reliability.
Drawings
Fig. 1 is a schematic flowchart of a voiceprint recognition method based on a neural network according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a voiceprint recognition apparatus based on a neural network according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a semi-orthogonal decomposition neural network model provided in an embodiment of the present application;
fig. 4 is a schematic network structure diagram of a semi-orthogonal convolution block according to an embodiment of the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
For easy understanding, please refer to fig. 1; the present application provides an embodiment of a voiceprint recognition method based on a neural network, including:
Step 101: constructing a semi-orthogonal decomposition neural network model based on a plurality of semi-orthogonal convolution blocks.
It should be noted that the current mainstream deep neural network (DNN) approach to voiceprint recognition is based on the time-delay neural network (TDNN), whose penultimate hidden layer is taken as the voiceprint feature output, known as the x-vector. The TDNN is built mainly from stacked one-dimensional convolutional neural network (CNN) components. Although the one-dimensional convolution component can describe image or speech characteristics at multiple scales, outperforming ordinary fully-connected layers, the TDNN's recognition accuracy degrades in strong-noise environments and its noise immunity is insufficient. In addition, the TDNN's time-delay modeling capability is limited: it can learn effectively only within a short, steady time range.
Therefore, in this embodiment, the weight matrices of the neural network are decomposed using semi-orthogonal convolution layers, greatly reducing the parameter count of the original one-dimensional convolution weight layers. Under supervised learning of speaker labels, the semi-orthogonal decomposition neural network extracts the important speaker voiceprint information from each factorization using a finite number of parameters and filters out irrelevant noise information, providing noise immunity. Note that the time-delay modeling range of a semi-orthogonal convolution layer is limited: if too large a span is set, sampling leakage occurs and information filtering deteriorates.
Referring to fig. 3, the semi-orthogonal decomposition neural network model in this example includes a plurality of distinct semi-orthogonal convolution blocks, with adjacent blocks connected in series; in addition, from the second semi-orthogonal convolution block onward, each block splices in the outputs of one, two or more outer-hop or inner-hop connection structures. It should be noted that the outer-hop connection structure itself contains network components such as a semi-orthogonal one-dimensional convolution layer and an activation function, and its output is transmitted to the second and deeper semi-orthogonal convolution blocks of the network model. The inner-hop connection structure starts from the second semi-orthogonal convolution block and transmits shallow-layer feature information to the deeper semi-orthogonal convolution blocks. As the example in fig. 3 shows, the inputs of the second semi-orthogonal convolution block include a serial input and an outer-hop input; the inputs of the third semi-orthogonal convolution block include a serial input, an outer-hop connection input and an inner-hop connection input, and so on. The number of semi-orthogonal convolution blocks in the network model may be chosen according to actual needs and is not limited here, as long as it follows the block-connection scheme of this embodiment.
Each semi-orthogonal convolution block includes at least two semi-orthogonal one-dimensional convolution layers, making it a multi-segment block. Referring to fig. 4, the arc arrows in fig. 4 show the data flow of the inner-hop and outer-hop connection structures. Besides the semi-orthogonal one-dimensional convolution layers, the block contains an outer-hop splicing layer, an activation function, a normalization layer and an output layer; the output layer fuses the input information of the inner-hop connection structure with the output information of the normalization layer. The outer-hop splicing layer can receive the outputs of outer-hop connection structures fed in from other semi-orthogonal convolution blocks and combine them, by splicing, with the other information received by the layer. It will be appreciated that the outer-hop splicing layer may receive the outputs of several distinct outer-hop connection structures.
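The block-level wiring described above can be sketched in a few lines. This is an illustrative toy, not the patent's implementation: `block`, `outer_hop` and `fuse` are hypothetical stand-ins for the real convolution block, the outer-hop branch (convolution plus activation) and the splicing operation.

```python
def block(x):
    """Stand-in for a semi-orthogonal convolution block (placeholder transform)."""
    return [v + 1.0 for v in x]

def outer_hop(x):
    """Stand-in for the outer-hop branch (semi-orthogonal conv + activation)."""
    return [v * 0.5 for v in x]

def fuse(*inputs):
    """Combine the serial input with skip inputs (element-wise sum here)."""
    return [sum(vals) for vals in zip(*inputs)]

x0 = [1.0, 2.0]                          # model input features
b1 = block(x0)                           # first block: serial input only
b2 = block(fuse(b1, outer_hop(x0)))      # second block: serial + outer-hop
b3 = block(fuse(b2, outer_hop(x0), b1))  # third block: serial + outer-hop + inner-hop (shallow b1)
```

The point of the sketch is only the connectivity: each deeper block sees the serial output of its predecessor plus skip inputs carrying shallow-layer information.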
It should be noted that the semi-orthogonal one-dimensional convolution layer decomposes the input parameter matrix A[a, a], constraining the decomposed parameter matrix M[a, b] to be semi-orthogonal so that the effective voiceprint information of the output matrix B[b, b] is retained, where a and b are the matrix dimensions. The constrained decomposition is:
A = MB, subject to M^T M = αI
wherein α is a floating-point coefficient, 1 by default, and I is an identity matrix.
When the constraint converges, there will be:
M^T A = M^T MB ≈ αB
i.e. the output matrix B is approximately recovered as M^T A (for the default α = 1).
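The decomposition and its convergence property can be checked numerically. Below is a minimal pure-Python sketch assuming α = 1 and a toy column-orthonormal M; the shapes and values are illustrative only, not taken from the patent.

```python
def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

M = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0],
     [0.0, 0.0]]   # M[a=4, b=2], columns orthonormal: M^T M = I (α = 1)

B = [[2.0, 3.0],
     [4.0, 5.0]]   # B[b, b]

A = matmul(M, B)                 # A = MB
B_rec = matmul(transpose(M), A)  # B ≈ M^T A once the constraint holds
```

With the semi-orthogonal constraint satisfied exactly, `B_rec` equals `B`; during training the constraint only holds approximately, hence the ≈ in the text.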
A convolution operation is then performed on the decomposed matrix B (or on the spliced matrix) with the one-dimensional convolution parameter matrix N[b, a], and multiple convolution kernels learn information at different scales to generate the matrix P[a, a]. The matrix P differs from the output R[a, a] of an ordinary one-dimensional convolution TDNN; assuming the parameter matrix of the ordinary one-dimensional convolution TDNN is W[a, a], the two differ as follows:
ordinary one-dimensional convolution: A · W → R
semi-orthogonal decomposition one-dimensional convolution: A · MN → P
When the bottleneck dimension b of the parameter matrix M is a/4 or less, the total parameter count of M and N is at most half the parameter count of W. The multiple audio recordings of each training speaker differ in their noise factors, and under supervised learning of speaker labels the neural network learns the speaker commonalities across the various noises. By decomposing the original matrix, the semi-orthogonal one-dimensional convolution compresses the redundant parameter representation space, refining the speaker information while filtering noise interference; the modeled noise information is:
ε = W − MN.
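The parameter-count claim above (bottleneck b ≤ a/4 gives at most half the parameters of W) can be verified with simple arithmetic; the layer width a = 512 below is an assumed example value, not from the patent.

```python
def params_ordinary(a):
    """Parameter count of the ordinary one-dimensional convolution weight W[a, a]."""
    return a * a

def params_factored(a, b):
    """Parameter count of the factored pair M[a, b] and N[b, a], i.e. 2ab."""
    return a * b + b * a

a = 512       # assumed layer width, for illustration
b = a // 4    # bottleneck dimension at the a/4 limit
```

At b = a/4, 2ab = a²/2, exactly half of W's a² parameters; any smaller b reduces the count further.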
The activation function and normalization layer apply nonlinear activation and re-normalization to the matrix P output by the semi-orthogonal one-dimensional convolution layer, yielding the hidden-layer information Q. Finally, the output layer combines the activated and normalized matrix Q with the skip-connection input matrix A by an integration operation such as addition or splicing; this embodiment uses weighted addition with a default weight of 0.66, producing the output matrix O.
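The output-layer fusion can be sketched as follows. The patent states only a weighted addition with a default weight of 0.66; the exact split O = w·Q + (1 − w)·A below is an assumption made for illustration.

```python
W_DEFAULT = 0.66  # default fusion weight stated in the embodiment

def output_layer(Q, A, w=W_DEFAULT):
    """Weighted addition of activated/normalized features Q and skip input A.
    The (1 - w) share given to A is an assumption, not stated in the patent."""
    return [w * q + (1.0 - w) * a for q, a in zip(Q, A)]

O = output_layer([1.0, 0.0], [0.0, 1.0])
```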
Step 102: performing voiceprint recognition training on the semi-orthogonal decomposition neural network model according to a preset MFCC training set corresponding to the training voiceprint information to obtain a target recognition network model.
The training data of the semi-orthogonal decomposition neural network model are acoustic features, namely Mel-frequency cepstral coefficients (MFCC). For a labelled training set, the cross-entropy loss between the model output and the labels can be computed to optimize training and obtain the target recognition network model.
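The cross-entropy loss mentioned above can be written out in a short stdlib sketch; the logits and the speaker label are illustrative values, not patent data.

```python
import math

def softmax(logits):
    m = max(logits)                           # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the true speaker label."""
    return -math.log(softmax(logits)[label])

loss = cross_entropy([2.0, 0.5, 0.1], label=0)  # toy logits for 3 speakers
```

Minimizing this loss over the labelled MFCC training set is what drives the factorized layers to keep speaker-discriminative information.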
Further, step 102, before, further comprising:
preprocessing the training voiceprint information to obtain an audio frame to be processed, wherein the preprocessing operation comprises weighting, framing and windowing;
based on a Fourier transform algorithm, an audio frame is calculated by adopting a Mel filter to obtain MFCC characteristics;
and constructing a preset MFCC training set according to the MFCC characteristics.
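The three preprocessing steps listed above (weighting, framing, windowing) can be sketched as follows. The pre-emphasis coefficient, frame length and hop size are typical values assumed for illustration, not taken from the patent.

```python
import math

def pre_emphasis(signal, coeff=0.97):
    """Weighting step: boost high frequencies (coefficient is an assumed typical value)."""
    return [signal[0]] + [signal[i] - coeff * signal[i - 1]
                          for i in range(1, len(signal))]

def frame_signal(signal, frame_len=400, hop=160):
    """Framing step: 25 ms frames with 10 ms hop at 16 kHz (assumed sizes)."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]

def hamming(n):
    """Windowing step: Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

signal = [math.sin(0.01 * i) for i in range(1600)]  # 0.1 s of toy audio at 16 kHz
frames = frame_signal(pre_emphasis(signal))
windowed = [[s * w for s, w in zip(f, hamming(len(f)))] for f in frames]
```

The windowed frames would then be passed through the Fourier transform and Mel filter bank described in the text to produce the MFCC features.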
Further, step 102, further includes:
constructing a semi-orthogonal decomposition feature extractor based on a plurality of semi-orthogonal convolution blocks;
performing voiceprint feature extraction training on the semi-orthogonal decomposition feature extractor according to a preset MFCC training set corresponding to training voiceprint information to obtain a target voiceprint feature extractor;
and in the voiceprint information registration process, extracting the characteristics of the newly added voiceprints through a target voiceprint characteristic extractor, and storing the extracted voiceprint characteristics in a database.
The feature extractor has the same network structure as the semi-orthogonal decomposition neural network model; the essential difference is that voiceprint features are output at the first layer after the final pooling layer of the model, instead of the recognition result of the final fully-connected layer. The training process of the target voiceprint feature extractor is the same as that of the recognition model, and the feature extractor can be used both in the initial voice information registration process and in the acquisition of a verification set.
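The relationship between the recognition model and the feature extractor can be sketched with placeholder layers: the extractor runs the same stack but stops before the final fully-connected layer. All layer bodies below are hypothetical stand-ins, not the patent's layers.

```python
def conv_blocks(x):
    """Stand-in for the stacked semi-orthogonal convolution blocks."""
    return [v + 1.0 for v in x]

def pooling(x):
    """Stand-in for the final (statistics) pooling layer."""
    return [sum(x) / len(x)]

def fully_connected(x):
    """Stand-in for the final fully-connected classification layer."""
    return [v * 2.0 for v in x]

def recognize(x):
    """Full recognition model: blocks -> pooling -> classification output."""
    return fully_connected(pooling(conv_blocks(x)))

def extract_embedding(x):
    """Feature extractor: same stack, output taken before the final FC layer."""
    return pooling(conv_blocks(x))

emb = extract_embedding([1.0, 2.0, 3.0])  # this vector is what gets stored at registration
```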
Further, step 102, thereafter, further includes:
performing voiceprint recognition test on the target recognition network model by adopting a preset MFCC test set corresponding to the test voiceprint information to obtain a test result;
screening the target recognition network model according to the test result to obtain an optimized recognition network model;
correspondingly, step 103 includes:
and identifying the target voiceprint by adopting an optimized identification network model to obtain a voiceprint identification result.
Step 103: identifying the target voiceprint by adopting the target recognition network model to obtain a voiceprint recognition result.
In the neural-network-based voiceprint recognition method above, the convolution layers are connected by skip-connection structures while the semi-orthogonal decomposition neural network model is constructed, and shallow-layer voiceprint feature information is transmitted directly to the deep convolution layers, so that the deep network obtains richer voiceprint information and the noise immunity of the network is improved. The semi-orthogonal one-dimensional convolution layers decompose the original parameter matrices in the network, compressing the redundant parameter representation space and reducing the time-delay span while filtering noise interference, thereby achieving long-time-delay learning. The method can therefore solve the technical problems that existing voiceprint recognition technology has poor noise immunity, limited time-delay modeling capability, and recognition results lacking accuracy and reliability.
For ease of understanding, referring to fig. 2, the present application provides an embodiment of a neural network-based voiceprint recognition apparatus, comprising:
the model building module 201 is used for building a semi-orthogonal decomposition neural network model based on a plurality of semi-orthogonal convolution blocks, each semi-orthogonal convolution block comprises a plurality of semi-orthogonal one-dimensional convolution layers, and the semi-orthogonal one-dimensional convolution layers are connected in a series mode through an inner jump connection structure and an outer jump connection structure;
the model training module 202 is used for performing voiceprint recognition training on the semi-orthogonal decomposition neural network model according to a preset MFCC training set corresponding to training voiceprint information to obtain a target recognition network model;
and the voiceprint recognition module 203 is configured to recognize the target voiceprint by using the target recognition network model to obtain a voiceprint recognition result.
Further, still include:
the preprocessing module 204 is configured to perform preprocessing operations on the training voiceprint information to obtain an audio frame to be processed, where the preprocessing operations include weighting, framing, and windowing;
a feature extraction module 205, configured to calculate an audio frame by using a mel filter based on a fourier transform algorithm to obtain MFCC features;
a training set constructing module 206, configured to construct a preset MFCC training set according to the MFCC characteristics.
Further, still include:
an extractor construction module 207 for constructing a semi-orthogonal decomposition feature extractor based on the plurality of semi-orthogonal convolution blocks;
an extractor training module 208, configured to perform voiceprint feature extraction training on the semi-orthogonal decomposition feature extractor according to a preset MFCC training set corresponding to training voiceprint information, to obtain a target voiceprint feature extractor;
and the extractor using module 209 is used for performing feature extraction on the newly added voiceprint through the target voiceprint feature extractor in the process of registering the voiceprint information, and storing the extracted voiceprint features in the database.
Optionally, the apparatus further includes:
the test module 210 is configured to perform a voiceprint recognition test on the target recognition network model by using a preset MFCC test set corresponding to the test voiceprint information to obtain a test result;
the optimization module 211 is configured to screen the target recognition network model according to the test result to obtain an optimized recognition network model;
correspondingly, the voiceprint recognition module 203 is specifically configured to:
recognize the target voiceprint using the optimized recognition network model to obtain the voiceprint recognition result.
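The test-and-screen loop of modules 210 and 211 amounts to evaluating each trained candidate on a held-out MFCC test set and keeping the best performer. A minimal, dependency-free sketch follows; the screening criterion here is plain accuracy, which is an assumption, since the patent does not fix the metric.

```python
def screen_models(candidates, test_set):
    """Pick the best candidate by held-out accuracy.

    candidates: {name: predict_fn}; test_set: list of (features, label).
    """
    def accuracy(predict):
        return sum(predict(x) == y for x, y in test_set) / len(test_set)
    scores = {name: accuracy(fn) for name, fn in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores

# toy stand-ins for two trained recognition models
test_set = [(1, 1), (2, 0), (3, 1), (4, 0)]
candidates = {
    "always_zero": lambda x: 0,   # 0.5 accuracy on this set
    "parity": lambda x: x % 2,    # matches every label here
}
best, scores = screen_models(candidates, test_set)
print(best)  # parity
```

In practice the candidates would be checkpoints or hyperparameter variants of the target recognition network, and the winner becomes the "optimized recognition network model" used by module 203.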
The application also provides a voiceprint recognition device based on the neural network, and the device comprises a processor and a memory;
the memory is used for storing the program codes and transmitting the program codes to the processor;
the processor is used for executing the neural network-based voiceprint recognition method in the above method embodiment according to the instructions in the program code.
The present application also provides a computer-readable storage medium for storing program code for executing the neural network-based voiceprint recognition method in the above-described method embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
Claims (10)
1. A voiceprint recognition method based on a neural network is characterized by comprising the following steps:
constructing a semi-orthogonal decomposition neural network model based on a plurality of semi-orthogonal convolution blocks, wherein each semi-orthogonal convolution block comprises a plurality of semi-orthogonal one-dimensional convolution layers, and the semi-orthogonal one-dimensional convolution layers are connected in series through inner and outer skip-connection structures;
performing voiceprint recognition training on the semi-orthogonal decomposition neural network model according to a preset MFCC training set corresponding to training voiceprint information to obtain a target recognition network model;
and identifying the target voiceprint by adopting the target identification network model to obtain a voiceprint identification result.
2. The neural network-based voiceprint recognition method according to claim 1, wherein before the performing of voiceprint recognition training on the semi-orthogonal decomposition neural network model according to the preset MFCC training set corresponding to the training voiceprint information to obtain the target recognition network model, the method further comprises:
preprocessing the training voiceprint information to obtain audio frames to be processed, wherein the preprocessing operations comprise pre-emphasis, framing, and windowing;
based on a Fourier transform algorithm, calculating the audio frame by adopting a Mel filter to obtain MFCC characteristics;
and constructing a preset MFCC training set according to the MFCC characteristics.
3. The neural network-based voiceprint recognition method according to claim 1, wherein the performing of voiceprint recognition training on the semi-orthogonal decomposition neural network model according to the preset MFCC training set corresponding to the training voiceprint information to obtain the target recognition network model further comprises:
constructing a semi-orthogonal decomposition feature extractor based on a plurality of semi-orthogonal convolution blocks;
performing voiceprint feature extraction training on the semi-orthogonal decomposition feature extractor according to a preset MFCC training set corresponding to training voiceprint information to obtain a target voiceprint feature extractor;
and in the voiceprint information registration process, performing feature extraction on the newly added voiceprint through the target voiceprint feature extractor, and storing the extracted voiceprint features in a database.
4. The neural network-based voiceprint recognition method according to claim 1, wherein after the performing of voiceprint recognition training on the semi-orthogonal decomposition neural network model according to the preset MFCC training set corresponding to the training voiceprint information to obtain the target recognition network model, the method further comprises:
performing voiceprint recognition test on the target recognition network model by adopting a preset MFCC test set corresponding to the test voiceprint information to obtain a test result;
screening the target recognition network model according to the test result to obtain an optimized recognition network model;
correspondingly, the identifying the target voiceprint by using the target identification network model to obtain the voiceprint identification result includes:
and identifying the target voiceprint by adopting the optimized identification network model to obtain a voiceprint identification result.
5. A voiceprint recognition apparatus based on a neural network, comprising:
the model building module is used for building a semi-orthogonal decomposition neural network model based on a plurality of semi-orthogonal convolution blocks, wherein each semi-orthogonal convolution block comprises a plurality of semi-orthogonal one-dimensional convolution layers connected in series through inner and outer skip-connection structures;
the model training module is used for carrying out voiceprint recognition training on the semi-orthogonal decomposition neural network model according to a preset MFCC training set corresponding to training voiceprint information to obtain a target recognition network model;
and the voiceprint recognition module is used for recognizing the target voiceprint by adopting the target recognition network model to obtain a voiceprint recognition result.
6. The neural network-based voiceprint recognition apparatus according to claim 5, further comprising:
the preprocessing module is used for preprocessing the training voiceprint information to obtain audio frames to be processed, wherein the preprocessing operations comprise pre-emphasis, framing, and windowing;
the feature extraction module is used for calculating the audio frame by adopting a Mel filter based on a Fourier transform algorithm to obtain MFCC features;
and the training set constructing module is used for constructing a preset MFCC training set according to the MFCC characteristics.
7. The neural network-based voiceprint recognition apparatus according to claim 5, further comprising:
an extractor construction module for constructing a semi-orthogonal decomposition feature extractor based on the plurality of semi-orthogonal convolution blocks;
the extractor training module is used for carrying out voiceprint feature extraction training on the semi-orthogonal decomposition feature extractor according to a preset MFCC training set corresponding to training voiceprint information to obtain a target voiceprint feature extractor;
and the extractor using module is used for extracting the characteristics of the newly added voiceprints through the target voiceprint characteristic extractor in the voiceprint information registration process and storing the extracted voiceprint characteristics in a database.
8. The neural network-based voiceprint recognition apparatus according to claim 5, further comprising:
the test module is used for carrying out voiceprint recognition test on the target recognition network model by adopting a preset MFCC test set corresponding to the test voiceprint information to obtain a test result;
the optimization module is used for screening the target recognition network model according to the test result to obtain an optimized recognition network model;
correspondingly, the voiceprint recognition module is specifically configured to:
and identifying the target voiceprint by adopting the optimized identification network model to obtain a voiceprint identification result.
9. A neural network-based voiceprint recognition apparatus, the apparatus comprising a processor and a memory;
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the neural network-based voiceprint recognition method of any one of claims 1-4 according to instructions in the program code.
10. A computer-readable storage medium for storing program code for performing the neural network-based voiceprint recognition method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210635522.3A CN115035901A (en) | 2022-06-07 | 2022-06-07 | Voiceprint recognition method based on neural network and related device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115035901A true CN115035901A (en) | 2022-09-09 |
Family
ID=83122551
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210635522.3A Pending CN115035901A (en) | 2022-06-07 | 2022-06-07 | Voiceprint recognition method based on neural network and related device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115035901A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115831127A (en) * | 2023-01-09 | 2023-03-21 | 浙江大学 | Voiceprint reconstruction model construction method and device based on voice conversion and storage medium |
CN115831127B (en) * | 2023-01-09 | 2023-05-05 | 浙江大学 | Voiceprint reconstruction model construction method and device based on voice conversion and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102159217B1 (en) | Electronic device, identification method, system and computer-readable storage medium | |
CN109326299B (en) | Speech enhancement method, device and storage medium based on full convolution neural network | |
CN108597505B (en) | Voice recognition method and device and terminal equipment | |
CN105976812A (en) | Voice identification method and equipment thereof | |
CN110428842A (en) | Speech model training method, device, equipment and computer readable storage medium | |
CN106683680A (en) | Speaker recognition method and device and computer equipment and computer readable media | |
CN110956957A (en) | Training method and system of speech enhancement model | |
CN106898355B (en) | Speaker identification method based on secondary modeling | |
CN112687263A (en) | Voice recognition neural network model, training method thereof and voice recognition method | |
CN108986798B (en) | Processing method, device and the equipment of voice data | |
CN111932296B (en) | Product recommendation method and device, server and storage medium | |
CN111009238B (en) | Method, device and equipment for recognizing spliced voice | |
CN111508524B (en) | Method and system for identifying voice source equipment | |
CN111986679A (en) | Speaker confirmation method, system and storage medium for responding to complex acoustic environment | |
CN109658943B (en) | Audio noise detection method and device, storage medium and mobile terminal | |
CN112530410A (en) | Command word recognition method and device | |
CN115035901A (en) | Voiceprint recognition method based on neural network and related device | |
CN112183582A (en) | Multi-feature fusion underwater target identification method | |
CN113763966B (en) | End-to-end text irrelevant voiceprint recognition method and system | |
CN113571095B (en) | Speech emotion recognition method and system based on nested deep neural network | |
WO2021179198A1 (en) | Image feature visualization method, image feature visualization apparatus, and electronic device | |
CN112329819A (en) | Underwater target identification method based on multi-network fusion | |
CN116542783A (en) | Risk assessment method, device, equipment and storage medium based on artificial intelligence | |
CN116343798A (en) | Verification method and device for speaker identity in far-field scene and electronic equipment | |
CN114141256A (en) | Voiceprint feature extraction model construction method and system based on wavelet neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||