CN113870863A - Voiceprint recognition method and device, storage medium and electronic equipment - Google Patents

Voiceprint recognition method and device, storage medium and electronic equipment

Info

Publication number
CN113870863A
Authority
CN
China
Prior art keywords
network model
neural network
voiceprint
training
generator
Prior art date
Legal status
Pending
Application number
CN202111181601.3A
Other languages
Chinese (zh)
Inventor
沈浩 (Shen Hao)
赵德欣 (Zhao Dexin)
王磊 (Wang Lei)
曾然然 (Zeng Ranran)
林悦 (Lin Yue)
Current Assignee
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date
Filing date
Publication date
Application filed by China Telecom Corp Ltd
Priority to CN202111181601.3A
Publication of CN113870863A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L17/18 Artificial neural networks; Connectionist approaches

Abstract

The present disclosure provides a voiceprint recognition method and apparatus, a storage medium, and an electronic device, relating to the technical field of artificial intelligence. The method comprises the following steps: determining candidate unit structures; constructing a neural network model and training it; adding a mask module to the neural network model to construct a generator; taking the trained neural network model as a baseline network, constructing a generative adversarial network model from the baseline network, the generator, and a discriminator, and training it; determining intermediate state nodes to be pruned according to the trained generative adversarial network model and pruning them; after retraining the pruned neural network model, extracting the voiceprint features of the target to be recognized; and determining a voiceprint recognition result based on similarity. The method and apparatus can alleviate the problems of existing neural network architecture search, namely the heavy consumption of computing resources during the search and the large parameter count and computation amount of the searched network structure.

Description

Voiceprint recognition method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a voiceprint recognition method, a voiceprint recognition apparatus, a computer-readable medium, and an electronic device.
Background
In recent years, neural architecture search (NAS), a technique that automatically searches for an optimal neural network architecture, has surpassed manually designed architectures on a variety of tasks, such as image classification, semantic segmentation, and object detection.
Conventional NAS methods select among candidate network architectures, but consume a large amount of computing resources; moreover, the searched network units are simply spliced together, so the whole network still carries a large number of parameters.
In short, existing architecture search methods require substantial computing resources during the search, and the searched network structure suffers from a large parameter count and a heavy computation load.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The disclosed embodiments provide a voiceprint recognition method, a voiceprint recognition apparatus, a computer-readable medium, and an electronic device, so as to alleviate, at least to some extent, the problems of existing neural network architecture search: heavy consumption of computing resources during the search, and a large parameter count and computation amount in the searched network structure.
According to a first aspect of the present disclosure, there is provided a voiceprint recognition method, comprising:
determining candidate unit structures; constructing a neural network model based on the candidate unit structures and training the neural network model;
adding a mask module to the neural network model to construct a corresponding generator;
taking the trained neural network model as a baseline network, constructing a generative adversarial network model from the baseline network, the generator, and a discriminator, and training the generative adversarial network model;
determining intermediate state nodes to be pruned according to the trained generative adversarial network model;
pruning the intermediate state nodes to be pruned in the neural network model to obtain a simplified neural network model;
retraining the simplified neural network model, and extracting voiceprint features of a target to be recognized with the trained simplified neural network model to obtain a voiceprint feature vector of the target to be recognized;
and determining a voiceprint recognition result according to the similarity between the voiceprint feature vector of the target to be recognized and labeled voiceprint feature vectors.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, determining the candidate unit structures and constructing the neural network model based on the candidate unit structures includes:
searching for the candidate unit structures, namely a Normal unit and a Reduction unit, with a gradient-based neural network architecture search method;
alternately stacking Normal units and Reduction units according to a set number of network units and a stacking rule to form a neural network backbone;
and placing a classification layer after the neural network backbone to obtain the neural network model.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the method further includes:
and sequentially arranging a pooling layer and a full-connection layer between the neural network main body architecture and the classification layer.
In an exemplary embodiment of the disclosure, based on the foregoing scheme, adding the mask module to the neural network model to construct a corresponding generator includes:
adding a mask module to the intermediate state nodes of the neural network model, so as to apply sparse mask values to the feature values of the intermediate state nodes, thereby forming the generator.
In an exemplary embodiment of the disclosure, based on the foregoing solution, constructing the generative adversarial network model from the baseline network, the generator, and the discriminator includes:
marking the classification-layer predictions of the baseline network as true labels and the classification-layer predictions of the generator as false labels;
using a multilayer perceptron network as the discriminator;
and performing binary classification learning on the outputs of the baseline network and the generator with the discriminator to form the generative adversarial network model.
In an exemplary embodiment of the disclosure, based on the foregoing scheme, training the generative adversarial network model includes:
fixing the parameter values of the generator and the mask module, and training the generative adversarial network model on training data so as to update the parameters of the discriminator;
and fixing the updated parameters of the discriminator, and training the generative adversarial network model on training data so as to update the parameters of the generator and the mask module.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, determining the intermediate state nodes to be pruned according to the trained generative adversarial network model includes:
determining the intermediate state nodes to be pruned according to the mask values that the trained mask module assigns to the intermediate state nodes of the generative adversarial network model.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, pruning the intermediate state nodes to be pruned in the neural network model includes:
pruning, in each unit, the intermediate state node to be pruned together with its edges to preceding nodes and its edges to succeeding nodes, forming a corresponding new unit structure;
and recombining the new unit structures according to the network architecture of the neural network model to obtain the simplified neural network model.
According to a second aspect of the present disclosure, there is provided a voiceprint recognition apparatus, comprising:
a network model construction module, configured to determine candidate unit structures, construct a neural network model based on the candidate unit structures, and train the neural network model;
a generator construction module, configured to add a mask module to the neural network model and construct a corresponding generator;
a generative adversarial network model construction module, configured to take the trained neural network model as a baseline network, construct a generative adversarial network model from the baseline network, the generator, and a discriminator, and train the generative adversarial network model;
a to-be-pruned node determination module, configured to determine intermediate state nodes to be pruned according to the trained generative adversarial network model;
a network structure determination module, configured to prune the intermediate state nodes to be pruned in the neural network model to obtain a simplified neural network model;
a voiceprint feature extraction module, configured to retrain the simplified neural network model and extract voiceprint features of a target to be recognized with the trained simplified neural network model to obtain a voiceprint feature vector of the target to be recognized;
and a recognition result determination module, configured to determine a voiceprint recognition result according to the similarity between the voiceprint feature vector of the target to be recognized and labeled voiceprint feature vectors.
According to a third aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of the above.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of any one of the above via execution of the executable instructions.
Exemplary embodiments of the present disclosure may have some or all of the following benefits:
in the voiceprint recognition method provided by the disclosed example embodiment, a neural network model can be constructed and trained from candidate unit structures found by neural architecture search; a generator is built, the trained neural network model serves as a baseline network, a generative adversarial network model is then constructed and trained, and the intermediate state nodes to be pruned are determined and pruned according to the trained generative adversarial network model; the pruned, simplified neural network model is retrained, and the trained network performs voiceprint recognition on the speech to be recognized. On the one hand, a strong network architecture is obtained by the neural architecture search algorithm, ensuring both the running speed and the effectiveness of the network; on the other hand, pruning and compressing the neural network model with a generative adversarial network model greatly reduces the model size without degrading the recognition effect, lowering the consumption of computing resources; furthermore, the new unit structures obtained by pruning greatly reduce the number of network parameters and the amount of computation, improving voiceprint recognition efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
Fig. 1 is a schematic diagram illustrating an exemplary system architecture to which a voiceprint recognition method and apparatus of the disclosed embodiments may be applied;
FIG. 2 schematically illustrates a flow diagram of a voiceprint recognition method according to one embodiment of the present disclosure;
FIG. 3 schematically shows a neural network model building flow diagram in accordance with one embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating a structure of a candidate unit searched in one embodiment according to the present disclosure; wherein (a) is a structural schematic diagram of a Normal unit, and (b) is a structural schematic diagram of a Reduction unit;
FIG. 5 schematically illustrates a cell stacking approach in a neural network model according to one embodiment of the present disclosure;
FIG. 6 schematically shows a network architecture diagram of a neural network model in accordance with an embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow diagram of building the generative adversarial network model according to one embodiment of the present disclosure;
FIG. 8 schematically illustrates a structural diagram of the generative adversarial network model according to one embodiment of the present disclosure;
FIG. 9 schematically illustrates the structure of a Normal unit after its intermediate state nodes have been pruned, according to one embodiment of the present disclosure;
FIG. 10 schematically illustrates a voiceprint feature vector acquisition flow diagram for a target to be identified in an embodiment according to the present disclosure;
FIG. 11 schematically illustrates spectrogram patterns of two different speakers in accordance with one embodiment of the present disclosure; wherein, each graph corresponds to a spectrogram of a speaker;
FIG. 12 schematically shows the training accuracy and test accuracy of different network models as a function of training rounds in one embodiment according to the present disclosure; wherein (a) corresponds to VGG16, (b) to ResNet18, (c) to a conventional DARTS network, and (d) to the simplified neural network of the present disclosure;
FIG. 13 schematically illustrates a block diagram of a voiceprint recognition apparatus in one embodiment according to the present disclosure;
FIG. 14 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
Fig. 1 shows a schematic diagram of a system architecture 100 of an exemplary application environment to which a voiceprint recognition method and apparatus of an embodiment of the present disclosure may be applied. As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few. The terminal devices 101, 102, 103 may be various electronic devices having a display screen, including but not limited to desktop computers, portable computers, smart phones, tablet computers, and the like. It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, server 105 may be a server cluster comprised of multiple servers, or the like.
The voiceprint recognition method provided by the embodiment of the present disclosure can be executed in the server 105, and accordingly, a voiceprint recognition apparatus is generally disposed in the server 105. The voiceprint recognition method provided by the embodiment of the present disclosure may also be executed by the terminal devices 101, 102, and 103, and correspondingly, the voiceprint recognition apparatus may also be disposed in the terminal devices 101, 102, and 103.
A neural network contains thousands of nodes and network parameters, and how to optimize the network structure and parameters is a technical difficulty in this field. The present method builds a neural network model based on the Differentiable Architecture Search (DARTS) algorithm and prunes redundant intermediate state nodes in the network with a generative adversarial network model, which preserves the network's effectiveness while compressing its structure, reducing the parameter count, and lowering the consumption of computing resources.
The technical solution of the embodiment of the present disclosure is explained in detail below:
referring to fig. 2, a voiceprint recognition method according to an example embodiment of the present disclosure may include:
step S210, determining candidate unit structures, constructing a neural network model based on the candidate unit structures, and training the neural network model;
step S220, adding a mask module to the neural network model to construct a corresponding generator;
step S230, taking the trained neural network model as a baseline network, constructing a generative adversarial network model from the baseline network, the generator, and a discriminator, and training the generative adversarial network model;
step S240, determining intermediate state nodes to be pruned according to the trained generative adversarial network model;
step S250, pruning the intermediate state nodes to be pruned in the neural network model to obtain a simplified neural network model;
step S260, retraining the simplified neural network model, and extracting voiceprint features of the target to be recognized with the trained simplified neural network model to obtain a voiceprint feature vector of the target to be recognized;
and step S270, determining a voiceprint recognition result according to the similarity between the voiceprint feature vector of the target to be recognized and labeled voiceprint feature vectors.
In the voiceprint recognition method provided by this exemplary embodiment, a neural network model may be constructed and trained based on candidate unit structures found by neural architecture search; the trained neural network model is taken as a baseline network, a generator is built, a generative adversarial network model is then constructed and trained, the intermediate state nodes to be pruned are determined according to the trained generative adversarial network model, and the nodes are pruned; the pruned, simplified neural network model is retrained, and the trained network performs voiceprint recognition on the speech to be recognized. On the one hand, a strong network architecture is obtained by the neural architecture search algorithm, ensuring both the running speed and the effectiveness of the network; on the other hand, pruning and compressing the neural network model with a generative adversarial network model greatly reduces the model size without degrading the recognition effect, lowering the consumption of computing resources; furthermore, the new unit structures obtained by pruning greatly reduce the parameter count and the amount of computation, improve voiceprint recognition efficiency, and further lower computing resource and energy consumption.
Next, in another embodiment, the above steps are explained in more detail.
In step S210, candidate unit structures are determined, and a neural network model is constructed based on the candidate unit structures and trained.
In this example embodiment, the corresponding data set may be selected according to the classification task, and the training data may be selected from the existing voiceprint data set. Specifically, the neural network model is constructed through the following steps S310 to S330.
In step S310, the candidate unit structures, namely a Normal unit and a Reduction unit, are obtained by searching with a gradient-based neural network architecture search method. In the present exemplary embodiment, DARTS is taken as the example of a gradient-based search method. First, a softmax is applied over all candidate basic operations (e.g., convolution or pooling) to relax the categorical choice of a particular operation into a continuous one; the softmax introduces weights for the candidate operations, which are used to compute the expected output of each layer. When DARTS converges, only the operation with the largest relative weight is selected and kept in the final model, and the other candidate operations are deleted.
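The patent gives no code; the following Python sketch illustrates the DARTS-style continuous relaxation just described, in which a softmax over learnable architecture weights replaces the discrete choice among candidate operations. All names and the zero initialization are illustrative assumptions, not part of the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One edge of a DARTS cell: a softmax-weighted sum of candidate operations."""
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)                       # e.g. convolutions, poolings
        self.alpha = nn.Parameter(torch.zeros(len(candidate_ops)))   # architecture weights

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)   # relax the categorical choice
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

# After the search converges, only the highest-weighted operation is kept:
# chosen_op = mixed_op.ops[int(mixed_op.alpha.argmax())]
```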
For example, the DARTS search algorithm is used in this example to determine the two candidate unit structures that make up the model architecture. One is the Normal unit, shown in fig. 4(a), containing four intermediate state nodes numbered 0, 1, 2, and 3; a Normal unit applies convolutions that do not change the size of the input feature map. The other is the Reduction unit, shown in fig. 4(b), also containing four intermediate state nodes numbered 0, 1, 2, and 3; a Reduction unit applies convolutions with a larger stride that halve the length and width of the input feature map. As can be seen from figs. 4(a) and 4(b), the candidate operations on the edges connecting each node with its preceding and succeeding nodes differ between the Normal unit and the Reduction unit, so the two are distinct unit structures.
In step S320, Normal units and Reduction units are alternately stacked according to the set number of network units and the stacking rule to form the neural network backbone.
In the present exemplary embodiment, the number of network units, i.e., the network depth, may be set according to the classification task, and the stacking rule may be determined from the dimensions of the input data. The stacking rule may alternate one Normal unit with one Reduction unit up to the set number of units; or alternate several Normal units with one Reduction unit; or alternate different numbers of Normal and Reduction units at each step, e.g., 2 Normal units and 1 Reduction unit in the first step, then 3 Normal units and 2 Reduction units in the second. Preferably, in this example, one Reduction unit is placed at one third and another at two thirds of the network depth, and all remaining units are Normal units; the resulting splicing of network units is shown in fig. 5, where x is the input data, Z_l is the output of unit l (the features extracted by unit l), and Z_n is the output of the n-th unit. A minimal sketch of this layout appears below.
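The sketch assumes hypothetical factory functions make_normal and make_reduction; only the placement logic reflects the preferred layout above.

```python
def build_backbone(num_cells, make_normal, make_reduction):
    """Stack cells with Reduction units at 1/3 and 2/3 of the depth,
    Normal units everywhere else."""
    reduction_at = {num_cells // 3, 2 * num_cells // 3}
    return [make_reduction() if i in reduction_at else make_normal()
            for i in range(num_cells)]
```

For num_cells = 10 this places Reduction units at zero-based positions 3 and 6, i.e., the 4th and 7th units, matching the specific embodiment described later.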
In step S330, a classification layer is placed after the neural network backbone to obtain the neural network model. In the present exemplary embodiment, a softmax classification layer may be placed after the backbone formed by the Normal and Reduction units; it normalizes the extracted features and maps them into (0, 1), which can be regarded as probability values for multi-class classification. In addition, a convolution module can be added before the first unit to reduce the dimensionality of the input features so that DARTS can run its architecture search on a single GPU; for example, a convolution operator with stride 2 and kernel size 3 is added before the first unit. The resulting neural network model is shown in fig. 6, where N denotes the number of Normal units at the corresponding position.
In another embodiment of the present disclosure, a pooling layer and a fully connected layer may be placed, in order, between the neural network backbone and the classification layer, giving a model composed of the backbone, the pooling layer, the fully connected layer, and the classification layer. The pooling layer in this example embodiment may be a max-pooling layer that reduces the dimensionality of the features extracted by the backbone and increases feature invariance, e.g., to rotation and translation. The fully connected layer integrates the features extracted by all neurons into the final feature vector. The classification layer is a softmax layer that performs normalized classification of the features.
Finally, the constructed neural network model is trained. In this example embodiment, the network parameters may first be randomly initialized; the data set may be selected according to the classification task and split into a training set and a test set as required; the initialized model is then trained on the training data, with the network parameters updated iteratively by a gradient descent algorithm under a cross-entropy loss function. Training completion may be determined by a maximum number of training rounds or by a threshold on the difference between the losses of two adjacent rounds. A plain training loop of this form is sketched below.
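The sketch below, in Python/PyTorch, shows such a loop; the optimizer choice and hyperparameters (SGD with momentum, the learning rate) are illustrative assumptions.

```python
import torch

def train(model, loader, epochs, lr=0.025):
    """Iteratively update network parameters by gradient descent
    under a cross-entropy loss, for a fixed number of epochs."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)   # cross-entropy on the classifier output
            loss.backward()
            opt.step()                    # gradient descent update
```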
In step S220, a mask module is added to the neural network model to construct a corresponding generator. In this example embodiment, a mask module is added to the intermediate state nodes of each unit of the baseline network, so that a sparse mask value (a soft mask) is applied to the feature value of each intermediate state node; the baseline network with the mask modules added is the generator. One plausible parameterization is sketched below.
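A minimal sketch, assuming the mask is a learnable scalar per intermediate state node constrained to [0, 1] when applied; the patent does not fix this parameterization.

```python
import torch
import torch.nn as nn

class SoftMask(nn.Module):
    """Learnable sparse mask scaling the feature of one intermediate state node."""
    def __init__(self, init=1.0):
        super().__init__()
        self.m = nn.Parameter(torch.tensor(init))

    def forward(self, node_feature):
        # The mask value is clamped into [0, 1] when applied; nodes whose
        # trained mask is near 0 become candidates for pruning.
        return self.m.clamp(0.0, 1.0) * node_feature
```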
In step S230, the trained neural network model is taken as a baseline network, a generative adversarial network model is constructed from the baseline network, the generator, and a discriminator, and the generative adversarial network model is trained. In this exemplary embodiment, the neural network model constructed in step S210, or the trained neural network model of any of the above embodiments, may serve as the baseline network. Referring to fig. 7, the generative adversarial network model is constructed and trained through the following steps S710 to S740.
In step S710, the classification-layer predictions of the baseline network are marked as true labels, and the classification-layer predictions of the generator are marked as false labels. In the present example embodiment, the softmax predictions of the baseline network are labeled true and the softmax predictions of the generator are labeled false.
In step S720, a multilayer perceptron network is used as the discriminator. A multilayer perceptron (MLP), also referred to as an artificial neural network (ANN), comprises an input layer, an output layer, and several hidden layers in between; it is a feedforward neural network model that maps multiple input data sets onto a single output data set. In this example, the outputs of the baseline network and the generator may each be fed into the multilayer perceptron network, mapping the two input data sets onto one output data set.
In step S730, the discriminator performs binary classification learning on the output of the baseline network and the output of the generator, thereby forming the generative adversarial network model. In this example embodiment, the baseline network and the generator are two parallel networks: the input data are fed into both, their outputs are passed to the discriminator, and the discriminator judges them and outputs a decision. Taking image data as an example, a generator G(x) tries to produce images similar to the training set, while the discriminator D(x), a binary classifier, tries to distinguish real pictures from the training set x from the fake pictures produced by the generator; the generator thus learns the distribution of the data in x so as to generate realistic images that the discriminator cannot tell apart from real ones. This forms the generative adversarial network model, whose structure is shown in fig. 8; the generator and the discriminator are trained in an adversarial game. A minimal discriminator sketch appears below.
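The sketch assumes the discriminator consumes the classification-layer outputs directly; the hidden width and depth are illustrative assumptions.

```python
import torch.nn as nn

class MLPDiscriminator(nn.Module):
    """Binary classifier over classification-layer outputs: baseline-network
    outputs are labeled real (1), generator outputs are labeled fake (0)."""
    def __init__(self, num_classes, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_classes, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),   # probability of "real"
        )

    def forward(self, outputs):
        return self.net(outputs)
```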
In step S740, the generative adversarial network model is trained. In the present exemplary embodiment, it may be trained through the following steps.
First, the parameter values of the generator and the mask module are fixed, and the discriminator is trained on training data so as to update its parameters. In this exemplary embodiment, the parameter values of the mask module and the discriminator are randomly initialized, with the mask values taken in [0, 1]. After initialization, the generator's network parameters and the mask values are held fixed; training data are fed into the baseline network and the generator, the discriminator judges the generator's output as true or false, a loss function (cross-entropy or MSE loss) is computed from the decision and the true label provided by the baseline network, and the discriminator's network parameters are updated, e.g., by gradient descent. Once the discriminator's decisions meet the requirement (e.g., a set number of training rounds or a target accuracy is reached), discriminator training ends.
Then, the updated parameters of the discriminator are fixed, and the generator is trained on training data so as to update the parameters of the generator and the mask module. In the present exemplary embodiment, the trained discriminator's network parameters may be held fixed; training data are fed into the baseline network and the generator, and the parameters of the generator and the mask module are adjusted by computing the loss function (cross-entropy or MSE loss) on the generator's output, e.g., with a gradient descent update, until training ends and a trained generator is obtained. One step of this alternating scheme is sketched below.
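In the sketch, a binary cross-entropy loss serves as a concrete stand-in for the loss described above, and all function and variable names are assumptions.

```python
import torch

def adversarial_step(x, baseline, generator, disc, d_opt, g_opt, bce):
    # Phase 1: generator and mask parameters are held fixed (outputs detached);
    # only the discriminator optimizer steps.
    real = disc(baseline(x).detach())
    fake = disc(generator(x).detach())
    d_loss = bce(real, torch.ones_like(real)) + bce(fake, torch.zeros_like(fake))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Phase 2: only the generator/mask optimizer steps, so the updated
    # discriminator stays fixed; the generator and mask parameters are
    # pushed to produce outputs the discriminator judges "real".
    fake = disc(generator(x))
    g_loss = bce(fake, torch.ones_like(fake))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```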
In step S240, the intermediate state nodes to be pruned are determined according to the trained generative adversarial network model, i.e., according to the mask values that the trained generator's mask module assigns to the intermediate state nodes. For example, intermediate state nodes whose mask values are close to 0 may be selected for pruning, such as nodes with mask values in [0, 0.1]; preferably, nodes whose mask value is 0 are selected.
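In code, this selection is a simple threshold over the trained mask values; the default threshold 0.1 mirrors the [0, 0.1] range above.

```python
def nodes_to_prune(mask_values, thresh=0.1):
    """Indices of intermediate state nodes whose trained mask is near zero."""
    return [i for i, m in enumerate(mask_values) if m <= thresh]
```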
In step S250, the intermediate state nodes to be pruned in the neural network model are pruned to obtain the simplified neural network model.
In the present exemplary embodiment, each unit in the neural network model is a directed acyclic graph over a sequence of ordered nodes; each edge in the graph carries a set of candidate operations, and each node is a feature tensor. Candidate operations may include max_pool, skip_connect, sep_conv, dil_conv, and avg_pool. As shown in figs. 4(a) and 4(b), the lines with arrows are the edges.
After the intermediate state nodes to be pruned are determined, each such node in each unit structure of the network is pruned together with its edges to preceding nodes and its edges to succeeding nodes, and the unit structure formed by the remaining nodes and edges becomes a new unit structure. Fig. 9 shows a new unit structure after its intermediate state nodes have been pruned; the pruned unit is clearly smaller, with markedly fewer candidate operations. Finally, the new unit structures are assembled according to the network architecture of the neural network model to form the simplified neural network model. A sketch of this pruning step follows.
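The sketch represents a unit as a list of nodes plus a list of (source, target, operation) edges; this representation is an illustrative assumption, not the patent's data structure.

```python
def prune_cell(nodes, edges, to_prune):
    """Remove the pruned intermediate state nodes together with every edge
    connecting them to preceding or succeeding nodes."""
    drop = set(to_prune)
    kept_nodes = [n for n in nodes if n not in drop]
    kept_edges = [(u, v, op) for (u, v, op) in edges
                  if u not in drop and v not in drop]
    return kept_nodes, kept_edges
```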
In step S260, the simplified neural network model is retrained, and the trained simplified neural network model is used to extract the voiceprint features of the target to be recognized, yielding the voiceprint feature vector of the target.
In the present exemplary embodiment, speech training data are first acquired; they comprise spectrograms extracted from speech data together with their labels, where each spectrogram carries the label of one speaker. The speech training data may be taken from the 100 speakers in the VoxCeleb1 dataset whose speaking durations differ the least. In this 100-speaker voiceprint dataset, the shortest speaking time is 10 minutes and the longest is 20 minutes. Each speaker's speech is cut into audio files at fixed intervals, each segment is converted into a spectrogram, and the spectrogram is labeled. Spectrogram features visibly reflect the distinguishability of two speakers' voiceprints, and such discriminative input features are favorable for recognition by the neural network model.
Then, the simplified neural network model is trained on the speech training data, its network parameters being updated iteratively by a gradient descent algorithm under a cross-entropy loss function.
In the present exemplary embodiment, referring to fig. 10, voiceprint feature extraction of the target to be recognized is realized through steps S1010 to S1040.
In step S1010, the speech data to be recognized are acquired. In this example embodiment, the speech data may be captured through the recording function of the terminal device, or may be speech played by another audio playback device; this example places no particular limitation on the source.
In step S1020, voice activity detection is performed on the speech data to be recognized to obtain valid speech data. In the present exemplary embodiment, a VAD algorithm may be applied to each utterance to remove content that is meaningless for recognition, such as silence.
In step S1030, a spectrogram of the valid speech data is extracted and taken as the target to be recognized. In the present exemplary embodiment, a spectrogram may be extracted from each speaker's speech at intervals that can be set as needed, e.g., a few seconds or ten-odd seconds. Preferably, a frame length of 25 ms and a frame shift of 10 ms may be used. The target to be recognized, i.e., the input to the simplified neural network model, is thus a spectrogram; a sketch of this extraction appears below.
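The sketch uses the preferred 25 ms frame length and 10 ms frame shift and assumes the speech has already passed voice activity detection (step S1020); librosa is an assumed tool here, not one named by the patent.

```python
import librosa
import numpy as np

def extract_spectrogram(path, sr=16000):
    """Log-magnitude spectrogram with a 25 ms window and 10 ms frame shift."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    win = int(0.025 * sr)   # 25 ms frame length
    hop = int(0.010 * sr)   # 10 ms frame shift
    s = np.abs(librosa.stft(y, n_fft=win, hop_length=hop, win_length=win))
    return librosa.amplitude_to_db(s, ref=np.max)
```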
In step S1040, the voiceprint features of the target to be recognized are extracted by the trained candidate unit structures of the simplified neural network model and integrated by the fully connected layer to obtain the voiceprint feature vector of the target. In this exemplary embodiment, the spectrogram to be recognized passes through the new unit structures of the trained simplified neural network model, which extract the corresponding voiceprint features; the fully connected layer then synthesizes all extracted features into the voiceprint feature vector of the target, which contains the voiceprint feature information extracted by each network unit.
In step S270, the voiceprint recognition result is determined according to the similarity between the voiceprint feature vector of the target to be recognized and labeled voiceprint feature vectors. In the present exemplary embodiment, the similarity may be the cosine similarity between two feature vectors. For voiceprint verification, the cosine similarity between the target's voiceprint feature vector and the corresponding enrolled voiceprint feature vector is computed; if the similarity exceeds a preset threshold, verification passes, otherwise it is rejected. For voiceprint identification, the cosine similarity between the target's voiceprint feature vector and every voiceprint feature vector in the voiceprint library is computed, the feature vector with the highest similarity is selected as the recognition result, and the speaker corresponding to that vector is taken as the identity. Both decisions are sketched below.
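In the sketch, PyTorch's cosine_similarity stands in for the cosine computation, and the 0.7 threshold is an illustrative assumption (the patent says only "a preset threshold").

```python
import torch
import torch.nn.functional as F

def verify(query, enrolled, threshold=0.7):
    """Voiceprint verification: pass if cosine similarity exceeds the threshold."""
    sim = F.cosine_similarity(query, enrolled, dim=0).item()
    return sim >= threshold

def identify(query, gallery):
    """Voiceprint identification: index of the most similar enrolled vector."""
    sims = [F.cosine_similarity(query, g, dim=0).item() for g in gallery]
    return max(range(len(sims)), key=sims.__getitem__)
```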
For example, the voiceprint recognition effect of the present disclosure is illustrated by one specific embodiment.
First, a training data set is created from the 100 speakers in the VoxCeleb1 dataset whose speaking durations differ the least. VoxCeleb1 is one of the more common text-independent voiceprint recognition source datasets; its audio is extracted from YouTube videos, with the speech format unified to a 16 kHz sampling rate, mono. In this 100-speaker voiceprint dataset, the shortest speaking time is 10 minutes and the longest is 20 minutes; the total speech duration is 1647 minutes, averaging 16 minutes per speaker. Each speaker's speech is cut into an audio file every 3 seconds, and each segment is converted into a 360 dpi × 360 dpi spectrogram. Spectrogram features visibly reflect the distinguishability of two speakers' voiceprints, and such discriminative input features are favorable for recognition by the neural network model; the spectrograms extracted from two different speakers' speech are shown in fig. 11, where the difference between the two panels is evident.
The spectrogram dataset is divided into a training set and a test set: the training set contains 30931 spectrogram images, an average of 310 per speaker, and the test set is fixed at 20 spectrogram images per speaker, 2000 in total.
Because the extracted spectrogram is a special 360 dpi × 360 dpi picture, a large, high-dimensional image, a convolution operator with stride 2 and kernel size 3 is added before the first unit of the neural network model to reduce the dimensionality of the input spectrogram, so that the neural network model can run its architecture search on a single GPU.
First, the gradient-based network architecture search method is run on the constructed voiceprint dataset, yielding the two candidate unit structures, Normal and Reduction, shown in figs. 4(a) and 4(b). The number of unit modules in the search is set to 10, the initial channel count of the first unit to 16, and the number of intermediate state nodes per unit to 4.
The 10 searched cell structures are spliced according to the set rule, with the 4th and 7th units being Reduction units and the rest Normal units, and the network is trained. The trained neural network model then serves as the baseline network; the intermediate nodes to be pruned are determined with the generative adversarial network model, and structural pruning of the neural network model is performed. The retention of intermediate state nodes in the 10 units of the pruned network is shown in Table 1.
As can be seen from Table 1, the retained nodes differ from unit to unit; that is, each unit's pruned nodes differ, so each unit must be considered separately when pruning nodes. Fig. 9 shows the structure of unit No. 2 of Table 1.
TABLE 1 Node retention of each unit after neural network pruning
[Table 1 is provided as an image in the original document.]
In addition, this embodiment also performs model training and voiceprint recognition tests on the same dataset with different network models; the results are shown in fig. 12 and Table 2.
TABLE 2 Performance results of different network models on the voiceprint dataset
[Table 2 is provided as an image in the original document.]
As can be seen from Table 2, VGG16 reaches a test accuracy of 87.30% on the voiceprint dataset, and fig. 12(a) shows its training accuracy reaching 100%; the large gap between training and test accuracy indicates fairly severe overfitting. ResNet18 reaches a test accuracy of 88.95%, and fig. 12(b) shows that its overfitting is the mildest, though its accuracy is lower. The DARTS network spliced from 10 units with the existing unit structure reaches 94.05%, with mild overfitting as seen in fig. 12(c). The simplified network spliced from the pruned units reaches a test accuracy of 93.30%, also with mild overfitting as seen in fig. 12(d). Meanwhile, Table 2 shows that its parameter count is several million lower than the other network models, while its accuracy drops very little compared with the existing DARTS network.
Specifically, VGG16 has on the order of a hundred million parameters, and training can be abnormally slow; ResNet18's parameter count is greatly reduced compared with VGG16 but is still in the tens of millions. The directly spliced DARTS network has 1.1 million parameters, while the pruned network of this method has only 0.68 million, i.e., only 61.8% of the original DARTS network. Of the 40 intermediate state nodes in the original DARTS network, 23 are pruned and only 17 retained, showing that on a voiceprint recognition dataset, directly concatenating the intermediate state nodes along the channel dimension as a unit's output, as the original DARTS does, still produces many redundant intermediate state nodes. The experimental results of this embodiment show that the network optimization method of extracting acoustic features from spectrograms and pruning nodes as described here is effective on the voiceprint recognition task; compared with the original DARTS algorithm, it reduces the parameter count of the original network by nearly 40% with essentially no loss of accuracy, and the structure is optimized.
According to the voiceprint recognition method provided by the disclosure, on the one hand, a strong network architecture is obtained by the neural architecture search algorithm, ensuring the running speed and effectiveness of the network; on the other hand, pruning and compressing the neural network model with a generative adversarial network model greatly reduces the model size without degrading the voiceprint recognition effect, lowering the consumption of computing resources; furthermore, the new unit structures obtained by pruning greatly reduce the parameter count and computation, further lowering computing resource and energy consumption. In addition, the pruned network can be deployed on the network side and on mobile terminal devices, giving it a wide range of applications.
On the one hand, the method searches out a strong network unit structure with a gradient-based neural architecture search algorithm to build the baseline network, which is faster and more effective than conventional network models; on the other hand, pruning and compressing the network model with a generative adversarial network model lets the pruned model learn adversarially toward the well-trained baseline network, greatly compressing the model size without degrading recognition; furthermore, the pruned unit structures greatly reduce the parameter count and computation, speeding up forward inference, saving computing resources, and lowering energy consumption.
The simplified neural network model can be deployed on mobile devices, with low energy consumption and high practicality.
The simplified neural network model can also be used in fields such as image classification.
Further, in the present exemplary embodiment, a voiceprint recognition apparatus 1300 is also provided, which can be applied to a server or a terminal device. Referring to fig. 13, the voiceprint recognition apparatus 1300 may include:
a network model building module 1310, which may be configured to determine candidate cell structures; constructing a neural network model based on the candidate unit structure and training the neural network model;
the producer building module 1320 may be configured to add a mask module to the base-line network structure to build a corresponding producer.
A generation confrontation network model construction module 1330, configured to use the trained neural network model as a baseline network; constructing a generated countermeasure network model through the baseline network, the generator and the discriminator, and training the generated countermeasure network model;
the node to be clipped determining module 1340 may be configured to determine an intermediate state node to be clipped according to the trained generated confrontation network model.
The network structure determining module 1350 may be configured to cut the intermediate state node to be cut in the neural network model, so as to obtain a neural network model with a simplified structure.
The voiceprint feature extraction module 1360 may be configured to retrain the reduced structure neural network model, extract a voiceprint feature of the target to be recognized by using the trained reduced structure neural network model, and obtain a voiceprint feature vector of the target to be recognized;
the recognition result determining module 1370 may be configured to determine a voiceprint recognition result according to a similarity between the voiceprint feature vector of the target to be recognized and the labeled voiceprint feature vector.
In an exemplary embodiment of the present disclosure, the network model construction module 1310 includes:
a unit structure search module, which may be used to search for the candidate unit structures, namely a Normal unit and a Reduction unit, with a gradient-based neural network architecture search method;
a unit stacking module, which may be used to alternately stack Normal units and Reduction units according to the set number of network units and the stacking rule to form the neural network backbone;
a classification module, which may be used to place a classification layer after the neural network backbone to obtain the neural network model;
and a first training module, which may be used to train the neural network model.
In an exemplary embodiment of the disclosure, the generative adversarial network model construction module 1330 includes:
a labeling module, which may be used to mark the classification-layer predictions of the baseline network as true labels and those of the generator as false labels;
a discriminator construction module, which may be used to take a multilayer perceptron network as the discriminator;
an adversarial network forming module, which may be used to perform binary classification learning on the outputs of the baseline network and the generator with the discriminator to form the generative adversarial network model;
and a second training module, which may be used to train the generative adversarial network model.
In an exemplary embodiment of the present disclosure, the second training module includes:
a discriminator training module, which may be used to fix the parameter values of the generator and the mask module and train the discriminator on training data so as to update the discriminator's parameters;
and a generator training module, which may be used to fix the updated parameters of the discriminator and train the generator on training data so as to update the parameters of the generator and the mask module.
In an exemplary embodiment of the present disclosure, the network structure determination module 1350 includes:
a pruning module, which prunes, in each unit, the intermediate state node to be pruned together with its edges to preceding nodes and its edges to succeeding nodes, forming a corresponding new unit structure;
and a combination module, which recombines the new unit structures according to the network architecture of the neural network model to obtain the simplified neural network model.
In an exemplary embodiment of the disclosure, the voiceprint feature extraction module 1360 includes:
The voice acquisition module may be configured to acquire voice data to be recognized;
the activity detection module may be configured to perform voice activity detection on the voice data to be recognized to obtain valid voice data;
the spectrogram extracting module may be configured to extract a spectrogram of the valid voice data and take the spectrogram as the target to be recognized;
the feature vector extraction module may be configured to extract the voiceprint features of the target to be recognized through the candidate unit structures of the trained simplified structure neural network model, and to synthesize the voiceprint features through a fully connected layer to obtain the voiceprint feature vector of the target to be recognized. An illustrative pipeline sketch follows.
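As a self-contained illustration of this pipeline (not the disclosure's actual detector or features), a naive energy-based voice activity detector and a magnitude-spectrogram extractor might look like this in NumPy; the frame size, FFT size, hop and energy threshold are all assumptions:

import numpy as np

def energy_vad(wave, frame=400, thresh=1e-4):
    """Naive voice activity detection: keep frames whose mean energy
    exceeds a threshold, concatenated as the valid voice data."""
    frames = wave[: len(wave) // frame * frame].reshape(-1, frame)
    voiced = (frames ** 2).mean(axis=1) > thresh
    return frames[voiced].reshape(-1)

def spectrogram(wave, n_fft=512, hop=160):
    """Magnitude STFT of the valid voice data, used as the target to be recognized."""
    windows = np.lib.stride_tricks.sliding_window_view(wave, n_fft)[::hop]
    return np.abs(np.fft.rfft(windows * np.hanning(n_fft), axis=1)).T

# A trained simplified structure model would then map this spectrogram through
# its unit structures and a fully connected layer to the voiceprint embedding.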
In an exemplary embodiment of the present disclosure, the recognition result determining module 1370 includes:
The similarity calculation module may be configured to calculate the cosine similarity between the voiceprint feature vector of the target to be recognized and the labeled voiceprint feature vector.
The identification module may be configured to determine the voiceprint recognition result according to the cosine similarity; a minimal scoring sketch is given below.
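A minimal scoring sketch, assuming voiceprint feature vectors are NumPy arrays and enrolled maps speaker labels to their stored vectors; the acceptance threshold is an assumed value:

import numpy as np

def cosine(a, b):
    """Cosine similarity between two voiceprint feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def identify(probe, enrolled, threshold=0.7):
    """Score the probe against every labeled voiceprint and accept the best
    match only if its cosine similarity clears the threshold."""
    scores = {label: cosine(probe, vec) for label, vec in enrolled.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None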
The specific details of each module or unit in the voiceprint recognition apparatus have been described in detail in the corresponding voiceprint recognition method, and therefore are not described herein again.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the method described in the above embodiments. For example, the electronic device may implement the steps shown in fig. 2, 3, 7, and 10.
It should be noted that the computer readable media shown in the present disclosure may be computer readable signal media or computer readable storage media or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
FIG. 14 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present disclosure.
It should be noted that the computer system 1400 of the electronic device shown in fig. 14 is only an example, and should not bring any limitation to the functions and the scope of the application of the embodiments of the present disclosure.
As shown in fig. 14, the computer system 1400 includes a Central Processing Unit (CPU) 1401, which can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 1402 or a program loaded from a storage portion 1408 into a Random Access Memory (RAM) 1403. The RAM 1403 also stores various programs and data necessary for system operation. The CPU 1401, the ROM 1402, and the RAM 1403 are connected to one another via a bus 1404. An input/output (I/O) interface 1405 is also connected to the bus 1404.
The following components are connected to the I/O interface 1405: an input portion 1406 including a keyboard, a mouse, and the like; an output portion 1407 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage portion 1408 including a hard disk and the like; and a communication portion 1409 including a network interface card such as a LAN card or a modem. The communication portion 1409 performs communication processing via a network such as the Internet. A drive 1410 is also connected to the I/O interface 1405 as needed. A removable medium 1411, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1410 as needed, so that a computer program read therefrom is installed into the storage portion 1408.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 1409 and/or installed from the removable medium 1411. When executed by the Central Processing Unit (CPU) 1401, the computer program performs the various functions defined in the methods and apparatus of the present application.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that although the various steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc., are all considered part of this disclosure.
It should be understood that the disclosure disclosed and defined in this specification extends to all alternative combinations of two or more of the individual features mentioned or evident from the text and/or drawings. All of these different combinations constitute various alternative aspects of the present disclosure. The embodiments of this specification illustrate the best mode known for carrying out the disclosure and will enable those skilled in the art to utilize the disclosure.

Claims (14)

1. A voiceprint recognition method, comprising:
determining a candidate unit structure; constructing a neural network model based on the candidate unit structure and training the neural network model;
adding a mask module into the neural network model to construct a corresponding generator;
taking the trained neural network model as a baseline network, constructing a generative adversarial network model from the baseline network, the generator and a discriminator, and training the generative adversarial network model;
determining intermediate state nodes to be cut according to the trained generative adversarial network model;
cutting the intermediate state nodes to be cut in the neural network model to obtain a simplified structure neural network model;
retraining the simplified structure neural network model, and extracting voiceprint features of a target to be recognized using the trained simplified structure neural network model to obtain a voiceprint feature vector of the target to be recognized; and
determining a voiceprint recognition result according to a similarity between the voiceprint feature vector of the target to be recognized and a labeled voiceprint feature vector.
2. The voiceprint recognition method according to claim 1, wherein the determining a candidate unit structure and the constructing a neural network model based on the candidate unit structure comprise:
searching for the candidate unit structures, namely a Normal unit and a Reduction unit, using a gradient-based neural network architecture search method;
alternately stacking the Normal unit and the Reduction unit according to a set number of network units and a stacking rule to form a neural network main body architecture; and
setting a classification layer after the neural network main body architecture to obtain the neural network model.
3. The voiceprint recognition method according to claim 2, further comprising:
sequentially arranging a pooling layer and a fully connected layer between the neural network main body architecture and the classification layer.
4. The voiceprint recognition method according to claim 1, wherein the adding a mask module into the neural network model to construct a corresponding generator comprises:
adding a mask module to the intermediate state nodes of the neural network model so as to add sparse mask values to the feature values of the intermediate state nodes, forming the generator.
5. The voiceprint recognition method according to claim 4, wherein the constructing a generative adversarial network model from the baseline network, the generator and the discriminator comprises:
marking the classification layer predicted values of the baseline network as true labels, and marking the classification layer predicted values of the generator as false labels;
taking a multilayer perceptron network as the discriminator; and
performing two-class learning on the output of the baseline network and the output of the generator using the discriminator to form the generative adversarial network model.
6. The voiceprint recognition method according to claim 5, wherein the training the generative adversarial network model comprises:
fixing the parameter values of the generator and the mask module, and training the discriminator with training data to update the parameters of the discriminator; and
fixing the updated parameters of the discriminator, and training the generator with training data to update the parameters of the generator and the mask module.
7. The voiceprint recognition method according to claim 4, wherein the determining the intermediate state nodes to be cut according to the trained generative adversarial network model comprises:
determining the intermediate state nodes to be cut according to the mask values that the trained mask module of the generative adversarial network model assigns to the intermediate state nodes.
8. The voiceprint recognition method according to claim 1, wherein the cutting the intermediate state nodes to be cut in the neural network model comprises:
cutting, in each unit, the intermediate state node to be cut, the edges connecting the intermediate state node with preceding nodes, and the edges connecting the intermediate state node with subsequent nodes, to form a corresponding new unit structure; and
recombining the new unit structures according to the network architecture of the neural network model to obtain the simplified structure neural network model.
9. The voiceprint recognition method according to claim 1, wherein the retraining the simplified structure neural network model comprises:
acquiring voice training data, the voice training data comprising spectrograms extracted from voice data and labels thereof; and
performing network parameter training on the simplified structure neural network model with the voice training data, and iteratively updating the network parameters of the simplified structure neural network model during training using a gradient descent algorithm and a cross-entropy loss function.
10. The voiceprint recognition method according to claim 1, wherein the extracting the voiceprint features of the target to be recognized using the trained simplified structure neural network model comprises:
acquiring voice data to be recognized;
performing voice activity detection on the voice data to be recognized to obtain valid voice data;
extracting a spectrogram of the valid voice data, and taking the spectrogram as the target to be recognized; and
extracting the voiceprint features of the target to be recognized through the candidate unit structures of the trained simplified structure neural network model, and synthesizing the voiceprint features through a fully connected layer to obtain the voiceprint feature vector of the target to be recognized.
11. The voiceprint recognition method according to claim 1, wherein the determining a voiceprint recognition result according to the similarity between the voiceprint feature vector of the target to be recognized and the labeled voiceprint feature vector comprises:
calculating a cosine similarity between the voiceprint feature vector of the target to be recognized and the labeled voiceprint feature vector; and
determining the voiceprint recognition result according to the cosine similarity.
12. A voiceprint recognition apparatus comprising:
a network model building module, configured to determine a candidate unit structure, construct a neural network model based on the candidate unit structure, and train the neural network model;
a generator building module, configured to add a mask module into the neural network model to construct a corresponding generator;
a generative adversarial network model building module, configured to take the trained neural network model as a baseline network, construct a generative adversarial network model from the baseline network, the generator and a discriminator, and train the generative adversarial network model;
a node-to-be-cut determining module, configured to determine intermediate state nodes to be cut according to the trained generative adversarial network model;
a network structure determining module, configured to cut the intermediate state nodes to be cut in the neural network model to obtain a simplified structure neural network model;
a voiceprint feature extraction module, configured to retrain the simplified structure neural network model, and to extract voiceprint features of a target to be recognized using the trained simplified structure neural network model to obtain a voiceprint feature vector of the target to be recognized; and
a recognition result determining module, configured to determine a voiceprint recognition result according to a similarity between the voiceprint feature vector of the target to be recognized and a labeled voiceprint feature vector.
13. A computer-readable medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-11.
14. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method of any one of claims 1-11.
CN202111181601.3A 2021-10-11 2021-10-11 Voiceprint recognition method and device, storage medium and electronic equipment Pending CN113870863A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111181601.3A CN113870863A (en) 2021-10-11 2021-10-11 Voiceprint recognition method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111181601.3A CN113870863A (en) 2021-10-11 2021-10-11 Voiceprint recognition method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113870863A true CN113870863A (en) 2021-12-31

Family

ID=78998913

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111181601.3A Pending CN113870863A (en) 2021-10-11 2021-10-11 Voiceprint recognition method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113870863A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115017957A (en) * 2022-06-30 2022-09-06 中国电信股份有限公司 Signal identification method and device, electronic equipment and computer readable medium
CN114926698A (en) * 2022-07-19 2022-08-19 深圳市南方硅谷半导体股份有限公司 Image classification method for neural network architecture search based on evolutionary game theory
CN114926698B (en) * 2022-07-19 2022-10-14 深圳市南方硅谷半导体股份有限公司 Image classification method for neural network architecture search based on evolutionary game theory

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination