CN113870863B - Voiceprint recognition method and device, storage medium and electronic equipment - Google Patents

Voiceprint recognition method and device, storage medium and electronic equipment

Info

Publication number
CN113870863B
Authority
CN
China
Prior art keywords
network model
neural network
voiceprint
training
generator
Prior art date
Legal status
Active
Application number
CN202111181601.3A
Other languages
Chinese (zh)
Other versions
CN113870863A (en)
Inventor
Shen Hao (沈浩)
Zhao Dexin (赵德欣)
Wang Lei (王磊)
Zeng Ranran (曾然然)
Lin Yue (林悦)
Current Assignee
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date
Filing date
Publication date
Application filed by China Telecom Corp Ltd
Priority to CN202111181601.3A
Publication of CN113870863A
Application granted
Publication of CN113870863B
Status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04: Training, enrolment or model building
    • G10L 17/06: Decision making techniques; Pattern matching strategies
    • G10L 17/14: Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L 17/18: Artificial neural networks; Connectionist approaches


Abstract

The disclosure provides a voiceprint recognition method and apparatus, a storage medium, and an electronic device, relating to the technical field of artificial intelligence. The method comprises the following steps: determining candidate unit structures; constructing a neural network model and training it; adding a mask module to the neural network model to construct a generator; taking the trained neural network model as a baseline network, constructing a generative adversarial network (GAN) model from the baseline network, the generator, and a discriminator, and training it; determining intermediate state nodes to be pruned according to the trained generative adversarial network model and pruning them; after retraining the pruned neural network model, extracting voiceprint features of a target to be identified; and determining a voiceprint recognition result based on similarity. The method and apparatus can alleviate the problems that existing neural network architecture search consumes large amounts of computing resources and that the searched network structures have many network parameters and a heavy computational load.

Description

Voiceprint recognition method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and in particular, to a voiceprint recognition method, a voiceprint recognition apparatus, a computer readable medium, and an electronic device.
Background
In recent years, neural network architecture search (NAS, Neural Architecture Search), a technology that automatically searches for an optimal neural network architecture, has achieved performance beyond that of manually designed network architectures on a variety of tasks such as image classification, semantic segmentation, and object detection.
The conventional NAS method selects among candidate network architectures but consumes a large amount of computing resources; moreover, the searched network units are simply stacked together, so a large number of network parameters remain in the overall network.
In other words, existing network architecture search methods consume a large amount of computing resources during the search, and the searched network structures suffer from a large number of network parameters and a heavy computational load.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
An object of the embodiments of the present disclosure is to provide a voiceprint recognition method, a voiceprint recognition apparatus, a computer-readable medium, and an electronic device, so as to alleviate, at least to a certain extent, the problems that existing neural network architecture search consumes large amounts of computing resources and that the searched network structures have many network parameters and a heavy computational load.
According to a first aspect of the present disclosure, there is provided a voiceprint recognition method, including:
determining candidate unit structures; constructing a neural network model based on the candidate unit structures and training the neural network model;
adding a mask module to the neural network model to construct a corresponding generator;
taking the trained neural network model as a baseline network, constructing a generative adversarial network model from the baseline network, the generator and a discriminator, and training the generative adversarial network model;
determining intermediate state nodes to be pruned according to the trained generative adversarial network model;
pruning the intermediate state nodes to be pruned in the neural network model to obtain a structurally simplified neural network model;
retraining the simplified neural network model, and extracting voiceprint features of a target to be identified using the trained simplified neural network model to obtain a voiceprint feature vector of the target to be identified;
and determining a voiceprint recognition result according to the similarity between the voiceprint feature vector of the target to be identified and labelled voiceprint feature vectors.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the determining candidate unit structures and constructing a neural network model based on the candidate unit structures includes:
searching for the candidate unit structures, namely Normal units and Reduction units, using a gradient-based neural network architecture search method;
alternately stacking the Normal units and the Reduction units according to a set number of network units and a stacking rule to form a neural network backbone;
and setting a classification layer after the neural network backbone to obtain the neural network model.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the method further includes:
arranging a pooling layer and a fully connected layer in sequence between the neural network backbone and the classification layer.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the adding a mask module to the neural network model to construct a corresponding generator includes:
adding a mask module to the intermediate state nodes of the neural network model so that a sparse mask value is applied to the feature value of each intermediate state node, forming the generator.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the constructing a generative adversarial network model from the baseline network, the generator and the discriminator includes:
marking the classification-layer predicted values of the baseline network as true labels, and marking the classification-layer predicted values of the generator as false labels;
using a multi-layer perceptron network as the discriminator;
and performing classification learning on the output of the baseline network and the output of the generator using the discriminator to form the generative adversarial network model.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the training the generative adversarial network model includes:
fixing the parameter values of the generator and the mask module, and training the generative adversarial network model with training data to update the parameters of the discriminator;
and fixing the updated parameters of the discriminator, and training the generative adversarial network model with training data to update the parameters of the generator and the mask module.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the determining intermediate state nodes to be pruned according to the trained generative adversarial network model includes:
determining the intermediate state nodes to be pruned according to the mask values that the mask module of the trained generative adversarial network model assigns to the intermediate state nodes.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the pruning the intermediate state nodes to be pruned in the neural network model includes:
pruning, in each unit, the intermediate state node to be pruned together with the edges connecting it to its preceding nodes and to its succeeding nodes, to form a corresponding new unit structure;
and recombining the new unit structures according to the network architecture of the neural network model to obtain the structurally simplified neural network model.
According to a second aspect of the present disclosure, there is provided a voiceprint recognition apparatus comprising:
a network model construction module, configured to determine candidate unit structures, construct a neural network model based on the candidate unit structures, and train the neural network model;
a generator construction module, configured to add a mask module to the neural network model to construct a corresponding generator;
a generative adversarial network model construction module, configured to take the trained neural network model as a baseline network, construct a generative adversarial network model from the baseline network, the generator and a discriminator, and train the generative adversarial network model;
a node-to-be-pruned determining module, configured to determine intermediate state nodes to be pruned according to the trained generative adversarial network model;
a network structure determining module, configured to prune the intermediate state nodes to be pruned in the neural network model to obtain a structurally simplified neural network model;
a voiceprint feature extraction module, configured to retrain the simplified neural network model and extract voiceprint features of a target to be identified using the trained simplified neural network model to obtain a voiceprint feature vector of the target to be identified;
and a recognition result determining module, configured to determine a voiceprint recognition result according to the similarity between the voiceprint feature vector of the target to be identified and labelled voiceprint feature vectors.
According to a third aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of the above.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of any of the above via execution of the executable instructions.
Exemplary embodiments of the present disclosure may have some or all of the following advantages:
In the voiceprint recognition method provided by the exemplary embodiments of the present disclosure, a neural network model may be constructed and trained based on candidate unit structures obtained by neural network architecture search; a generator is constructed, a generative adversarial network model is then built with the trained neural network model as the baseline network and trained, intermediate state nodes to be pruned are determined from the trained generative adversarial network model, and these nodes are pruned; the pruned, structurally simplified neural network model is retrained and used to perform voiceprint recognition on the speech to be recognized. On the one hand, a good network architecture is obtained through the neural network architecture search algorithm, ensuring the running speed and effectiveness of the network; on the other hand, pruning and compressing the neural network model through the generative adversarial network model greatly reduces the model size without degrading the recognition performance, lowering the consumption of computing resources; furthermore, the new unit structures obtained by pruning greatly reduce the number of network parameters and the computational load, improving voiceprint recognition efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
FIG. 1 illustrates a schematic diagram of an exemplary system architecture to which a voiceprint recognition method and apparatus of embodiments of the present disclosure may be applied;
FIG. 2 schematically illustrates a flow chart of a voiceprint recognition method according to one embodiment of the present disclosure;
FIG. 3 schematically illustrates a neural network model building flow diagram in accordance with one embodiment of the present disclosure;
FIG. 4 schematically illustrates the candidate unit structures obtained by searching according to one embodiment of the present disclosure, where (a) is the structure of a Normal unit and (b) is the structure of a Reduction unit;
FIG. 5 schematically illustrates a schematic diagram of a cell stacking scheme in a neural network model in accordance with one embodiment of the present disclosure;
FIG. 6 schematically illustrates a network architecture diagram of a neural network model in one embodiment in accordance with the present disclosure;
FIG. 7 schematically illustrates a flow diagram of generative adversarial network model construction according to one embodiment of the present disclosure;
FIG. 8 schematically illustrates the structure of a generative adversarial network model according to one embodiment of the present disclosure;
FIG. 9 schematically illustrates the unit structure after intermediate state nodes of a Normal unit are pruned according to one embodiment of the present disclosure;
FIG. 10 schematically illustrates a voiceprint feature vector acquisition flow diagram of an object to be identified in one embodiment in accordance with the present disclosure;
FIG. 11 schematically illustrates spectrograms of two different speakers according to one embodiment of the present disclosure, where each panel corresponds to one speaker's spectrogram;
FIG. 12 schematically illustrates curves of training accuracy and test accuracy versus training epoch for different network models according to one embodiment of the present disclosure, where (a) corresponds to VGG16, (b) to ResNet18, (c) to the traditional DARTS network, and (d) to the structurally simplified neural network of the present disclosure;
FIG. 13 schematically illustrates a block diagram of a voiceprint recognition device in one embodiment in accordance with the present disclosure;
fig. 14 shows a schematic diagram of a computer system suitable for use in implementing embodiments of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. One skilled in the relevant art will recognize, however, that the aspects of the disclosure may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
Fig. 1 shows a schematic diagram of a system architecture 100 of an exemplary application environment in which a voiceprint recognition method and apparatus of embodiments of the present disclosure may be applied. As shown in fig. 1, the system architecture 100 may include one or more of the terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others. The terminal devices 101, 102, 103 may be various electronic devices with display screens including, but not limited to, desktop computers, portable computers, smart phones, tablet computers, and the like. It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 105 may be a server cluster formed by a plurality of servers.
The voiceprint recognition method provided by the embodiments of the present disclosure may be executed in the server 105, and accordingly, the voiceprint recognition apparatus is generally disposed in the server 105. The voiceprint recognition method provided by the embodiment of the present disclosure may also be executed by the terminal devices 101, 102, 103, and correspondingly, the voiceprint recognition apparatus may also be disposed in the terminal devices 101, 102, 103.
How to optimize a neural network that contains thousands of nodes and network parameters, in terms of both its structure and its parameters, is a technical difficulty in this field. The present disclosure constructs a neural network model based on the Differentiable Architecture Search (DARTS) algorithm, prunes redundant intermediate state nodes in the network using a generative adversarial network model, and thereby compresses the network structure while preserving its performance, reducing the number of network parameters and the consumption of computing resources.
The following describes the technical scheme of the embodiments of the present disclosure in detail:
referring to fig. 2, a voiceprint recognition method according to an exemplary embodiment provided by the present disclosure may include the steps of:
Step S210, determining candidate unit structures; constructing a neural network model based on the candidate unit structures and training the neural network model;
Step S220, adding a mask module to the neural network model to construct a corresponding generator;
Step S230, taking the trained neural network model as a baseline network; constructing a generative adversarial network model from the baseline network, the generator and a discriminator, and training the generative adversarial network model;
Step S240, determining intermediate state nodes to be pruned according to the trained generative adversarial network model;
Step S250, pruning the intermediate state nodes to be pruned in the neural network model to obtain a structurally simplified neural network model;
Step S260, retraining the simplified neural network model, and extracting voiceprint features of the target to be identified using the trained simplified neural network model to obtain a voiceprint feature vector of the target to be identified;
Step S270, determining a voiceprint recognition result according to the similarity between the voiceprint feature vector of the target to be identified and labelled voiceprint feature vectors.
In the voiceprint recognition method provided by this exemplary embodiment, a neural network model may be constructed and trained based on candidate unit structures obtained by neural network architecture search; with the trained neural network model as the baseline network, a generator is constructed and a generative adversarial network model is built and trained; intermediate state nodes to be pruned are determined from the trained generative adversarial network model and pruned; and the pruned, structurally simplified neural network model is retrained and used to perform voiceprint recognition on the speech to be recognized. On the one hand, a good network architecture is obtained through the neural network architecture search algorithm, ensuring the running speed and effectiveness of the network; on the other hand, pruning and compressing the neural network model through the generative adversarial network model greatly reduces the model size without degrading the recognition performance, lowering the consumption of computing resources; furthermore, the new unit structures obtained by pruning greatly reduce the number of network parameters and the computational load, improving voiceprint recognition efficiency and further reducing computing resource and energy consumption.
In another embodiment, the above steps are described in more detail below.
In step S210, candidate unit structures are determined, and a neural network model is constructed based on the candidate unit structures and trained.
In this example embodiment, the corresponding data set may be selected according to the classification task, and the training data may be selected from the existing voiceprint data set. Specifically, the neural network model is constructed through the following steps S310 to S330.
In step S310, the candidate unit structures, namely Normal units and Reduction units, are obtained by searching with a gradient-based neural network architecture search method. In this exemplary embodiment, the gradient-based search method is described taking DARTS as an example. First, softmax is applied over all candidate basic operations (e.g., convolution or pooling) to relax the discrete selection of a particular operation into a continuous one; the softmax incorporates the weight information of the candidate operations and is used to compute the expected value of each layer's output. When DARTS converges, only the operation with the largest relative weight is selected and retained in the final model, and the other candidate operations are deleted.
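The continuous relaxation at the heart of this search step can be sketched in a few lines of PyTorch. The following is a minimal illustration, not the patent's implementation; the candidate operation set and the channel handling are simplified assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One edge of a cell: a softmax-weighted sum of all candidate operations."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),  # convolution
            nn.MaxPool2d(3, stride=1, padding=1),                     # pooling
            nn.Identity(),                                            # skip connection
        ])
        # One architecture parameter (alpha) per candidate operation.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        # softmax over alpha relaxes the discrete operation choice into a
        # differentiable expectation over all candidates.
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

# After convergence, only the operation with the largest weight is retained:
# best_op = mixed_op.ops[int(mixed_op.alpha.argmax())]
```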
For example, the DARTS search algorithm is used in this example to determine the two candidate unit structures that make up the model architecture. One is the Normal unit, shown in FIG. 4(a), containing 4 intermediate state nodes numbered 0, 1, 2 and 3; its convolutions do not change the size of the input feature map. The other is the Reduction unit, shown in FIG. 4(b), likewise containing 4 intermediate state nodes numbered 0, 1, 2 and 3; its convolutions reduce the length and width of the input feature map to half the original values by using a larger stride. As can be seen from FIGS. 4(a) and 4(b), the candidate operations on the edges connecting corresponding nodes with their preceding and succeeding nodes differ between the Normal unit and the Reduction unit, so the two belong to different unit structures.
In step S320, the Normal units and the Reduction units are alternately stacked according to the set number of network units and the stacking rule to form the neural network backbone.
In this exemplary embodiment, the number of network units, i.e., the network depth, may be set according to the classification task, and the stacking rule may be determined from the dimensions of the input data. The stacking rule may alternate one Normal unit with one Reduction unit up to the set number of network units, alternate several Normal units with one Reduction unit, or alternate different numbers of Normal and Reduction units each time; for example, the first alternation may stack 2 Normal units and 1 Reduction unit and the second alternation 3 Normal units and 2 Reduction units. Preferably, in this example one Reduction unit is placed at one third of the network depth and another at two thirds, with Normal units everywhere else. The stacking arrangement of the network units is shown in FIG. 5, where x is the input data, Z_l is the output of the l-th unit (i.e., the features extracted by the l-th unit), and Z_n is the output of the n-th unit.
In step S330, a classification layer is set after the neural network backbone to obtain the neural network model. In this exemplary embodiment, a softmax classification layer may be set after the backbone formed by the Normal and Reduction units; it normalizes the extracted features and maps them into (0, 1), after which they can be regarded as probability values for the multiple classes. In addition, a convolution module can be added before the first unit to reduce the dimensionality of the input features so that DARTS can search the network architecture on a single GPU; for example, a convolution operator with stride 2 and kernel size 3 is added before the first unit. The resulting neural network model is shown in FIG. 6, where N denotes the number of Normal units at the corresponding position.
In another embodiment of the present disclosure, a pooling layer and a fully connected layer may be arranged in sequence between the neural network backbone and the classification layer as needed, forming a neural network model consisting of the backbone, the pooling layer, the fully connected layer and the classification layer. The pooling layer in this exemplary embodiment may be a max-pooling layer, used to reduce the dimensionality of the features extracted by the backbone and to increase feature invariance, such as rotational invariance and shift invariance. The fully connected layer integrates the features extracted by all neurons into the final feature vector. The classification layer is a softmax layer used for normalized classification of the features.
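As a concrete illustration of steps S320-S330 and the head described above, the backbone can be assembled as below. This is a hedged sketch: `make_cell` stands in for the searched Normal/Reduction cell constructor (an assumption, not a disclosed API), the input is taken to be a single-channel spectrogram, and the cells are assumed to preserve the channel count.

```python
import torch.nn as nn

def build_model(make_cell, num_cells=10, channels=16, num_classes=100):
    # Reduction units at one third and two thirds of the depth (the 4th and
    # 7th of 10 units); Normal units everywhere else.
    reduction_at = {num_cells // 3, 2 * num_cells // 3}
    layers = [nn.Conv2d(1, channels, kernel_size=3, stride=2, padding=1)]  # stem
    for i in range(num_cells):
        layers.append(make_cell(channels, reduction=(i in reduction_at)))
    return nn.Sequential(
        *layers,
        nn.AdaptiveMaxPool2d(1),           # pooling layer
        nn.Flatten(),
        nn.Linear(channels, num_classes),  # fully connected layer; softmax is
    )                                      # applied in the loss / at inference
```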
Finally, the method further includes training the constructed neural network model. In this exemplary embodiment, the network parameters of the neural network model may first be randomly initialized; the dataset may be selected according to the classification task and divided into a training set and a test set as required; and the initialized neural network model is trained with the training data, its network parameters being iteratively updated during training with a gradient descent algorithm and a cross-entropy loss function. Whether training is complete may be determined by setting a maximum number of training epochs or a threshold on the difference between two adjacent loss values.
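A hedged sketch of this training loop, assuming a PyTorch `model` and `train_loader`; the optimizer, learning rate and stopping tolerance are illustrative choices, not values disclosed in the patent.

```python
import torch
import torch.nn as nn

def train_baseline(model, train_loader, max_epochs=100, tol=1e-4):
    opt = torch.optim.SGD(model.parameters(), lr=0.025, momentum=0.9)
    criterion = nn.CrossEntropyLoss()   # cross-entropy loss, as above
    prev = float("inf")
    for _ in range(max_epochs):         # cap on the number of training epochs
        total = 0.0
        for x, y in train_loader:
            opt.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()             # gradient descent update
            opt.step()
            total += loss.item()
        if abs(prev - total) < tol:     # difference between adjacent losses
            break
        prev = total
    return model
```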
In step S220, a mask module is added to the neural network model to construct the corresponding generator. In this exemplary embodiment, a mask module is added to the intermediate state nodes of the neural network model so that a sparse mask value is applied to the feature value of each intermediate state node, forming the generator. Specifically, a mask module may be added to the intermediate state nodes of each unit in the baseline network, i.e., a sparse mask value (soft mask) is applied to the feature values of each unit's intermediate state nodes; the baseline network with the mask module added is the generator.
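A minimal sketch of such a mask module: each intermediate state node's feature tensor is scaled by a learnable soft mask value. Passing the node outputs of a cell as a list is an illustrative assumption about the data layout.

```python
import torch
import torch.nn as nn

class SoftMask(nn.Module):
    """Sparse soft mask over the intermediate state nodes of one cell."""
    def __init__(self, num_nodes=4):
        super().__init__()
        # One mask value per intermediate state node, initialized in [0, 1].
        self.mask = nn.Parameter(torch.rand(num_nodes))

    def forward(self, node_features):
        # node_features: list of the cell's intermediate state node tensors.
        m = self.mask.clamp(0.0, 1.0)
        return [m[i] * h for i, h in enumerate(node_features)]
```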
In step S230, the trained neural network model is taken as the baseline network, a generative adversarial network model is constructed from the baseline network, the generator and the discriminator, and the generative adversarial network model is trained. In this exemplary embodiment, the neural network model constructed in step S210, or the neural network model trained in any of the above embodiments, may serve as the baseline network. Referring to FIG. 7, the generative adversarial network model is constructed and trained through the following steps S710-S740.
In step S710, the classification-layer predicted values of the baseline network are marked as true labels, and the classification-layer predicted values of the generator are marked as false labels. In this exemplary embodiment, the softmax predicted values of the baseline network are labelled as true and those of the generator as false.
In step S720, a multi-layer perceptron network is used as the discriminator. A Multi-Layer Perceptron (MLP), a kind of artificial neural network (ANN), comprises an input layer, an output layer, and one or more hidden layers between them; it is a feed-forward artificial neural network model that maps multiple input data sets onto a single output data set. In this example, the outputs of the baseline network and of the generator may each be fed into the multi-layer perceptron network, so that the two input data sets are mapped onto one output data set.
In step S730, the discriminator performs classification learning on the output of the baseline network and the output of the generator to form the generative adversarial network model. In this exemplary embodiment, the baseline network and the generator are two parallel networks: the input data are fed into both networks, their outputs are fed into the discriminator, and the discriminator judges each input as true or false and outputs the judgment. By analogy with image data, the generator G(x) tries to produce images similar to the training set, while the discriminator D(x), a binary classifier, tries to distinguish real pictures from the training set x from fake pictures produced by the generator; the generator thus learns the distribution of the data in x and generates realistic images, so that the discriminator cannot tell real images in the training set apart from generated ones. The structure of the generative adversarial network model is shown in FIG. 8; the generator and the discriminator are trained in an adversarial game.
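The discriminator itself can be a very small MLP operating on the classification-layer outputs. The sketch below makes assumptions about the layer sizes; only the input dimension (the number of classes) follows from the text.

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """MLP that predicts whether a classification-layer output came from the
    baseline network (true label) or from the generator (false label)."""
    def __init__(self, num_classes=100, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_classes, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # probability of "true"
        )

    def forward(self, logits):
        return self.mlp(logits)
```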
In step S740, the generative adversarial network model is trained. In this exemplary embodiment, the generative adversarial network model may be trained through the following steps.
First, the parameter values of the generator and the mask module are fixed, and the discriminator is trained with training data to update the parameters of the discriminator. In this exemplary embodiment, the parameter values of the mask module and the discriminator are randomly initialized, with the mask values lying in [0, 1]. After initialization, the network parameters of the generator and the parameter values of the mask module are fixed, the training data are fed into the baseline network and the generator respectively, the discriminator judges the generator's output as true or false, the loss (MSE loss) is computed from the judgment and the true label assigned to the baseline network's output, and the network parameters of the discriminator are updated, for example by gradient descent. When the discriminator's judgments meet the requirement (e.g., the set number of training epochs or the set accuracy is reached), the training of the discriminator ends.
Then, the updated parameters of the discriminator are fixed, and the generator is trained with training data to update the parameters of the generator and the mask module. In this exemplary embodiment, the network parameters of the trained discriminator may be fixed, the training data fed into the baseline network and the generator, and the parameters of the generator and the mask module adjusted and updated by computing the loss (MSE loss) on the generator's output, for example by gradient descent, until training ends and a trained generator is obtained.
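The two alternating phases can be sketched as follows, using the MSE loss named in the text. `baseline`, `generator` (which contains the mask module as a submodule) and `discriminator` are the components introduced above; the optimizers and learning rates are assumptions.

```python
import torch
import torch.nn as nn

def train_gan(baseline, generator, discriminator, loader, steps=1000):
    mse = nn.MSELoss()
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
    # generator.parameters() includes the mask module's parameters.
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
    for _, (x, _) in zip(range(steps), loader):
        ones = torch.ones(x.size(0), 1)
        zeros = torch.zeros(x.size(0), 1)
        with torch.no_grad():
            real = baseline(x)            # baseline output, labelled "true"
        # Phase 1: generator and mask fixed; update the discriminator.
        fake = generator(x).detach()      # generator output, labelled "false"
        d_loss = mse(discriminator(real), ones) + mse(discriminator(fake), zeros)
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        # Phase 2: discriminator fixed; update the generator and mask so that
        # its output becomes indistinguishable from the baseline's.
        g_loss = mse(discriminator(generator(x)), ones)
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```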
In step S240, the intermediate state nodes to be pruned are determined according to the trained generative adversarial network model. In this exemplary embodiment, the intermediate state nodes to be pruned may be determined from the mask values of the mask module in the trained generator: an intermediate state node whose mask value is close to 0 is taken as a node to be pruned, for example a node whose mask value lies in [0, 0.1]. Preferably, intermediate state nodes with a mask value of 0 are taken as the nodes to be pruned.
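The selection rule itself is a one-liner; the 0.1 threshold below mirrors the [0, 0.1] interval mentioned above.

```python
def nodes_to_prune(mask_values, threshold=0.1):
    """Indices of intermediate state nodes whose trained mask value is near 0."""
    return [i for i, m in enumerate(mask_values) if m <= threshold]

# e.g. nodes_to_prune([0.0, 0.93, 0.05, 0.87]) -> [0, 2]
```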
In step S250, the intermediate state nodes to be pruned in the neural network model are pruned to obtain a structurally simplified neural network model.
In this exemplary embodiment, each unit in the neural network model is a directed acyclic graph consisting of an ordered sequence of nodes, where each node is a feature tensor and each edge in the graph consists of several candidate operations; the candidate operations may include max_pool, skip_connect, sep_conv, dil_conv and avg_pool. Each arrowed line in FIGS. 4(a) and 4(b) is one edge.
After the intermediate state nodes to be pruned are determined, in each unit structure of the network the node to be pruned is removed together with the edges connecting it to its preceding nodes and to its succeeding nodes; the structure formed by the remaining nodes and edges is taken as the new unit structure after pruning. As shown in FIG. 9, the pruned unit structure is noticeably smaller and its candidate operations are noticeably fewer. Finally, the new unit structures are assembled according to the network architecture of the neural network model into the structurally simplified neural network model.
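Representing a cell as a plain edge list (an assumed data structure, not the patent's) makes the structural pruning step easy to see: removing a node deletes the node together with every edge linking it to a preceding or succeeding node.

```python
def prune_cell(edges, drop):
    """edges: list of (src_node, dst_node, op_name); drop: node ids to remove."""
    dropped = set(drop)
    return [(s, d, op) for s, d, op in edges
            if s not in dropped and d not in dropped]

# Example: pruning node 2 also removes the edges into and out of node 2.
cell = [(0, 1, "sep_conv"), (0, 2, "dil_conv"), (1, 2, "skip_connect"),
        (1, 3, "max_pool"), (2, 3, "avg_pool")]
print(prune_cell(cell, drop=[2]))
# -> [(0, 1, 'sep_conv'), (1, 3, 'max_pool')]
```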
In step S260, the simplified neural network model is retrained, and the trained simplified neural network model is used to extract voiceprint features of the target to be identified, obtaining the voiceprint feature vector of the target to be identified.
In this exemplary embodiment, speech training data are first acquired. The speech training data comprise spectrograms extracted from speech data together with their labels, each spectrogram corresponding to one speaker's label. The speech training data may be the data of the 100 speakers in the VoxCeleb1 dataset whose speaking durations differ the least: in this 100-speaker voiceprint dataset, the shortest speaking duration is 10 minutes and the longest is 20 minutes. Each speaker's speech is cut into speech files at fixed intervals, each file is converted into a spectrogram, and the spectrogram is labelled. Spectrogram features clearly reflect the distinguishability of two speakers' voiceprints, and such discriminative input features benefit the recognition of the neural network model.
Then, the network parameters of the simplified neural network model are trained with the speech training data, and during training they are iteratively updated with a gradient descent algorithm and a cross-entropy loss function.
In this exemplary embodiment, referring to FIG. 10, voiceprint feature extraction for the target to be identified is achieved through steps S1010-S1040.
In step S1010, the speech data to be recognized are acquired. In this exemplary embodiment, the speech data to be recognized may be obtained through a speech input function of the terminal device or through a recording function, or may be speech played by another playback device; this example imposes no limitation.
In step S1020, voice activity detection is performed on the speech data to be recognized to obtain valid speech data. In this exemplary embodiment, voice activity detection may be performed on each utterance with a VAD algorithm to remove information that is meaningless for recognition, such as silence.
In step S1030, a spectrogram of the valid speech data is extracted and taken as the target to be identified. In this exemplary embodiment, a spectrogram may be extracted from each speaker's speech at fixed time intervals; the interval may be set as appropriate, e.g., several seconds or ten-odd seconds. Preferably, the frame length is set to 25 milliseconds and the frame shift to 10 milliseconds. The target to be identified, i.e., the input to the structurally simplified neural network model, is thus obtained as a spectrogram.
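A hedged sketch of steps S1020-S1030 using librosa (an assumed tool choice): a crude energy-based voice activity detection, since the text names only "a VAD algorithm", followed by spectrogram extraction with a 25 ms frame length and 10 ms frame shift at a 16 kHz sampling rate.

```python
import numpy as np
import librosa

def extract_spectrogram(path, sr=16000, frame=400, hop=160):  # 25 ms / 10 ms
    y, _ = librosa.load(path, sr=sr)
    # Crude VAD: keep non-overlapping frames whose energy is not far below
    # the utterance's average energy (silence removal).
    frames = librosa.util.frame(y, frame_length=frame, hop_length=frame)
    energy = (frames ** 2).mean(axis=0)
    voiced = frames[:, energy > 0.1 * energy.mean()].T.reshape(-1)
    # Log-magnitude spectrogram of the voiced speech.
    spec = np.abs(librosa.stft(voiced, n_fft=512,
                               hop_length=hop, win_length=frame))
    return librosa.amplitude_to_db(spec)
```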
In step S1040, the voiceprint features of the target to be identified are extracted through the unit structures of the trained structurally simplified neural network model and synthesized through the fully connected layer to obtain the voiceprint feature vector of the target to be identified. In this exemplary embodiment, the spectrogram to be identified undergoes feature extraction through the new unit structures of the trained simplified model, and all the extracted features are synthesized by the fully connected layer into the voiceprint feature vector of the target to be identified, which contains the voiceprint feature information extracted by each network unit.
In step S270, the voiceprint recognition result is determined according to the similarity between the voiceprint feature vector of the target to be identified and the labelled voiceprint feature vectors. In this exemplary embodiment, the similarity may be the cosine similarity between two feature vectors. When the task is voiceprint confirmation, the cosine similarity between the target's voiceprint feature vector and the corresponding enrolled voiceprint feature vector is calculated; if the similarity is higher than a preset threshold, the current voiceprint verification passes, and otherwise it is rejected. When the task is voiceprint identification, the cosine similarity between the target's voiceprint feature vector and each voiceprint feature vector in the voiceprint library is calculated, the feature vector with the highest similarity is selected as the recognition result, and the speaker corresponding to that voiceprint feature vector is taken as the identity recognition result.
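A minimal sketch of both decision modes, with cosine similarity as described; the confirmation threshold value is an assumption.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def confirm(target_vec, enrolled_vec, threshold=0.7):
    """Voiceprint confirmation: pass if similarity exceeds the preset threshold."""
    return cosine(target_vec, enrolled_vec) >= threshold

def identify(target_vec, gallery):
    """Voiceprint identification: the enrolled speaker with highest similarity."""
    return max(gallery, key=lambda spk: cosine(target_vec, gallery[spk]))
```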
Next, the voiceprint recognition performance of the present disclosure is illustrated through a specific embodiment.
First, a training dataset was created using the data of the 100 speakers in the VoxCeleb1 dataset whose speaking durations differ the least. VoxCeleb1 is one of the more commonly used text-independent open-source voiceprint recognition datasets; its audio is extracted from YouTube videos, with the speech format unified to a 16 kHz sampling rate, mono. In this 100-speaker voiceprint dataset, the shortest speaking duration is 10 minutes and the longest is 20 minutes; the total speech duration is 1647 minutes, an average of 16 minutes per speaker. Each speaker's speech was cut into a speech file every 3 seconds, and each file was converted into a 360 dpi × 360 dpi spectrogram. Spectrogram features clearly reflect the distinguishability of two speakers' voiceprints, and such discriminative input features benefit the recognition of the neural network model. Spectrograms extracted from the speech of two different speakers are shown in FIG. 11, where the obvious difference between the two panels can be seen.
The spectrogram dataset was divided into a training set and a test set: the training set contains 30931 spectrogram pictures, an average of 310 spectrograms per speaker, while the test set is fixed at 20 spectrograms per speaker, 2000 spectrogram pictures in total.
Because the extracted 360 dpi × 360 dpi spectrogram features are large, high-dimensional pictures, a convolution operator with stride 2 and kernel size 3 was added before the first unit of the neural network model to reduce the dimensionality of the input spectrogram, so that the neural network model can search the network architecture on a single GPU.
First, on the constructed voiceprint dataset, the gradient-based network architecture search method was used to obtain the two candidate unit structures, Normal and Reduction, shown in FIGS. 4(a) and 4(b). The number of unit modules during the architecture search was set to 10, the initial channel number of the first unit to 16, and the number of intermediate state nodes of each unit to 4.
The 10 cell structures obtained by the search were stacked and trained according to the set rule, namely that the 4th and 7th units in the network are Reduction units and the rest are Normal units. The trained neural network model was taken as the baseline network, the intermediate nodes to be pruned were determined through the generative adversarial network model, and structured pruning of the neural network model was performed; the retention of the intermediate state nodes of the network's 10 units after pruning in this example is shown in Table 1.
As can be seen from Table 1, the retained nodes differ from unit to unit, i.e., the pruned nodes differ from unit to unit, so each unit must be considered individually during node pruning. FIG. 9 shows the structure of unit No. 2 in Table 1 after pruning.
Table 1: Node retention of each unit after neural network pruning
In addition, in this embodiment, model training and voiceprint recognition tests were performed on the dataset with different network models; the results are shown in FIG. 12 and Table 2.
Table 2: Performance of different network models on the voiceprint dataset
As can be seen from Table 2, the test accuracy of the VGG16 network on the voiceprint dataset is 87.30%; as FIG. 12(a) shows, VGG16's training accuracy on this dataset reaches 100%, but the gap between its training and test accuracy is relatively large, indicating serious overfitting. The test accuracy of ResNet18 reaches 88.95%, and as FIG. 12(b) shows, the ResNet18 network overfits the least, but its test accuracy is lower. The accuracy of the DARTS network stacked from 10 units with the existing unit structures reaches 94.05%, and as FIG. 12(c) shows, its overfitting is mild. The test accuracy of the simplified-structure network of the present disclosure, stacked from the node-pruned units, also reaches 93.30%, and as FIG. 12(d) shows, its overfitting is likewise mild. Moreover, as Table 2 shows, its parameters are reduced by millions relative to the other network models, with only a very small accuracy drop compared with the existing DARTS network.
Specifically, VGG16 has over a hundred million parameters, and its training was abnormally slow; the parameter count of ResNet18 is greatly reduced compared with VGG16 but is still in the tens of millions. The directly stacked DARTS network has 1.1 million parameters, while the network pruned by the present method has only 0.68 million, i.e., only 61.8% of the original DARTS network. Of the original DARTS network's 40 intermediate state nodes, 23 were pruned and only 17 retained, which shows that on the voiceprint recognition dataset, directly concatenating the intermediate state nodes along the channel direction as the unit output, as the original DARTS network does, still produces many redundant intermediate state nodes. The experimental results of this embodiment show that, on the voiceprint recognition task, the network optimization method of extracting acoustic features from spectrograms and pruning nodes with the present method is effective: compared with the original DARTS algorithm, it reduces the parameter count of the original network by nearly 40% with essentially no loss of accuracy, and it optimizes the structure.
According to the voiceprint recognition method provided by the present disclosure, on the one hand, a good network architecture is obtained through the neural network architecture search algorithm, ensuring the network's running speed and effectiveness; on the other hand, pruning and compressing the neural network model through the generative adversarial network model greatly reduces the model size without degrading the voiceprint recognition performance, lowering the consumption of computing resources; furthermore, the new unit structures obtained by pruning greatly reduce the number of network parameters and the computational load, further reducing computing resource and energy consumption. In addition, the node-pruned network can be deployed both on the network side and on mobile terminal devices, giving it a very wide range of application.
On the one hand, the present method searches out a good network unit structure with a gradient-based neural network architecture search algorithm to construct the baseline network, achieving higher speed and better performance than traditional network models; on the other hand, pruning the model with the generative adversarial network model compresses it through adversarial learning toward the well-trained baseline network, greatly reducing the model size without degrading the recognition performance; and the pruned unit structures greatly reduce the number of network parameters and the computational load, accelerating forward inference, saving computing resources, and reducing energy consumption.
The structurally simplified neural network model can be deployed on mobile devices, with low energy consumption and high practicality.
The structurally simplified neural network model of the present disclosure can also be used in fields such as image classification.
Further, this exemplary embodiment also provides a voiceprint recognition apparatus 1300, which can be applied to a server or a terminal device. Referring to FIG. 13, the voiceprint recognition apparatus 1300 may include:
a network model construction module 1310, which may be used to determine candidate unit structures, construct a neural network model based on the candidate unit structures, and train the neural network model;
a generator construction module 1320, which may be used to add a mask module to the baseline network to construct a corresponding generator;
a generative adversarial network model construction module 1330, which may be used to take the trained neural network model as the baseline network, construct a generative adversarial network model from the baseline network, the generator and a discriminator, and train the generative adversarial network model;
a node-to-be-pruned determining module 1340, which may be used to determine the intermediate state nodes to be pruned according to the trained generative adversarial network model;
a network structure determining module 1350, which may be used to prune the intermediate state nodes to be pruned in the neural network model to obtain a structurally simplified neural network model;
a voiceprint feature extraction module 1360, which may be used to retrain the simplified neural network model and extract voiceprint features of the target to be identified with the trained simplified neural network model to obtain a voiceprint feature vector of the target to be identified;
and a recognition result determining module 1370, which may be used to determine a voiceprint recognition result according to the similarity between the voiceprint feature vector of the target to be identified and labelled voiceprint feature vectors.
In one exemplary embodiment of the present disclosure, the network model construction module 1310 includes:
a unit structure search module, which may be used to search for the candidate unit structures, namely Normal units and Reduction units, with a gradient-based neural network architecture search method;
a unit stacking module, which may be used to alternately stack the Normal units and Reduction units according to the set number of network units and the stacking rule to form the neural network backbone;
a classification module, which may be used to set a classification layer after the neural network backbone to obtain the neural network model;
and a first training module, which may be used to train the neural network model.
In one exemplary embodiment of the present disclosure, the generative adversarial network model construction module 1330 includes:
a label adding module, which may be used to mark the classification-layer predicted values of the baseline network as true labels and those of the generator as false labels;
a discriminator construction module, which may be used to take a multi-layer perceptron network as the discriminator;
an adversarial network forming module, which may be used to perform classification learning on the outputs of the baseline network and the generator with the discriminator to form the generative adversarial network model;
and a second training module, which may be used to train the generative adversarial network model.
In one exemplary embodiment of the present disclosure, the second training module comprises:
a discriminator training module, which may be used to fix the parameter values of the generator and the mask module and train the discriminator with training data to update the parameters of the discriminator;
and a generator training module, which may be used to fix the updated parameters of the discriminator and train the generator with training data to update the parameters of the generator and the mask module.
In one exemplary embodiment of the present disclosure, the network structure determination module 1350 includes:
The pruning module prunes, in each unit, the intermediate state node to be pruned together with the edge connecting that node to its preceding node and the edge connecting it to its succeeding node, forming a corresponding new unit structure.
The recombination module reassembles the new unit structures according to the network architecture of the neural network model, thereby obtaining the structurally simplified neural network model (a sketch of the per-unit pruning follows).
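The per-unit pruning can be pictured as dropping every edge that touches a node whose learned mask collapsed toward zero. The sketch below is illustrative only: the edge-list representation of a unit's DAG and the `threshold` value are assumptions, not details fixed by this disclosure.

```python
def prune_unit(edges, mask_values, threshold=1e-3):
    """Drop intermediate nodes whose learned mask is near zero,
    along with every edge touching them.

    edges: list of (src_node, dst_node, op_name) tuples for one unit's DAG.
    mask_values: dict mapping intermediate node id -> learned mask value.
    """
    pruned_nodes = {n for n, m in mask_values.items() if abs(m) < threshold}
    kept_edges = [(s, d, op) for (s, d, op) in edges
                  if s not in pruned_nodes and d not in pruned_nodes]
    return kept_edges, pruned_nodes

# Example: node 2's mask collapsed to ~0, so node 2 and all its edges disappear.
edges = [(0, 2, "sep_conv_3x3"), (1, 2, "skip"), (2, 3, "sep_conv_3x3"), (1, 3, "skip")]
kept, dropped = prune_unit(edges, {2: 0.0004, 3: 0.92})
print(kept)     # [(1, 3, 'skip')]
print(dropped)  # {2}
```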
In one exemplary embodiment of the present disclosure, the voiceprint feature extraction module 1360 includes:
The voice acquisition module may be used to acquire the voice data to be recognized.
The activity detection module may be used to perform voice activity detection on the voice data to be recognized to obtain valid voice data.
The spectrogram extraction module may be used to extract a spectrogram of the valid voice data and to take the spectrogram as the target to be recognized.
The feature vector extraction module may be used to extract voiceprint features of the target to be recognized through the candidate unit structures of the trained simplified neural network model, and to synthesize those features through the fully connected layer to obtain the voiceprint feature vector of the target to be recognized. The sketch following this list strings these steps together.
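A minimal end-to-end sketch of this path, assuming a 16 kHz mono waveform, a crude energy-threshold stand-in for real voice activity detection, and a model (such as the `StackedBackbone` sketched earlier) whose forward pass returns the fully connected layer's output as the embedding:

```python
import numpy as np
import torch

def extract_voiceprint(model, waveform, sr=16000, frame_ms=25, hop_ms=10):
    """Energy-gated VAD -> log spectrogram -> embedding from the trained slim model."""
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)

    # Crude stand-in for voice activity detection: keep frames whose
    # energy clears a fraction of the mean frame energy.
    frames = [waveform[i:i + frame] for i in range(0, len(waveform) - frame, hop)]
    energies = np.array([float(np.mean(f ** 2)) for f in frames])
    voiced = [f for f, e in zip(frames, energies) if e > 0.1 * energies.mean()]
    valid = np.concatenate(voiced) if voiced else waveform

    # Log-magnitude spectrogram taken as the target to be recognized.
    spec = torch.stft(torch.from_numpy(valid).float(), n_fft=512,
                      hop_length=hop, return_complex=True)
    log_spec = torch.log1p(spec.abs()).unsqueeze(0).unsqueeze(0)  # (1, 1, F, T)

    model.eval()
    with torch.no_grad():
        _, embedding = model(log_spec)  # second output = fully connected layer's vector
    return embedding.squeeze(0)
```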
In one exemplary embodiment of the present disclosure, the recognition result determination module 1370 includes:
The similarity calculation module may be used to calculate the cosine similarity between the voiceprint feature vector of the target to be recognized and a labeled voiceprint feature vector.
The recognition module may be used to determine the voiceprint recognition result according to the cosine similarity, as in the sketch below.
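The scoring step reduces to a cosine comparison against the enrolled, labeled vector; in the sketch below the acceptance threshold of 0.7 is purely illustrative and would in practice be tuned on a development set.

```python
import torch
import torch.nn.functional as F

def verify(probe_vec, enrolled_vec, threshold=0.7):
    """Accept the identity claim when cosine similarity clears the threshold.

    threshold is illustrative; real systems tune it on held-out data.
    """
    score = F.cosine_similarity(probe_vec.unsqueeze(0), enrolled_vec.unsqueeze(0)).item()
    return score, score >= threshold

score, accepted = verify(torch.randn(256), torch.randn(256))
print(f"cosine similarity = {score:.3f}, accepted = {accepted}")
```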
The specific details of each module or unit in the above voiceprint recognition apparatus have already been described in detail in the corresponding voiceprint recognition method, and are therefore not repeated here.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments, or may exist separately without being assembled into that electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the methods described in the above embodiments. For example, the electronic device may implement the steps shown in fig. 2, 3, 7, and 10.
It should be noted that the computer-readable medium shown in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take any of a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, or any suitable combination of the foregoing.
Fig. 14 shows a schematic diagram of a computer system suitable for use in implementing embodiments of the present disclosure.
It should be noted that the computer system 1400 of the electronic device shown in fig. 14 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present disclosure.
As shown in fig. 14, the computer system 1400 includes a Central Processing Unit (CPU) 1401, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1402 or a program loaded from a storage section 1408 into a Random Access Memory (RAM) 1403. In the RAM 1403, various programs and data required for system operation are also stored. The CPU 1401, ROM 1402, and RAM 1403 are connected to each other through a bus 1404. An input/output (I/O) interface 1405 is also connected to the bus 1404.
The following components are connected to the I/O interface 1405: an input section 1406 including a keyboard, a mouse, and the like; an output section 1407 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 1408 including a hard disk or the like; and a communication section 1409 including a network interface card such as a LAN card or a modem. The communication section 1409 performs communication processing via a network such as the Internet. A drive 1410 is also connected to the I/O interface 1405 as needed. A removable medium 1411, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1410 as needed, so that a computer program read therefrom is installed into the storage section 1408 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such an embodiment, the computer program can be downloaded and installed from a network via the communication section 1409 and/or installed from the removable medium 1411. When the computer program is executed by the central processing unit (CPU) 1401, it performs the various functions defined in the method and apparatus of the present application.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that although the steps of the methods of the present disclosure are illustrated in a particular order in the figures, this does not require or imply that the steps must be performed in that particular order, or that all of the illustrated steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one, and/or one step may be decomposed into multiple steps; all such variations are considered part of the present disclosure.
It should be understood that the disclosure described and defined herein extends to all alternative combinations of two or more of the individual features mentioned in or evident from the text and/or drawings. All of these different combinations constitute various alternative aspects of the present disclosure. The embodiments describe the best mode known for carrying out the disclosure and will enable one skilled in the art to utilize it.

Claims (14)

1. A voiceprint recognition method, comprising:
determining candidate unit structures, constructing a neural network model based on the candidate unit structures, and training the neural network model;
adding a mask module into the neural network model to construct a corresponding generator;
taking the trained neural network model as a baseline network, constructing a generative adversarial network model from the baseline network, the generator, and a discriminator, and training the generative adversarial network model;
determining intermediate state nodes to be pruned according to the trained generative adversarial network model;
pruning the intermediate state nodes to be pruned from the neural network model to obtain a structurally simplified neural network model;
retraining the simplified neural network model, and extracting voiceprint features of the target to be recognized using the retrained simplified neural network model to obtain a voiceprint feature vector of the target to be recognized;
and determining a voiceprint recognition result according to the similarity between the voiceprint feature vector of the target to be recognized and a labeled voiceprint feature vector.
2. The voiceprint recognition method of claim 1, wherein determining candidate unit structures and constructing a neural network model based on the candidate unit structures comprises:
searching for the candidate unit structures, namely Normal units and Reduction units, using a gradient-based neural network architecture search method;
alternately stacking the Normal units and the Reduction units according to the set number of network units and the stacking rule to form a neural network backbone;
and setting a classification layer after the neural network backbone to obtain the neural network model.
3. The voiceprint recognition method according to claim 2, wherein the method further comprises:
a pooling layer and a fully connected layer are sequentially arranged between the neural network backbone and the classification layer.
4. The voiceprint recognition method of claim 1, wherein adding a mask module into the neural network model to construct a corresponding generator comprises:
adding a mask module to the intermediate state nodes of the neural network model so that a sparse mask value is applied to the feature value of each intermediate state node, thereby forming the generator.
5. The voiceprint recognition method of claim 4, wherein constructing the generative adversarial network model from the baseline network, the generator, and the discriminator comprises:
marking the classification layer predictions of the baseline network as true labels, and marking the classification layer predictions of the generator as false labels;
taking a multi-layer perceptron network as the discriminator;
and performing classification learning on the outputs of the baseline network and the generator using the discriminator to form the generative adversarial network model.
6. The voiceprint recognition method of claim 5, wherein training the generative adversarial network model comprises:
fixing the parameter values of the generator and the mask module, and training the discriminator with training data to update the parameters of the discriminator;
and fixing the updated parameters of the discriminator, and training the generator with training data to update the parameters of the generator and the mask module.
7. The voiceprint recognition method of claim 4, wherein determining intermediate state nodes to be pruned according to the trained generative adversarial network model comprises:
determining the intermediate state nodes to be pruned according to the mask values that the trained generative adversarial network model's mask module assigns to the intermediate state nodes.
8. The voiceprint recognition method of claim 1, wherein pruning the intermediate state nodes to be pruned from the neural network model comprises:
pruning, in each unit, the intermediate state node to be pruned together with the edges connecting that node to its preceding and succeeding nodes, to form a corresponding new unit structure;
and recombining the new unit structures according to the network architecture of the neural network model to obtain the structurally simplified neural network model.
9. The voiceprint recognition method of claim 1, wherein retraining the simplified neural network model comprises:
acquiring voice training data, the voice training data comprising spectrograms extracted from voice data and their labels;
and training the network parameters of the simplified neural network model with the voice training data, the network parameters being iteratively updated during training using a gradient descent algorithm and a cross-entropy loss function.
10. The voiceprint recognition method of claim 1, wherein extracting voiceprint features of the target to be recognized using the trained simplified neural network model comprises:
acquiring voice data to be recognized;
performing voice activity detection on the voice data to be recognized to obtain valid voice data;
extracting a spectrogram of the valid voice data and taking the spectrogram as the target to be recognized;
and extracting voiceprint features of the target to be recognized through the candidate unit structures of the trained simplified neural network model, and integrating the voiceprint features through the fully connected layer to obtain the voiceprint feature vector of the target to be recognized.
11. The voiceprint recognition method of claim 1, wherein determining the voiceprint recognition result according to the similarity between the voiceprint feature vector of the target to be recognized and the labeled voiceprint feature vector comprises:
calculating the cosine similarity between the voiceprint feature vector of the target to be recognized and the labeled voiceprint feature vector;
and determining the voiceprint recognition result according to the cosine similarity.
12. A voiceprint recognition apparatus, comprising:
a network model construction module, configured to determine candidate unit structures, construct a neural network model based on the candidate unit structures, and train the neural network model;
a generator construction module, configured to add a mask module into the neural network model to construct a corresponding generator;
a generative adversarial network model construction module, configured to take the trained neural network model as a baseline network, construct a generative adversarial network model from the baseline network, the generator, and a discriminator, and train the generative adversarial network model;
a node-to-be-pruned determination module, configured to determine intermediate state nodes to be pruned according to the trained generative adversarial network model;
a network structure determination module, configured to prune the intermediate state nodes to be pruned from the neural network model to obtain a structurally simplified neural network model;
a voiceprint feature extraction module, configured to retrain the simplified neural network model and extract voiceprint features of the target to be recognized using the retrained simplified neural network model to obtain a voiceprint feature vector of the target to be recognized;
and a recognition result determination module, configured to determine a voiceprint recognition result according to the similarity between the voiceprint feature vector of the target to be recognized and a labeled voiceprint feature vector.
13. A computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method of any one of claims 1-11.
14. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-11.
CN202111181601.3A 2021-10-11 2021-10-11 Voiceprint recognition method and device, storage medium and electronic equipment Active CN113870863B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111181601.3A CN113870863B (en) 2021-10-11 2021-10-11 Voiceprint recognition method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111181601.3A CN113870863B (en) 2021-10-11 2021-10-11 Voiceprint recognition method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113870863A (en) 2021-12-31
CN113870863B (en) 2024-07-02

Family

ID=78998913

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111181601.3A Active CN113870863B (en) 2021-10-11 2021-10-11 Voiceprint recognition method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113870863B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115017957A (en) * 2022-06-30 2022-09-06 中国电信股份有限公司 Signal identification method and device, electronic equipment and computer readable medium
CN114926698B (en) * 2022-07-19 2022-10-14 深圳市南方硅谷半导体股份有限公司 Image classification method for neural network architecture search based on evolutionary game theory

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110838295A (en) * 2019-11-17 2020-02-25 西北工业大学 Model generation method, voiceprint recognition method and corresponding device
CN111989742A (en) * 2018-04-13 2020-11-24 三菱电机株式会社 Speech recognition system and method for using speech recognition system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11263487B2 (en) * 2020-03-25 2022-03-01 Microsoft Technology Licensing, Llc Multi-task GAN, and image translator and image classifier trained thereby




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant