CN115083421B - Method and device for constructing automatic parameter-searching speech identification model - Google Patents

Method and device for constructing automatic parameter-searching speech identification model

Info

Publication number
CN115083421B
CN115083421B (application CN202210859650.6A)
Authority
CN
China
Prior art keywords
candidate
network structure
training
voice
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210859650.6A
Other languages
Chinese (zh)
Other versions
CN115083421A (en)
Inventor
陶建华
王成龙
易江燕
张震
李鹏
石瑾
杜金浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
National Computer Network and Information Security Management Center
Original Assignee
Institute of Automation of Chinese Academy of Science
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science and National Computer Network and Information Security Management Center
Priority to CN202210859650.6A
Publication of CN115083421A
Application granted
Publication of CN115083421B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/04: Training, enrolment or model building
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/18: Artificial neural networks; Connectionist approaches

Abstract

The present disclosure relates to a method and a device for constructing a speech identification model with automatic parameter searching, which can automatically tune hyper-parameters to their optimal values. The method comprises the following steps: acquiring voice data in a training set; performing feature extraction on the voice data with a pre-trained speech feature extraction model to obtain speech features; inputting the speech features into a lightweight micro-structure as the initial node, and performing a network structure search according to candidate operations predefined in the search space to obtain a candidate network structure comprising all candidate branch paths and candidate nodes, where a candidate operation characterizes a network connection relationship from a previous node to a subsequent node; taking the ground truth of whether the voice data is genuine or forged speech as the training label, and adjusting the structure weights corresponding to the candidate operations between nodes of the candidate network structure during training; pruning the candidate network structure according to the trained structure weights to obtain a target network structure; and generating a speech identification model from the target network structure.

Description

Method and device for constructing automatic parameter-searching speech identification model
Technical Field
The disclosure relates to the technical field of artificial intelligence and voice recognition, in particular to a method and a device for constructing a voice identification model capable of automatically searching parameters.
Background
With the development of Artificial Intelligence (AI) technology, neural network models are widely used in the fields of computer vision, natural language processing, and the like.
When existing neural network models are applied to natural language processing, for the task of discriminating genuine from forged speech, the common practice is to design the hyper-parameters of the network structure from expert experience and then train and evaluate the resulting model. In some schemes, the parameters for speech feature extraction must be set manually; for example, linear frequency cepstral coefficient (LFCC) features require setting the window length, the fast Fourier transform (FFT) length, the filter-bank coefficients, and so on. Likewise, when identifying whether the object in a captured image is a living or non-living body, the hyper-parameters of the network structure (parameters preset for the model and fixed, not updated, during training) must be designed from expert experience, and hyper-parameters such as the network topology, the number of layers, the convolution kernel size, the number of convolution kernels, and the stride are all set in advance.
However, such experience-based settings are sensitive to the designer's expertise, which makes the output performance of the neural network model unstable.
Disclosure of Invention
In order to solve or at least partially solve the technical problems described above, embodiments of the present disclosure provide a method and an apparatus for constructing a speech identification model with automatic parameter searching.
In a first aspect, an embodiment of the present disclosure provides a method for constructing a speech identification model with automatic parameter searching. The construction method comprises the following steps: acquiring voice data in a training set, wherein the voice data comprises genuine and forged voice data; performing feature extraction on the voice data with a pre-trained speech feature extraction model to obtain speech features; inputting the speech features into a lightweight micro-structure as the initial node, and performing a network structure search according to the candidate operations predefined in the search space of the lightweight micro-structure to obtain a candidate network structure comprising all candidate branch paths and candidate nodes, wherein a candidate operation characterizes a network connection relationship from a previous node to a subsequent node; taking the ground truth of whether the voice data is genuine or forged speech as the training label, and adjusting the structure weights corresponding to the candidate operations between nodes of the candidate network structure during training; pruning the candidate network structure according to the trained structure weights to obtain a target network structure; and generating a speech identification model from the target network structure.
In a second aspect, embodiments of the present disclosure provide a method for constructing a living-body identification model with automatic parameter searching. The construction method comprises the following steps: acquiring image data in a training set, wherein the image data comprises images captured of living and non-living objects; performing feature extraction on the image data with a pre-trained image feature extraction model to obtain image features; inputting the image features into a lightweight micro-structure as the initial node, and performing a network structure search according to the candidate operations predefined in the search space of the lightweight micro-structure to obtain a candidate network structure comprising all candidate branch paths and candidate nodes, wherein a candidate operation characterizes a network connection relationship from a previous node to a subsequent node; taking the ground truth of whether the captured object is a living or non-living body as the training label, and adjusting the structure weights corresponding to the candidate operations between nodes of the candidate network structure during training; pruning the candidate network structure according to the trained structure weights to obtain a target network structure; and generating a living-body identification model from the target network structure.
According to an embodiment of the present disclosure, the predefined candidate operations include: 3×3 separable convolution, 5×5 separable convolution, 3×3 dilated convolution, 5×5 dilated convolution, 3×3 average pooling, 3×3 max pooling, skip connection, zero, and max feature map, where the max feature map operation takes, element-wise, the larger of the corresponding elements in two input feature matrices of the same dimensions.
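As an illustrative sketch (not the patent's implementation), the max feature map reduces to an element-wise maximum over two equally-shaped matrices:

```python
def max_feature_map(a, b):
    """Element-wise max of two feature matrices of the same dimensions.

    `a` and `b` are lists of rows; the output keeps, at each position,
    the larger of the two corresponding elements.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("inputs must have the same dimensions")
    return [[max(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
```

For example, `max_feature_map([[1, 5], [3, 2]], [[4, 0], [1, 6]])` keeps the larger entry at each position.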
According to an embodiment of the present disclosure, in the first training pass, the structure weights on the paths from a previous node of the lightweight micro-structure to a subsequent node via the several candidate operations are randomly generated values. In the k-th training pass (k ≥ 2, k an integer), the structure weight of each candidate operation is adjusted according to the loss function characterizing the difference between the training label and the output of the candidate network structure under the structure weights of pass k−1, and the adjusted weights serve as the structure weights for the speech features input in pass k. Training is considered complete when the number of passes reaches a set value or the loss function falls below a set threshold.
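The schedule above (random initialization, per-pass adjustment from the previous pass's loss, two stopping conditions) can be sketched as follows; `loss_fn` and `grad_fn` are hypothetical stand-ins for the candidate network's loss and its gradient with respect to the structure weights:

```python
import random

def train_structure_weights(loss_fn, grad_fn, n_weights,
                            max_passes=500, loss_threshold=1e-4, lr=0.1):
    """Adjust structure weights over repeated training passes.

    Pass 1 starts from randomly generated weights; pass k (k >= 2) adjusts
    the weights using the loss computed under the weights of pass k-1.
    Training stops when the pass count reaches max_passes or the loss
    drops below loss_threshold.
    """
    weights = [random.uniform(-0.1, 0.1) for _ in range(n_weights)]
    for _ in range(max_passes):
        if loss_fn(weights) < loss_threshold:
            break
        grads = grad_fn(weights)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights
```

A toy quadratic loss illustrates the loop; any differentiable loss over the structure weights would slot in the same way.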
According to an embodiment of the present disclosure, in the lightweight micro-structure, for each node after the initial node, the node value contributed by each of the candidate branch paths leaving the same target previous node is determined as follows: node value of the subsequent node on the branch of a specific candidate operation = node value of the target previous node × softmax value of the structure weight of that candidate operation. When the target previous node is the initial node, its node value is the input feature of the lightweight micro-structure.
According to an embodiment of the present disclosure, pruning the candidate network structure according to the trained structure weights to obtain a target network structure includes: for the candidate branch paths between each pair of connected nodes, selecting the candidate branch path with the largest structure weight; and determining that path as a target path, with the network formed by the target paths and the corresponding target nodes serving as the target network structure.
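The pruning step can be sketched as an argmax over the trained weights on each edge (the data layout is illustrative, not the patent's):

```python
def prune_candidate_structure(edge_ops):
    """Keep, for each pair of connected nodes, only the candidate operation
    with the largest trained structure weight.

    edge_ops: {(prev_node, next_node): {op_name: structure_weight}}
    returns:  {(prev_node, next_node): op_name}
    """
    return {edge: max(ops, key=ops.get) for edge, ops in edge_ops.items()}
```

Each edge of the target network structure then carries exactly one surviving operation.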
According to an embodiment of the present disclosure, generating a speech identification model from the target network structure includes: using the target network structure directly as the speech identification model; or performing parameter optimization on the target network structure to obtain the speech identification model. Performing parameter optimization on the target network structure comprises: formulating the joint optimization of the structure weights between nodes and the network weights inside the operations as a bi-level (two-stage) optimization problem, in which the constraint is the optimal network weights obtained when the loss function on the training set converges, and the objective is, with those optimal network weights fixed inside the operations, to solve for the optimal structure weights when the loss function on the validation set converges; generating an actual training set and an actual validation set from the corpus; and, based on the actual training set and validation set, alternately performing the following two steps until the optimal solution is found: (1) inputting the actual training set into the target network structure and updating its network weights by gradient descent; (2) fixing the updated network weights as the parameter values inside the operations, inputting the actual validation set into the resulting network, and updating its structure weights by gradient descent.
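The alternating scheme above can be sketched with toy quadratic losses standing in for the training-set and validation-set losses (all names and losses here are illustrative, not the patent's networks):

```python
def alternating_bilevel(train_grad_w, val_grad_a, w, alpha, steps=200, lr=0.1):
    """Alternate the two updates of the two-stage optimization:
    (1) update network weights w against the training-set loss;
    (2) fix w inside the operations and update structure weights alpha
        against the validation-set loss.
    Both updates use plain gradient descent.
    """
    for _ in range(steps):
        w = [wi - lr * g for wi, g in zip(w, train_grad_w(w, alpha))]
        alpha = [ai - lr * g for ai, g in zip(alpha, val_grad_a(w, alpha))]
    return w, alpha
```

With a training loss of (w − alpha)² and a validation loss of (alpha − 1)², both variables drift toward 1 as the two updates interleave.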
According to an embodiment of the present disclosure, the pre-trained speech feature extraction model is a Wav2vec model. The Wav2vec model comprises two convolutional neural networks, an encoder network and a context network: the encoder network maps the input voice data, divided by a step size, to encoding vectors in a latent space; the context network takes the encoding vectors of several consecutive steps and outputs, for each step, a speech feature carrying context information.
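A minimal, scalar-valued sketch of this two-network layout (not the real Wav2vec architecture, which stacks multiple causal convolutions): the encoder strides over raw samples to produce latents z_i, and the context module summarises a window of consecutive latents into a context-aware feature c_i:

```python
def conv1d(x, kernel, stride):
    """Valid 1-D convolution of sample list x with the given kernel and stride."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(0, len(x) - k + 1, stride)]

def encoder_network(wave, kernel=(0.5, 0.5), stride=2):
    """Encoder network: maps raw speech samples to latent encodings z_i."""
    return conv1d(wave, kernel, stride)

def context_network(z, receptive=3):
    """Context network: each c_i aggregates up to `receptive` past encodings,
    so it carries context information from the preceding steps."""
    return [sum(z[max(0, i - receptive + 1): i + 1]) / min(i + 1, receptive)
            for i in range(len(z))]
```

For instance, `encoder_network([1, 1, 2, 2, 3, 3])` yields `[1.0, 2.0, 3.0]`, and `context_network` then smooths each step over its recent past.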
According to the embodiment of the disclosure, the Wav2vec model performs self-supervised learning on unlabeled voice data in the pre-training stage, and the corresponding loss function $L_k$ is:

$$L_k = -\sum_{i} \left( \log \sigma\!\left(z_{i+k}^{\top} h_k(c_i)\right) + \lambda\, \mathbb{E}_{\tilde{z} \sim p_n}\!\left[ \log \sigma\!\left(-\tilde{z}^{\top} h_k(c_i)\right) \right] \right)$$

where $h_k(c_i)$ denotes an affine transformation of $c_i$; $c_i$ is the context-carrying speech feature output by the context network; $k$ is the prediction step size; $i$ is the index of the division step; $z_{i+k}^{\top}$ is the transpose of $z_{i+k}$, the encoding vector at step $i+k$; $\sigma(\cdot)$ is the sigmoid function; $\tilde{z}$ denotes a negative sample drawn from the proposal distribution $p_n$; $\mathbb{E}_{\tilde{z} \sim p_n}[\cdot]$ denotes the expectation over negative samples; and $\lambda$ is a hyper-parameter.
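The contrastive loss just described can be sketched numerically; here the expectation over $p_n$ is approximated by an average over an explicit list of negative samples, and `h` is any affine map supplied by the caller (all names are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def wav2vec_contrastive_loss(z, c, h, k, negatives, lam=1.0):
    """Contrastive loss L_k: score the true future encoding z[i+k] highly
    against negative samples, given the projection h(c[i]).

    z: list of encoding vectors; c: list of context vectors;
    h: affine transformation applied to c[i]; k: prediction step size;
    negatives: list of negative-sample vectors approximating E over p_n.
    """
    loss = 0.0
    for i in range(len(z) - k):
        pred = h(c[i])
        loss -= math.log(sigmoid(dot(z[i + k], pred)))       # true-pair term
        neg = sum(math.log(sigmoid(-dot(zt, pred))) for zt in negatives)
        loss -= lam * neg / len(negatives)                   # negative-sample term
    return loss
```

Both terms are negative log-sigmoids, so the loss is always positive and shrinks as the true pair is scored above the negatives.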
According to an embodiment of the present disclosure, generating a living-body identification model from the target network structure includes: using the target network structure directly as the living-body identification model; or performing parameter optimization on the target network structure to obtain the living-body identification model. Performing parameter optimization on the target network structure comprises: formulating the joint optimization of the structure weights between nodes and the network weights inside the operations as a bi-level (two-stage) optimization problem, in which the constraint is the optimal network weights obtained when the loss function on the training set converges, and the objective is, with those optimal network weights fixed inside the operations, to solve for the optimal structure weights when the loss function on the validation set converges; generating an actual training set and an actual validation set from the corpus; and, based on the actual training set and validation set, alternately performing the following two steps until the optimal solution is found: (1) inputting the actual training set into the target network structure and updating its network weights by gradient descent; (2) fixing the updated network weights as the parameter values inside the operations, inputting the actual validation set into the resulting network, and updating its structure weights by gradient descent.
In a third aspect, embodiments of the present disclosure provide a model construction apparatus with automatic parameter searching. The apparatus includes: a data acquisition module, a feature extraction module, a candidate network structure search and construction module, a structure weight adjustment module, a target network structure generation module, and a model construction module. The data acquisition module acquires training data in a training set; the training data comprises at least one of voice data or image data, the voice data comprising genuine and forged voice data, and the image data comprising images captured of living and non-living objects. The feature extraction module performs feature extraction on the training data based on a pre-trained model to obtain training features. The candidate network structure search and construction module inputs the training features into a lightweight micro-structure as the initial node and performs a network structure search according to the candidate operations predefined in the search space of the lightweight micro-structure to obtain a candidate network structure comprising all candidate branch paths and candidate nodes, wherein a candidate operation characterizes a network connection relationship from a previous node to a subsequent node. The structure weight adjustment module adjusts, during training, the structure weights corresponding to the candidate operations between nodes of the candidate network structure based on the training labels of the training data. The target network structure generation module prunes the candidate network structure according to the trained structure weights to obtain the target network structure.
The model construction module generates a target model from the target network structure; the target model comprises at least one of a speech identification model and a living-body identification model.
In a fourth aspect, embodiments of the present disclosure provide an electronic device. The electronic device comprises a processor, a communication interface, a memory, and a communication bus, through which the processor, the communication interface, and the memory communicate with one another; the memory stores a computer program; and the processor, when executing the program stored in the memory, implements the above method for constructing a speech identification model or a living-body identification model with automatic parameter searching.
In a fifth aspect, embodiments of the present disclosure provide a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the above method for constructing a speech identification model or a living-body identification model with automatic parameter searching.
The technical scheme provided by the embodiment of the disclosure at least has part or all of the following advantages:
The speech features extracted by the pre-trained speech feature extraction model are input into a lightweight micro-structure as the initial node, and a network structure search is performed according to the candidate operations predefined in the search space of the lightweight micro-structure, yielding a candidate network structure containing all candidate branch paths and candidate nodes. The ground truth of whether the voice data is genuine or forged speech serves as the training label, and the structure weights corresponding to the candidate operations between nodes are adjusted during training, so each operation on each candidate branch path receives an optimized structure weight according to its training effect; the hyper-parameters of the network structure thus adapt to the voice data and can be tuned toward the optimum. After training, the relative magnitudes of the structure weights of the candidate operations between nodes indicate the most probable connections between adjacent nodes; the candidate network structure is pruned according to these weights, and the speech identification model generated from the pruned target network structure achieves stable, good identification performance without manually setting model hyper-parameters from expert experience.
Some technical solutions provided by the embodiments of the present disclosure have at least some or all of the following advantages:
By including the max feature map among the predefined candidate operations, signals can be effectively screened, filtering out noise while retaining the useful signal; this applies to both speech identification and image identification scenarios and is therefore broadly applicable.
Some technical solutions provided by the embodiments of the present disclosure have at least some or all of the following advantages:
the method comprises the steps of carrying out searching of a model structure in a searching space and optimizing structure weight by combining a Wav2vec model and a lightweight micro-structure, carrying out self-supervision training on a large amount of unlabelled voice data by the Wav2vec model, carrying out training by using a contrast learning loss function after a large amount of mask codes are carried out on input data, enabling extracted features to reflect voice general features, and combining with the subsequent searching of the model structure and the optimization process of the structure weight by the lightweight micro-structure, so that not only can the model super-parameters be automatically adjusted along with the training process, but also super-parameters such as window length, fast Fourier Transform (FFT) length, filter set coefficients and the like do not need to be set in the feature extraction process like a conventional technology, and the stable model performance is integrally realized without manually setting the super-parameters in the whole ring.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the related art, the drawings used in the description of the embodiments are briefly introduced below; other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 schematically illustrates a flowchart of a method for constructing a speech identification model with automatic parameter searching according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates the process of searching the network structure of a lightweight micro-structure to obtain a candidate network structure according to an embodiment of the disclosure;
FIG. 3 schematically illustrates the process of adjusting the structure weights corresponding to candidate operations among nodes of a candidate network structure during training according to an embodiment of the disclosure;
FIG. 4A schematically illustrates a target network structure obtained by pruning a candidate network structure according to the trained structure weights according to an embodiment of the disclosure;
FIG. 4B schematically illustrates the process of structure search and structure weight tuning on a candidate network structure during training with 9 predefined candidate operations according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates an implementation scenario of end-to-end voice data processing based on a speech identification model according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates the process of pre-training a Wav2vec model according to an embodiment of the disclosure;
FIG. 7 schematically illustrates a flowchart of a method for constructing a living-body identification model with automatic parameter searching according to an embodiment of the present disclosure;
FIG. 8 schematically illustrates a block diagram of a model construction apparatus with automatic parameter searching according to an embodiment of the present disclosure; and
FIG. 9 schematically illustrates a block diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some embodiments of the present disclosure, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
A first exemplary embodiment of the present disclosure provides a method for constructing a speech identification model with automatic parameter searching.
Fig. 1 schematically shows a flowchart of a method for constructing a speech identification model with automatic parameter searching according to an embodiment of the present disclosure.
Referring to fig. 1, the method for constructing a speech identification model with automatic parameter searching according to an embodiment of the present disclosure includes the following steps: S110, S120, S130, S140, S150, and S160.
In step S110, voice data in the training set is obtained, and the voice data includes real and fake voice data.
In step S120, feature extraction is performed on the voice data based on the pre-training voice feature extraction model to obtain voice features.
The pre-training speech feature extraction model is a feature extraction model which is pre-trained with or without supervision based on corpus data.
In step S130, the voice features are input into the lightweight micro-structure and used as initial nodes, and a network structure search is performed according to candidate operations predefined in the search space of the lightweight micro-structure, so as to obtain a candidate network structure including all candidate branch paths and candidate nodes. The candidate operation characterizes a network connection relationship from a previous node to a next node.
In the embodiment of the present disclosure, the lightweight micro-structure refers to a network structure including a plurality of nodes, and the network structure is constructed by performing search learning in a search space.
Fig. 2 schematically shows a process diagram of a network structure search to obtain a candidate network structure according to a lightweight micro-structure of the present disclosure.
Referring to FIG. 2, in the lightweight micro-structure, the i-th node is denoted x^(i), i = 0, 1, 2, …. The initial node is node 0, denoted x^(0); its value is the input speech feature. Each directed edge (i, j), from a previous node x^(i) (which may also be described as a preceding node) to a subsequent node x^(j) (which may also be described as a succeeding node), corresponds to a candidate operation; that is, the candidate operation characterizes the network connection relationship from the previous node to the subsequent node. The directed edge (i, j) is associated with node x^(i) and with a number of possible candidate operations o^(i,j).
The previous node and the subsequent node may be adjacent or non-adjacent in the sequence; for example, with the nodes ordered as x^(0), x^(1), x^(2), …, each subsequent node is generated from a previous node by a possible candidate operation. Node x^(0) and node x^(1) may be connected by at least one of the candidate operations; likewise node x^(1) and node x^(2), or node x^(0) and node x^(2), may be connected by at least one of the candidate operations, and so on. Performing a network structure search according to the predefined candidate operations then yields a candidate network structure containing all candidate branch paths and candidate nodes. The search covers the candidate branch paths of every possible candidate operation; for simplicity, FIG. 2 only illustrates the search over two candidate operations, producing a candidate network structure of three nodes, with the two candidate operations drawn in different line types. For example, in the resulting candidate network structure there are two possible candidate branch paths from node 0 to node 1, corresponding to the 3×3 separable convolution and the 3×3 average pooling operations, respectively; two possible candidate branch paths from node 1 to node 2, corresponding to the same two operations; and, likewise, two possible candidate branch paths from node 0 to node 2.
Similarly, according to the type of candidate operation predefined in the search space, a plurality of candidate branch paths and corresponding candidate nodes can be correspondingly searched, so as to obtain a corresponding candidate network structure.
It should be noted that the subsequent nodes (for example, x^(1)) produced from the same previous node (for example, x^(0)) by different operations have different node values. In FIG. 2, for instance, node 1 corresponds to the node values indicated by two ellipses; the multiple node values of the same subsequent node should be understood as multiple possible parallel nodes at the same position (distinguished, for example, as x^(1-1) and x^(1-2)) corresponding to different candidate branch paths, indicated by the reference numerals (1) to (8).
According to an embodiment of the present disclosure, in the lightweight micro-structure, for each subsequent node after the initial node, the node values of the target subsequent nodes of the plurality of candidate branch paths corresponding to the same target previous node are determined by: node value of the subsequent node on the candidate branch path of a specific candidate operation = node value of the target previous node × softmax value of the structural weight of that specific candidate operation. When the target previous node is the initial node, its node value is the input feature of the lightweight micro-structure, which in the present embodiment is the input voice feature.
For example, taking the simplified candidate network structure illustrated in fig. 2, for target node 1 reached from node 0 via the two candidate branch paths (1) and (2), the value of the left-hand node 1 on candidate branch path (1), corresponding to the 3 × 3 separable convolution, is: x^(1-1) = x^(0) [specifically, the input voice feature] × softmax(structural weight of the 3 × 3 separable convolution candidate operation).
Here softmax(structural weight of the 3 × 3 separable convolution) = exp(structural weight of the 3 × 3 separable convolution) / {exp(structural weight of the 3 × 3 separable convolution) + exp(structural weight of the 3 × 3 average pooling)}. That is, the denominator is the sum of the exponentials of the structural weights of all candidate operations in the search space.
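The node-value formula above can be sketched in a few lines. The operation names and the scalar treatment of node values are illustrative assumptions, not taken from the patent:

```python
import math

def branch_node_values(prev_value, alphas):
    """Per the formula above: the subsequent-node value on each candidate
    branch path is the previous node's value scaled by the softmax of that
    branch's structural weight. `alphas` maps operation name -> structural
    weight; the denominator sums the exponentials over all candidates."""
    denom = sum(math.exp(a) for a in alphas.values())
    return {op: prev_value * math.exp(a) / denom for op, a in alphas.items()}
```

With two candidate operations of equal structural weight, each branch receives half of the previous node's value, matching the two-operation example of fig. 2.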
In one embodiment, 9 candidate operations are predefined in the search space. The predefined candidate operations include: 3 × 3 separable convolution, 5 × 5 separable convolution, 3 × 3 dilated convolution, 5 × 5 dilated convolution, 3 × 3 average pooling, 3 × 3 max pooling, skip connection, zero, and maximum feature map, wherein the maximum feature map selects, at the element level, the larger of the corresponding elements in two input feature matrices of the same dimension.
By including the maximum feature map among the predefined candidate operations, effective screening of signals can be realized, filtering out noise while retaining useful signals; the operation is applicable to both voice identification and image identification scenarios and is therefore general-purpose.
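A minimal sketch of the maximum feature map operation as defined above, assuming for illustration that features are given as nested lists of equal dimensions:

```python
def max_feature_map(a, b):
    """Element-wise maximum of two feature matrices of the same dimension:
    for each position, keep the larger of the two corresponding elements."""
    return [[max(x, y) for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]
```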
In step S140, the real result of whether the voice data is genuine voice is used as the training label, and the structural weights corresponding to the candidate operations between the nodes of the candidate network structure are adjusted during training.
In embodiments of the present disclosure, the node connections and activation functions are combined into one matrix, where each element represents the weight of a connection and activation function. During searching, the lightweight micro-structure searches paths in the search space to find the averaged weights of the relevant connections (a softmax keeps the network structure stable and limits the weights to the range [0, 1]), so that the search space becomes a continuous space.
Fig. 3 schematically illustrates a process diagram for adjusting structure weights corresponding to candidate operations between nodes of a candidate network structure in a training process according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, referring to fig. 3, in the first training, the structural weights with which a previous node of the lightweight micro-structure reaches a subsequent node through the plurality of candidate operations are randomly generated values.
For example, the structural weight corresponding to the candidate operation between the corresponding nodes during each training is expressed as follows:
α_1^(0,1): the structural weight corresponding to candidate branch path (1) from node 0 to node 1 (or, described as the corresponding candidate operation, e.g., the 3 × 3 separable convolution);
α_2^(0,1): the structural weight corresponding to candidate branch path (2) from node 0 to node 1 (e.g., the 3 × 3 average pooling);
α_1^(1,2): the structural weight corresponding to candidate branch path (3) from node-1 value x^(1-1) to node 2 (e.g., the 3 × 3 separable convolution);
α_2^(1,2): the structural weight corresponding to candidate branch path (4) from node-1 value x^(1-1) to node 2 (e.g., the 3 × 3 average pooling);
α_3^(1,2): the structural weight corresponding to candidate branch path (5) from node-1 value x^(1-2) to node 2 (e.g., the 3 × 3 separable convolution);
α_4^(1,2): the structural weight corresponding to candidate branch path (6) from node-1 value x^(1-2) to node 2 (e.g., the 3 × 3 average pooling);
α_1^(0,2): the structural weight corresponding to candidate branch path (7) from node 0 to node 2 (e.g., the 3 × 3 separable convolution);
α_2^(0,2): the structural weight corresponding to candidate branch path (8) from node 0 to node 2 (e.g., the 3 × 3 average pooling).
In the kth training iteration (k ≥ 2, k an integer), the structural weight corresponding to each candidate operation is adjusted according to the loss function characterizing the difference between the training label and the output of the candidate network structure under the structural weights of the (k-1)th iteration; the adjusted structural weights are used as the structural weights applied to the voice features input in the kth iteration. Training is considered complete when the number of iterations reaches a set value or the loss function falls below a set threshold.
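The iterative adjustment described above can be sketched as follows; the numeric-gradient step and the function names are illustrative assumptions standing in for the actual backpropagation through the candidate network structure:

```python
def train_structure_weights(alphas, loss_fn, lr=0.1, steps=100, threshold=1e-4):
    """At iteration k, adjust the structural weights against the loss
    computed under the iteration k-1 weights; stop when the iteration
    budget is reached or the loss falls below the set threshold."""
    eps = 1e-5
    for _ in range(steps):
        loss = loss_fn(alphas)
        if loss < threshold:
            break
        # Numeric gradient of the loss w.r.t. each structural weight.
        grads = []
        for i in range(len(alphas)):
            bumped = list(alphas)
            bumped[i] += eps
            grads.append((loss_fn(bumped) - loss) / eps)
        alphas = [a - lr * g for a, g in zip(alphas, grads)]
    return alphas
```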
By adjusting the structural weights corresponding to the candidate operations between the nodes of the candidate network structure during training, an optimized structural weight can be assigned, based on the training effect, to each operation on each candidate branch path in the candidate network structure, so that the hyper-parameters of the network structure adapt to the voice data and can reach an optimum.
In step S150, the candidate network structures are simplified according to the trained structure weights, so as to obtain a target network structure.
FIG. 4A is a schematic diagram illustrating a target network structure obtained by simplifying candidate network structures according to trained structure weights according to an embodiment of the present disclosure; fig. 4B schematically shows a process diagram of performing structure search and structure weight tuning on a candidate network structure in a training process under a scenario of 9 predefined candidate operations according to an embodiment of the present disclosure.
According to the embodiment of the present disclosure, in step S150, simplifying the candidate network structure according to the trained structure weights to obtain the target network structure includes: for the plurality of candidate branch paths between each group of adjacent nodes, screening out the candidate branch path with the largest structural weight; the candidate branch path with the largest structural weight is determined as the target path, and the network formed by the target paths and the corresponding target nodes serves as the target network structure. Referring to fig. 4A, among the structural weights α_1^(0,1) and α_2^(0,1) between node 0 and node 1, α_1^(0,1) is the largest, so the corresponding candidate branch path (1) is determined as the target path between node 0 and node 1; among the structural weights α_1^(1,2), α_2^(1,2), α_3^(1,2), α_4^(1,2) between node 1 and node 2, α_2^(1,2) is the largest, so the corresponding candidate branch path (4) is determined as the target path between node 1 and node 2. Since the structural weights of the candidate branch paths between node 0 and node 2 are all 0 when training completes, no target branch path exists between node 0 and node 2.
For the 9 predefined candidate operations provided by the embodiment of the present disclosure, referring to fig. 4B, the dashed box highlights the candidate operations from node 0 to node 1: 3 × 3 separable convolution (illustrated in fig. 4B as 3 × 3_sep_conv), 5 × 5 separable convolution (5 × 5_sep_conv), 3 × 3 dilated convolution, 5 × 5 dilated convolution, 3 × 3 average pooling, 3 × 3 max pooling, skip connection (indicating a connection between non-adjacent nodes), zero (indicating no operation, i.e., no connection between the two nodes), and maximum feature map (Max-feature-map, MFM). The candidate operations O_1, O_2, ..., O_9 and the structural weights adjusted during training respectively correspond to *S_o1, *S_o2, ..., *S_o9, and the sum of these structural weights is 1.
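The simplification illustrated in fig. 4A can be sketched as a per-edge argmax over structural weights; the data layout used here is an illustrative assumption:

```python
def simplify(edge_alphas):
    """For each pair of nodes, keep only the candidate branch path with the
    largest structural weight; edges whose weights are all zero get no
    target path. `edge_alphas` maps (prev, next) -> {operation: weight}."""
    target = {}
    for edge, ops in edge_alphas.items():
        best_op, best_weight = max(ops.items(), key=lambda kv: kv[1])
        if best_weight > 0:
            target[edge] = best_op
    return target
```

Applied to the fig. 4A example, this keeps path (1) between nodes 0 and 1, path (4) between nodes 1 and 2, and drops the all-zero edge from node 0 to node 2.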
In step S160, a speech recognition model is generated according to the target network structure.
According to an embodiment of the present disclosure, generating a speech discrimination model according to the target network structure includes: using the target network structure as a voice identification model; or, performing parameter optimization on the target network structure to obtain a voice identification model.
When the parameter optimization is performed, both the structure weight and the network weight inside each operation (for example, if the operation is a 3 × 3 convolution, the network weight includes a parameter corresponding to a convolution kernel) are optimized.
Based on steps S110-S160, the voice features extracted by the pre-trained voice feature extraction model are input into the lightweight micro-structure as the initial node, and a network structure search is performed according to the candidate operations predefined in the search space of the lightweight micro-structure, yielding a candidate network structure including all candidate branch paths and candidate nodes. The real result that the voice data is true voice serves as the training label, and the structural weights corresponding to the candidate operations between the nodes of the candidate network structure are adjusted during training, so that an optimized structural weight can be assigned to each operation on each candidate branch path according to the training effect, and the hyper-parameters of the network structure adapt to the voice data and can reach an optimum. The relative sizes of the trained structural weights of the candidate operations between the nodes represent the most probable connection relationships between adjacent nodes; the candidate network structure is simplified according to these structural weights, and the voice identification model generated from the simplified target network structure has stable, good voice identification performance without requiring the model hyper-parameters to be set manually according to expert experience.
According to the embodiment of the disclosure, the parameter optimization is performed on the target network structure to obtain the voice identification model, and the method comprises the following steps: and constructing optimization of the structural weight between the nodes in the target network structure and the network weight inside the operation into a two-stage optimization problem, wherein in the two-stage optimization problem, the constraint conditions comprise: based on the corresponding optimal network weight when the loss function of the training set is converged; the objective function is: when the optimal network weight is fixedly adopted in the operation, solving the corresponding optimal structure weight when the loss function based on the verification set is converged; generating an actual training set and an actual verification set according to the corpus; based on the actual training set and the actual verification set, alternately executing the following two steps until an optimal solution is found: inputting the actual training set into the target network structure, and updating the network weight of the target network structure based on a gradient descent method; and taking the updated network weight as a parameter value inside the operation of the target network structure, inputting the actual verification set into the target network structure with a fixed parameter value inside the operation, and updating the structure weight of the target network structure with the fixed parameter value inside the operation based on a gradient descent method.
The objective function and constraint of the two-stage optimization problem can be expressed as:

min_α L_val(w*(α), α)
s.t. w*(α) = argmin_w L_train(w, α)    (1)

wherein L_val denotes the loss function on the validation set; L_train denotes the loss function on the training set; α denotes the structural weights and is the upper-level variable; w denotes the network weights inside the operations (for example, convolution kernel parameters) and is the lower-level variable; and w*(α) denotes the optimal network weights for a given α.
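The alternating scheme for solving the two-stage problem can be sketched as two interleaved gradient steps; the gradient callables below are placeholders for the actual training-set and validation-set backward passes:

```python
def alternate_optimize(w, alpha, grad_w_train, grad_alpha_val, lr=0.05, rounds=200):
    """Alternate until convergence: update the network weights w on the
    training-set loss, then, holding w fixed inside the operations,
    update the structural weights alpha on the validation-set loss."""
    for _ in range(rounds):
        w = w - lr * grad_w_train(w, alpha)            # training-set step
        alpha = alpha - lr * grad_alpha_val(w, alpha)  # validation-set step
    return w, alpha
```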
Fig. 5 schematically illustrates an implementation scenario of end-to-end speech data processing based on a speech discrimination model according to an embodiment of the present disclosure.
In an improved embodiment, the pre-trained speech feature extraction model is a Wav2vec model 510 (a speech feature extraction model obtained by performing self-supervised training on a large amount of unlabeled speech data). Referring to fig. 5, inputting voice data to be processed into a Wav2vec model 510 for feature extraction to obtain voice features; then, the voice features are input into the voice identification model 520 constructed based on the lightweight micro-structure, and the processing result that the voice data to be processed is the real voice and the forged voice can be output.
Fig. 6 schematically shows a process diagram of the pre-training of the Wav2vec model according to an embodiment of the present disclosure.
Referring to fig. 6, the Wav2vec model 600 includes two convolutional neural networks: an encoder network 610 and a context network 620. The encoder network maps the input speech data, divided according to a step size, into coding vectors in a hidden-layer space (i.e., a feature coding vector per frame, yielding a frame feature sequence). In addition, the feature coding vector of each frame is converted into discrete features q that serve as the self-supervision targets. Meanwhile, the frame feature sequence is masked and input into the Transformer context network 620, which outputs, from the coding vectors of a plurality of consecutive steps, the voice feature carrying context information for each step.
Referring to fig. 6, in the pre-training stage the Wav2vec model 600 performs self-supervised learning based on unlabeled voice data: a large amount of masking is applied to the input, and training then uses the contrastive learning loss function L_k.
The loss function L_k used in training the Wav2vec model is:

L_k = -Σ_i ( log σ(z_{i+k}^T h_k(c_i)) + λ E_{z̃∼p_n}[ log σ(-z̃^T h_k(c_i)) ] ),    (2)

wherein h_k(c_i) denotes an affine transformation of c_i; c_i denotes the voice feature carrying context information output by the context network; k denotes the prediction step size; i denotes the step sequence number of the division; z_{i+k}^T denotes the transpose of z_{i+k}, the coding vector corresponding to step i+k; σ(·) denotes the sigmoid function; z̃ denotes a negative sample drawn from the proposal distribution p_n; E_{z̃∼p_n}[·] denotes the expectation over the negative samples; and λ is a hyper-parameter.
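The contrastive loss of equation (2) can be checked numerically with a small sketch; treating vectors as plain float lists and approximating the expectation over negatives by their mean are illustrative simplifications:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def contrastive_loss_k(preds, positives, negatives, lam=1.0):
    """Step-k contrastive loss of equation (2): preds[i] is h_k(c_i),
    positives[i] is the true future coding vector z_{i+k}, and
    negatives[i] is a list of distractor vectors sampled from p_n."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    loss = 0.0
    for h, z, negs in zip(preds, positives, negatives):
        loss -= math.log(sigmoid(dot(z, h)))  # positive (true future) term
        neg_mean = sum(math.log(sigmoid(-dot(n, h))) for n in negs) / len(negs)
        loss -= lam * neg_mean                # expectation over negatives
    return loss
```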
In the embodiment containing the Wav2vec model, the Wav2vec model is combined with the lightweight micro-structure to search the model structure in the search space and optimize the structural weights. Because the Wav2vec model is trained in a self-supervised manner on a large amount of unlabeled voice data, with heavy masking of the input and a contrastive learning loss function, the extracted features reflect general properties of speech. Combined with the subsequent structure search and structural-weight optimization of the lightweight micro-structure, the model hyper-parameters are adjusted automatically along with training, and, unlike conventional techniques, no hyper-parameters such as window length, fast Fourier transform (FFT) length, or filter-bank coefficients need to be set during feature extraction. The model performance is stable overall, and the entire pipeline requires no manually set hyper-parameters: neither the hyper-parameters of the feature extraction stage nor those of the neural network model need to be specified.
Based on the same technical concept, a second exemplary embodiment of the present disclosure provides a method of constructing an automatic reference finding living body identification model.
Fig. 7 schematically shows a flowchart of a method of constructing a living body identification model for automatic reference seeking according to an embodiment of the present disclosure.
Referring to fig. 7, a method for constructing a living body identification model according to an embodiment of the present disclosure includes the following steps: s710, S720, S730, S740, S750, and S760.
In step S710, image data in a training set is acquired, the image data including image data obtained by photographing a living subject and a non-living subject.
For example, the non-living object is a non-living object carried on a medium, such as a human face picture or photograph carried on paper or photographic paper, or a human face sculpture formed in plaster.
In step S720, feature extraction is performed on the image data based on the pre-training image feature extraction model to obtain image features.
In an embodiment, the pre-training image feature extraction model is a feature extraction model obtained by performing deep learning in advance based on corpus data.
In step S730, the image features are input into the lightweight micro-structure and used as initial nodes, and a network structure search is performed according to candidate operations predefined in the search space of the lightweight micro-structure, so as to obtain a candidate network structure including all candidate branch paths and candidate nodes. Wherein the candidate operation characterizes a network connection relationship from a previous node to a next node.
According to an embodiment of the present disclosure, the predefined candidate operations include: 3 × 3 separable convolution, 5 × 5 separable convolution, 3 × 3 dilated convolution, 5 × 5 dilated convolution, 3 × 3 average pooling, 3 × 3 max pooling, skip connection, zero, and maximum feature map, wherein the maximum feature map selects, at the element level, the larger of the corresponding elements in two input feature matrices of the same dimension.
By setting the maximum feature map in the predefined candidate operation, effective screening of signals can be realized, so that filtering of noise and retention of useful signals are realized, and the method is suitable for scenes of voice identification and image identification and has universality.
According to an embodiment of the present disclosure, in the lightweight micro-structure, for each subsequent node after the initial node, the node values of the target subsequent nodes of the plurality of candidate branch paths corresponding to the same target previous node are determined by the following formula: node value of the subsequent node on the candidate branch path of a specific candidate operation = node value of the target previous node × softmax value of the structural weight of that specific candidate operation; that is, the right-hand side of the formula is: node value of the target previous node × softmax(α), where α denotes the structural weight of the specific candidate operation. When the target previous node is the initial node, its node value is the input feature of the lightweight micro-structure, which in the present embodiment is the input image feature.
In step S740, a real result that the object of the image data is a living body or a non-living body is taken as a training label, and a structure weight corresponding to the candidate operation between the nodes of the candidate network structure is adjusted in the training process.
According to an embodiment of the present disclosure, in the first training, the structural weights with which a previous node of the lightweight micro-structure reaches a subsequent node through the plurality of candidate operations are randomly generated values. In the kth training iteration (k ≥ 2, k an integer), the structural weight corresponding to each candidate operation is adjusted according to the loss function characterizing the difference between the training label and the output of the candidate network structure under the structural weights of the (k-1)th iteration; the adjusted structural weights are used as the structural weights applied to the image features input in the kth iteration. Training is considered complete when the number of iterations reaches a set value or the loss function falls below a set threshold.
In step S750, the candidate network structures are simplified according to the trained structure weights, so as to obtain a target network structure.
According to an embodiment of the present disclosure, the simplifying the candidate network structure according to the trained structure weight to obtain a target network structure includes: screening out a candidate branch path with the maximum structure weight in the plurality of candidate branch paths aiming at the plurality of candidate branch paths among each group of adjacent nodes; and determining the candidate branch path with the highest structure weight as a target path, wherein a network formed by the target path and the corresponding target node is used as the target network structure.
In step S760, a living body authentication model is generated based on the target network structure.
According to an embodiment of the present disclosure, in the step S760, generating a living body identification model according to the target network structure includes: taking the target network structure as a living body identification model; or carrying out parameter optimization on the target network structure to obtain a living body identification model.
Wherein, carrying out parameter optimization on the target network structure to obtain a living body identification model, comprises: and constructing optimization of the structural weight between nodes in the target network structure and the network weight inside the operation into a two-stage optimization problem, wherein in the two-stage optimization problem, the constraint conditions comprise: based on the corresponding optimal network weight when the loss function of the training set converges; the objective function is: when the optimal network weight is fixedly adopted in the operation, solving the corresponding optimal structure weight when the loss function based on the verification set is converged; generating an actual training set and an actual verification set according to the corpus; based on the actual training set and the actual verification set, alternately executing the following two steps until an optimal solution is found: inputting the actual training set into the target network structure, and updating the network weight of the target network structure based on a gradient descent method; and taking the updated network weight as a parameter value inside the operation of the target network structure, inputting the actual verification set into the target network structure with a fixed parameter value inside the operation, and updating the structure weight of the target network structure with the fixed parameter value inside the operation based on a gradient descent method.
Based on steps S710-S760, the image features extracted by the pre-trained image feature extraction model are input into the lightweight micro-structure as the initial node, and a network structure search is performed according to the candidate operations predefined in the search space of the lightweight micro-structure, yielding a candidate network structure including all candidate branch paths and candidate nodes. The real result of whether the captured object of the image data is a living body or a non-living body serves as the training label, and the structural weights corresponding to the candidate operations between the nodes of the candidate network structure are adjusted during training, so that a structural weight can be assigned to each operation on each candidate branch path according to the training effect, and the hyper-parameters of the network structure adapt to the image data and can reach an optimum. The relative sizes of the trained structural weights of the candidate operations between the nodes represent the most probable connection relationships between adjacent nodes; the candidate network structure is simplified according to these structural weights, and the image identification model generated from the simplified target network structure has stable, good image identification performance without requiring the model hyper-parameters to be set manually according to expert experience.
Compared with the first embodiment, this embodiment differs only in the processing object (voice versus image); the overall execution logic is similar. For other details of training the lightweight micro-structure and constructing the living body identification model, reference may be made to the description of the first embodiment, which is not repeated here.
A third exemplary embodiment of the present disclosure provides a parameter-homing model building apparatus.
Fig. 8 schematically shows a block diagram of a structure of a model building apparatus for auto-referencing according to an embodiment of the present disclosure.
Referring to fig. 8, an apparatus 800 for constructing a model for auto-referencing according to an embodiment of the present disclosure includes: a data acquisition module 801, a feature extraction module 802, a candidate network structure search construction module 803, a structure weight adjustment module 804, a target network structure generation module 805, and a model construction module 806.
The data acquiring module 801 is configured to acquire training data in a training set, where the training data includes at least one of voice data and image data, the voice data includes real and fake voice data, and the image data includes image data obtained by shooting a living object and a non-living object.
The feature extraction module 802 is configured to perform feature extraction on the training data based on a pre-training model to obtain training features.
The candidate network structure search building module 803 is configured to input the training features into a lightweight micro-structure and use the lightweight micro-structure as an initial node, and perform network structure search according to candidate operations predefined in a search space of the lightweight micro-structure to obtain a candidate network structure including all candidate branch paths and candidate nodes; wherein the candidate operation characterizes a network connection relationship from a previous node to a next node.
The structure weight adjusting module 804 is configured to adjust a structure weight corresponding to the candidate operation between the nodes of the candidate network structure in a training process based on the training label corresponding to the training data. When the training data is voice data, using the real result that the voice data is true voice as a training label; when the training data is image data, a real result that an object of which the image data is captured is a living body or a non-living body is used as a training label.
The target network structure generation module 805 is configured to simplify the candidate network structure according to the trained structure weight, so as to obtain a target network structure.
The model building module 806 is configured to generate a target model according to the target network structure, where the target model includes at least one of a voice authentication model and a living body authentication model.
The execution logic of each functional module included in the apparatus 800 may refer to the detailed description in the first embodiment and the second embodiment, and specific execution logic is incorporated as a specific functional sub-module into this embodiment.
Any number of the functional modules included in the apparatus 800 may be combined and implemented as one module, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of other modules and implemented in one module. At least one of the functional modules included in the apparatus 800 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, or an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging circuits, or in any one of the three implementations of software, hardware, and firmware, or in any suitable combination of them. Alternatively, at least one of the functional modules included in the apparatus 800 may be implemented at least partially as a computer program module which, when executed, performs the corresponding function.
A fourth exemplary embodiment of the present disclosure provides an electronic apparatus.
Fig. 9 schematically shows a block diagram of an electronic device provided by an embodiment of the present disclosure.
Referring to fig. 9, an electronic device 900 provided in the embodiment of the present disclosure includes a processor 901, a communication interface 902, a memory 903, and a communication bus 904, where the processor 901, the communication interface 902, and the memory 903 communicate with one another through the communication bus 904; the memory 903 is configured to store a computer program; and the processor 901 is configured to implement, when executing the program stored in the memory, the above-described method for constructing the automatic parameter-searching voice authentication model or living-body authentication model.
A fifth exemplary embodiment of the present disclosure also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the above-described method of constructing the automatic parameter-searching voice authentication model or living-body authentication model.
The computer-readable storage medium may be contained in the apparatus/device described in the above embodiments; or may be present alone without being assembled into the device/apparatus. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The foregoing is merely a description of exemplary embodiments of the present disclosure, enabling those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A method for constructing an automatic parameter-searching speech authentication model, characterized by comprising the following steps:
acquiring voice data in a training set, wherein the voice data comprises real and fake voice data;
performing feature extraction on the voice data based on a pre-trained voice feature extraction model to obtain voice features;
inputting the voice features into a lightweight differentiable structure as an initial node, and performing a network structure search according to candidate operations predefined by the search space of the lightweight differentiable structure to obtain a candidate network structure comprising all candidate branch paths and candidate nodes; wherein a candidate operation characterizes the network connection relationship from a previous node to a subsequent node;
taking the ground-truth result of whether the voice data is genuine or fake voice as the training label, and adjusting, during training, the structure weights corresponding to the candidate operations between the nodes of the candidate network structure;
simplifying the candidate network structure according to the trained structure weight to obtain a target network structure;
generating a voice authentication model according to the target network structure;
wherein the predefined candidate operations include:
3 × 3 separable convolution, 5 × 5 separable convolution, 3 × 3 dilated convolution, 5 × 5 dilated convolution, 3 × 3 average pooling, 3 × 3 max pooling, skip connection, zero, and max feature map, where the max feature map selects, at the element level, the larger of the corresponding elements in two input feature matrices of identical dimensions.
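The patent names the max feature map operation but gives no code for it; a minimal plain-Python sketch of its element-level selection is below (the function name and the list-of-lists matrix representation are illustrative assumptions, not part of the patent):

```python
def max_feature_map(a, b):
    """Element-level max of two feature matrices with identical dimensions.

    `a` and `b` are nested lists (rows of floats); the result keeps, for
    every position, the larger of the two corresponding elements.
    """
    assert len(a) == len(b) and all(len(ra) == len(rb) for ra, rb in zip(a, b)), \
        "inputs must have identical dimensions"
    return [[max(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Example: two 2x2 feature matrices
out = max_feature_map([[1.0, 5.0], [3.0, 0.0]],
                      [[2.0, 4.0], [1.0, 6.0]])
# out == [[2.0, 5.0], [3.0, 6.0]]
```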
2. The construction method according to claim 1, wherein, during the first training pass, the structure weight with which a previous node of the lightweight differentiable structure reaches a subsequent node via each of a plurality of candidate operations is a randomly generated value; during the k-th training pass, where k ≥ 2 and k is an integer, the structure weight corresponding to each candidate operation is adjusted according to a loss function characterizing the gap between the training label and the output of the candidate network structure under the structure weights of the (k−1)-th pass, and the adjusted structure weights are used as the structure weights for the voice features input in the k-th pass; training is considered finished when the number of training passes reaches a set value or the loss function falls below a set threshold.
3. The construction method according to claim 2, wherein, in the lightweight differentiable structure, for each subsequent node after the initial node, the node value of a target subsequent node reached by a plurality of candidate branch paths from the same target previous node is determined by the following formula:
node value of the subsequent node along the candidate branch path corresponding to a specific candidate operation = node value of the target previous node × softmax value of the structure weight of that specific candidate operation; when the target previous node is the initial node, its node value is the input feature of the lightweight differentiable structure.
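Claim 3's per-branch formula is the continuous relaxation familiar from differentiable architecture search: each branch's contribution is scaled by the softmax of its structure weight. A minimal sketch with a scalar node value and a toy operation set follows; the patent does not fix these details, and the explicit application of each operation to the previous node's value is an assumption made here for concreteness:

```python
import math

def softmax(weights):
    """Softmax over a list of structure weights."""
    exps = [math.exp(w) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def subsequent_node_value(prev_value, candidate_ops, struct_weights):
    """Mix the candidate branches reaching one subsequent node.

    Each branch contributes op(previous node value) scaled by the softmax
    of its structure weight, matching the per-branch formula of claim 3;
    the subsequent node's value is the sum over the branches.
    """
    probs = softmax(struct_weights)
    return sum(p * op(prev_value) for p, op in zip(probs, candidate_ops))

# Hypothetical candidate operations acting on a scalar "node value":
ops = [lambda x: x,        # skip connection
       lambda x: 0.0,      # zero (no connection)
       lambda x: 2.0 * x]  # stand-in for a convolution
value = subsequent_node_value(1.0, ops, [0.0, 0.0, 0.0])
# equal weights -> softmax = 1/3 each -> (1 + 0 + 2) / 3 = 1.0
```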
4. The construction method according to claim 1, wherein simplifying the candidate network structure according to the trained structure weights to obtain a target network structure comprises:
for each group of adjacent nodes, screening out, from the plurality of candidate branch paths between them, the candidate branch path with the largest structure weight;
and determining the candidate branch path with the largest structure weight as a target path, wherein the network formed by the target paths and the corresponding target nodes serves as the target network structure.
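The two screening steps of claim 4 amount to a single argmax over the structure weights on each edge; the edge and operation names and the weight values in this sketch are hypothetical:

```python
def prune_to_target_structure(edge_weights):
    """Keep, for each adjacent-node pair, the candidate operation with the
    largest trained structure weight.

    `edge_weights` maps a (previous node, subsequent node) pair to a dict
    of {candidate operation name: structure weight}.  Returns the target
    network structure as {edge: chosen operation}.
    """
    return {edge: max(ops, key=ops.get) for edge, ops in edge_weights.items()}

# Hypothetical trained structure weights for two edges:
candidate = {
    (0, 1): {"sep_conv_3x3": 0.7, "skip": 0.2, "zero": 0.1},
    (1, 2): {"dil_conv_5x5": 0.3, "max_pool_3x3": 0.6, "mfm": 0.1},
}
target = prune_to_target_structure(candidate)
# target == {(0, 1): "sep_conv_3x3", (1, 2): "max_pool_3x3"}
```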
5. The construction method according to claim 1, wherein generating a voice authentication model according to the target network structure comprises:
using the target network structure as the voice authentication model; or,
performing parameter optimization on the target network structure to obtain the voice authentication model;
wherein performing parameter optimization on the target network structure to obtain the voice authentication model comprises:
constructing the optimization of the structure weights between nodes of the target network structure and of the network weights inside the operations as a bi-level optimization problem, wherein the constraint condition is the optimal network weight obtained when the loss function on the training set converges, and the objective function is to solve for the optimal structure weight at which the loss function on the verification set converges while the optimal network weight is held fixed inside the operations;
generating an actual training set and an actual verification set from a corpus;
alternately performing the following two steps on the actual training set and the actual verification set until the optimal solution is found: inputting the actual training set into the target network structure and updating the network weights of the target network structure by gradient descent; and, taking the updated network weights as the parameter values inside the operations of the target network structure, inputting the actual verification set into the target network structure with those internal parameter values fixed, and updating the structure weights of that fixed-parameter target network structure by gradient descent.
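The alternating two-step procedure of claim 5 can be illustrated on a toy bi-level problem: one gradient step on the network weight w using the training loss with the structure weight alpha fixed, then one step on alpha using the validation loss with w fixed. The quadratic losses and learning rate below are illustrative stand-ins, not the patent's actual objectives:

```python
def train_loss_grad(w, alpha):
    # d/dw of (w - alpha)^2: pulls the network weight toward alpha
    return 2.0 * (w - alpha)

def val_loss_grad(w, alpha):
    # d/dalpha of (alpha - 0.5*w)^2: pulls the structure weight toward 0.5*w
    return 2.0 * (alpha - 0.5 * w)

def alternate(w, alpha, lr=0.1, steps=200):
    """Alternately update the network weight w (training set, alpha fixed)
    and the structure weight alpha (validation set, w fixed)."""
    for _ in range(steps):
        w -= lr * train_loss_grad(w, alpha)    # step on the training loss
        alpha -= lr * val_loss_grad(w, alpha)  # step on the validation loss
    return w, alpha

w, alpha = alternate(1.0, 1.0)
# both converge toward the fixed point w = alpha = 0 of this toy problem
```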
6. The construction method according to claim 1, wherein the pre-trained speech feature extraction model is a Wav2vec model comprising two convolutional neural networks, an encoder network and a context network; the encoder network encodes the input speech data, at a given division step size, into encoding vectors in a hidden-layer space; and the context network outputs, from the encoding vectors of a plurality of consecutive steps, the voice features carrying context information corresponding to each step.
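Claim 6's encoder/context split can be shown schematically; the sketch below is not the actual Wav2vec convolutional implementation, only a toy stand-in illustrating the two roles (a strided per-step encoding, then an aggregation over consecutive steps that carries context):

```python
def encoder(samples, step=4):
    """Toy stand-in for the encoder network: split raw speech samples into
    non-overlapping steps of size `step` and map each to a 1-D 'encoding
    vector' (here simply the step mean)."""
    return [sum(samples[i:i + step]) / step
            for i in range(0, len(samples) - step + 1, step)]

def context(encodings, receptive=3):
    """Toy stand-in for the context network: each output step aggregates
    the current and up to `receptive - 1` previous encodings, so every
    feature carries context information."""
    out = []
    for i in range(len(encodings)):
        window = encodings[max(0, i - receptive + 1):i + 1]
        out.append(sum(window) / len(window))
    return out

feats = context(encoder([1.0] * 16, step=4))
# a constant input yields constant context features: [1.0, 1.0, 1.0, 1.0]
```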
7. An apparatus for constructing an automatic parameter-searching speech authentication model, characterized by comprising:
the data acquisition module is used for acquiring training data in a training set, wherein the training data comprises voice data, and the voice data comprises real and fake voice data;
the feature extraction module is used for performing feature extraction on the training data based on a pre-trained model to obtain training features;
the candidate network structure searching and constructing module is used for inputting the training features into a lightweight differentiable structure as an initial node, and performing a network structure search according to candidate operations predefined in the search space of the lightweight differentiable structure to obtain a candidate network structure comprising all candidate branch paths and candidate nodes; wherein a candidate operation characterizes the network connection relationship from a previous node to a subsequent node;
a structure weight adjusting module, configured to adjust, based on a training label corresponding to the training data, a structure weight corresponding to a candidate operation between nodes of the candidate network structure in a training process;
the target network structure generation module is used for simplifying the candidate network structure according to the trained structure weight to obtain a target network structure;
the model construction module is used for generating a voice authentication model according to the target network structure;
wherein the predefined candidate operations include:
3 × 3 separable convolution, 5 × 5 separable convolution, 3 × 3 dilated convolution, 5 × 5 dilated convolution, 3 × 3 average pooling, 3 × 3 max pooling, skip connection, zero, and max feature map, wherein the max feature map selects, at the element level, the larger of the corresponding elements in two input feature matrices of identical dimensions.
8. An electronic device is characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method of any one of claims 1 to 6 when executing a program stored on a memory.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1-6.
CN202210859650.6A 2022-07-21 2022-07-21 Method and device for constructing automatic parameter-searching speech identification model Active CN115083421B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210859650.6A CN115083421B (en) 2022-07-21 2022-07-21 Method and device for constructing automatic parameter-searching speech identification model

Publications (2)

Publication Number Publication Date
CN115083421A CN115083421A (en) 2022-09-20
CN115083421B (en) 2022-11-15

Family

ID=83243675

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210859650.6A Active CN115083421B (en) 2022-07-21 2022-07-21 Method and device for constructing automatic parameter-searching speech identification model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113314148A (en) * 2021-07-29 2021-08-27 中国科学院自动化研究所 Light-weight neural network generated voice identification method and system based on original waveform
CN113362814A (en) * 2021-08-09 2021-09-07 中国科学院自动化研究所 Voice identification model compression method fusing combined model information
CN113744729A (en) * 2021-09-17 2021-12-03 北京达佳互联信息技术有限公司 Speech recognition model generation method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112233664B (en) * 2020-10-15 2021-11-09 北京百度网讯科技有限公司 Training method, device and equipment of semantic prediction network and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
NeuralDPS: Neural Deterministic Plus Stochastic Model With Multiband Excitation for Noise-Controllable Waveform Generation; Jianhua Tao et al.; IEEE; 2022-01-05; pp. 865-878 *
Speech forgery detection based on a global time-frequency attention network; Wang Chenglong et al.; Journal of Computer Research and Development; 2021-08-06; pp. 1466-1475 *
Data augmentation for language models based on an adversarial training strategy; Zhang Yike et al.; Acta Automatica Sinica; 2018-04-18 (No. 05); pp. 126-135 *
Target recognition method based on neural architecture search; Bian Weiwei et al.; Journal of Air Force Engineering University (Natural Science Edition); 2020-08-25 (No. 04); pp. 92-96 *
Passive speech forensics combining multiple features; Lin Xiaodan et al.; Microcomputer & Its Applications; 2012-12-21; pp. 39-41 *

Similar Documents

Publication Publication Date Title
CN107609572B (en) Multi-modal emotion recognition method and system based on neural network and transfer learning
Shi et al. Few-shot acoustic event detection via meta learning
US20200279156A1 (en) Feature fusion for multi-modal machine learning analysis
WO2020098256A1 (en) Speech enhancement method based on fully convolutional neural network, device, and storage medium
JP2775140B2 (en) Pattern recognition method, voice recognition method, and voice recognition device
CN109840531A (en) The method and apparatus of training multi-tag disaggregated model
Mo et al. Neural architecture search for keyword spotting
Bai et al. A Time Delay Neural Network with Shared Weight Self-Attention for Small-Footprint Keyword Spotting.
CN113191461B (en) Picture identification method, device and equipment and readable storage medium
Gope et al. Ternary hybrid neural-tree networks for highly constrained iot applications
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
CN113837940A (en) Image super-resolution reconstruction method and system based on dense residual error network
KR20220098991A (en) Method and apparatus for recognizing emtions based on speech signal
KR102406512B1 (en) Method and apparatus for voice recognition
CN111694977A (en) Vehicle image retrieval method based on data enhancement
CN113449671A (en) Multi-scale and multi-feature fusion pedestrian re-identification method and device
CN113196385B (en) Method and system for audio signal processing and computer readable storage medium
WO2022083165A1 (en) Transformer-based automatic speech recognition system incorporating time-reduction layer
CN114494783A (en) Pre-training method based on dynamic graph neural network
CN115083421B (en) Method and device for constructing automatic parameter-searching speech identification model
Lin et al. Domestic activities clustering from audio recordings using convolutional capsule autoencoder network
CN116030537A (en) Three-dimensional human body posture estimation method based on multi-branch attention-seeking convolution
CN113314148B (en) Light-weight neural network generated voice identification method and system based on original waveform
CN113032612B (en) Construction method of multi-target image retrieval model, retrieval method and device
CN114610922A (en) Image processing method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant