CN113744729A - Speech recognition model generation method, device, equipment and storage medium - Google Patents

Speech recognition model generation method, device, equipment and storage medium

Info

Publication number
CN113744729A
CN113744729A
Authority
CN
China
Prior art keywords
network
voice
speech recognition
layer
voice recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111095442.5A
Other languages
Chinese (zh)
Inventor
朱文涛
陆顺
孔天龙
李吉祥
张大威
邓峰
王晓瑞
杨森
刘霁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202111095442.5A priority Critical patent/CN113744729A/en
Publication of CN113744729A publication Critical patent/CN113744729A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The disclosure provides a method, a device, equipment and a storage medium for generating a voice recognition model, relating to the technical field of network information processing, for obtaining a better voice recognition model. The method comprises the following steps: acquiring voice sample data; constructing a first voice recognition super network, wherein the first voice recognition super network comprises a plurality of layers of network structures, each layer of network structure corresponds to a plurality of different combinations of values of search features, and the search features comprise the number of branches, network layer dimensions and channel selection dimensions; performing a training operation on the first voice recognition super network based on the voice sample and the voice sample label to obtain a second voice recognition super network, the voice sample label serving as an expected recognition value corresponding to the voice sample; performing a network search on the second voice recognition super network to obtain a target voice recognition sub-network; and retraining the target voice recognition sub-network to obtain a speech recognition model.

Description

Speech recognition model generation method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of network information processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for generating a speech recognition model.
Background
Compared with other biometric technologies, such as face recognition, fingerprint recognition and iris recognition, voice information is easier to collect; meanwhile, applications of voice information recognition are gradually extending to many fields.
Current speech recognition technology can perform speech recognition through deep-learning models such as the Time-Delay Neural Network (TDNN) and the extended Time-Delay Neural Network (E-TDNN). However, the error rate of existing speech recognition is high, which degrades the final recognition effect.
Disclosure of Invention
The present disclosure provides a speech recognition model generation method, apparatus, device, and storage medium, so as to provide a better speech recognition model. The technical scheme of the disclosure is as follows:
according to a first aspect of the present disclosure, there is provided a speech recognition model generation method including: the electronic equipment acquires voice sample data; the voice sample data comprises a voice sample and a voice sample label; constructing a first voice recognition super network, wherein the first voice recognition super network comprises a plurality of layers of network structures, each layer of network structure corresponds to a plurality of different combinations of values of search characteristics, and the search characteristics comprise the number of branches, network layer dimensions and channel selection dimensions; based on the voice sample and the voice sample label, training operation is carried out on the first voice recognition hyper-network to obtain a second voice recognition hyper-network; the voice sample label is used as an expected identification value corresponding to the voice sample; performing network search on the second voice recognition super network to obtain a target voice recognition sub network; the target voice recognition sub-network comprises a multi-layer network structure, and each layer of network structure corresponds to a combination of search characteristic values; and retraining the target speech recognition sub-network to obtain a speech recognition model.
Optionally, the method further comprises: determining the number of a plurality of branches of each layer of network structure in the first voice recognition super network, and constructing a branch module of each layer of network structure according to the number of each branch; and performing first data processing on the characteristics of the branch modules of each layer of the network structure to obtain a plurality of network layer dimensions of each layer of the network structure, wherein the first data processing comprises at least one of merging processing, multi-order processing or splicing processing.
Optionally, the method further comprises: determining the number of a plurality of branches of each layer of network structure in the first voice recognition super network, and constructing a plurality of branch modules of each layer of network structure according to the number of each branch; and performing second data processing on the channel selection layer of the branch module of each layer of the network structure to obtain a plurality of channel selection dimensions of each layer of the network structure, wherein the second data processing comprises full connection processing and/or matrix processing.
Optionally, based on the voice sample and the voice sample label, performing a training operation on the first voice recognition super network to obtain a second voice recognition super network includes: step A: randomly sampling the plurality of different combinations of search feature values corresponding to each layer of the network structure in the first voice recognition super network to obtain one combination of search feature values corresponding to each layer of the network structure, and obtaining a first voice recognition sub-network based on the combination of search feature values corresponding to each layer of the network structure; step B: training the first voice recognition sub-network according to the voice sample and the voice sample label to obtain a second voice recognition sub-network; step C: synchronizing the parameters in the second voice recognition sub-network into the first voice recognition super network; and iteratively executing steps A to C to obtain the second voice recognition super network.
Optionally, the voice sample includes a plurality of groups of voice subsamples, and step B includes: training the first voice recognition sub-network in multiple batches according to the voice samples and the voice sample labels to obtain multiple second voice recognition sub-networks; wherein each batch of training employs a set of speech subsamples.
Optionally, performing a network search on the second speech recognition super network to obtain a target speech recognition sub-network, including: sampling the second voice recognition super network for multiple times to obtain a plurality of third voice recognition sub-networks, wherein the sampling comprises randomly sampling a plurality of different combinations of the search feature values corresponding to each layer of network structure in the second voice recognition super network, and obtaining the third voice recognition sub-networks according to one combination of the search feature values corresponding to each layer of network structure obtained by random sampling; determining error rates for a plurality of third speech recognition subnetworks; and determining the third voice recognition sub-network with the error rate meeting the preset condition as the target voice recognition sub-network in the plurality of third voice recognition sub-networks.
Optionally, determining an error rate of the plurality of third speech recognition subnetworks comprises: determining an initial error rate for a plurality of third speech recognition subnetworks; adjusting parameters in the plurality of third voice recognition sub-networks based on the initial error rate to obtain a plurality of optimized third voice recognition sub-networks; determining an error rate of the optimized third speech recognition subnetwork; determining a third speech recognition sub-network with an error rate meeting a preset condition from the plurality of third speech recognition sub-networks as a target speech recognition sub-network, comprising: and determining the optimized third voice recognition sub-networks with the error rates meeting the preset conditions in the optimized third voice recognition sub-networks as the target voice recognition sub-networks.
Optionally, determining an initial error rate of the plurality of third speech recognition subnetworks comprises: and inputting the combination of the search feature values corresponding to each layer of the network structure of the third voice recognition sub-networks into the error rate prediction model to obtain the initial error rates of the third voice recognition sub-networks, wherein the initial error rates are used for representing the voice sample recognition capability of the third voice recognition sub-networks.
Optionally, adjusting parameters in the plurality of third voice recognition sub-networks based on the initial error rate to obtain a plurality of optimized third voice recognition sub-networks includes: performing a parameter adjustment operation on the plurality of third voice recognition sub-networks to obtain the plurality of optimized third voice recognition sub-networks, wherein the parameter adjustment operation includes: determining update directions of the parameters in the plurality of third voice recognition sub-networks according to the initial error rates of the third voice recognition sub-networks and a preset acquisition function; and updating the parameters in the plurality of third voice recognition sub-networks according to the update directions.
Optionally, determining the update directions of the parameters in the plurality of third voice recognition sub-networks according to the initial error rates of the third voice recognition sub-networks and the preset acquisition function includes: processing target parameter pairs with a multivariate Gaussian distribution function so that the target parameter pairs obey a multivariate Gaussian distribution, wherein a target parameter pair comprises a value of the target parameter and the initial error rate corresponding to the target parameter, the target parameter being each parameter in each of the plurality of third voice recognition sub-networks; and, in combination with the multivariate Gaussian distribution function, obtaining the update direction of the value of the target parameter under the condition that the preset acquisition function is maximized.
Optionally, retraining the target speech recognition subnetwork to obtain a speech recognition model, including: inputting the voice sample into a target voice recognition sub-network to obtain a recognition label of the voice sample; determining a loss value between the recognition tag and the voice sample tag based on a loss function; and according to the loss value, iteratively updating the parameters of the target speech recognition sub-network to obtain a speech recognition model.
Optionally, the method further includes: acquiring a voice signal to be recognized; and inputting the voice signal to be recognized into the voice recognition model to obtain a voice sample label corresponding to the voice signal to be recognized.
According to a second aspect of the present disclosure, there is provided a speech recognition model generation apparatus including an acquisition module and a processing module. An acquisition module configured to acquire voice sample data; the voice sample data comprises a voice sample and a voice sample label; the processing module is configured to construct a first voice recognition super network, the first voice recognition super network comprises a plurality of layers of network structures, each layer of network structure corresponds to a plurality of different combinations of values of search features, and the search features comprise branch numbers, network layer dimensions and channel selection dimensions; the processing module is also configured to execute training operation on the first voice recognition hyper-network based on the voice sample and the voice sample label to obtain a second voice recognition hyper-network; the voice sample label is used as an expected identification value corresponding to the voice sample; a processing module further configured to perform a network search on the second speech recognition super network to obtain a target speech recognition sub-network; the target voice recognition sub-network comprises a multi-layer network structure, and each layer of network structure corresponds to a combination of search characteristic values; and the processing module is also configured to retrain the target speech recognition sub-network to obtain a speech recognition model.
Optionally, the processing module is further configured to determine a number of multiple branches of each layer of network structure in the first voice recognition super network, and construct a branch module of each layer of network structure according to the number of each branch; the processing module is further configured to perform first data processing on the characteristics of the branch modules of each layer of the network structure to obtain a plurality of network layer dimensions of each layer of the network structure, and the first data processing includes at least one of merging processing, multi-order processing or splicing processing. Optionally, the processing module is further configured to determine a number of multiple branches of each layer of network structure in the first voice recognition super network, and construct multiple branch modules of each layer of network structure according to the number of each branch; and the processing module is also configured to perform second data processing on the channel selection layer of the branch module of each layer of the network structure to obtain a plurality of channel selection dimensions of each layer of the network structure, and the second data processing comprises full connection processing and/or matrix processing.
Optionally, the processing module is further configured to execute step A: randomly sampling the plurality of different combinations of search feature values corresponding to each layer of the network structure in the first voice recognition super network to obtain one combination of search feature values corresponding to each layer of the network structure, and obtaining a first voice recognition sub-network based on the combination of search feature values corresponding to each layer of the network structure; the processing module is further configured to execute step B: training the first voice recognition sub-network according to the voice sample and the voice sample label to obtain a second voice recognition sub-network; the processing module is further configured to execute step C: synchronizing the parameters in the second voice recognition sub-network into the first voice recognition super network; and the processing module is further configured to iteratively execute steps A to C to obtain the second voice recognition super network.
Optionally, the processing module is further configured to train the first speech recognition sub-networks in multiple batches according to the speech samples and the speech sample tags, so as to obtain multiple second speech recognition sub-networks; wherein each batch of training employs a set of speech subsamples.
Optionally, the processing module is further configured to perform sampling processing on the second voice recognition super network for multiple times to obtain multiple third voice recognition subnetworks, where the sampling processing includes performing random sampling on multiple different combinations of search feature values corresponding to each layer of network structure in the second voice recognition super network, and obtaining the third voice recognition subnetworks according to one combination of search feature values corresponding to each layer of network structure obtained by the random sampling; a processing module further configured to determine error rates for a plurality of third speech recognition subnetworks; and the processing module is also configured to determine a third voice recognition sub-network with an error rate meeting a preset condition in the plurality of third voice recognition sub-networks as the target voice recognition sub-network.
Optionally, the processing module is further configured to determine an initial error rate of a plurality of third speech recognition subnetworks; the processing module is further configured to adjust parameters in the plurality of third voice recognition sub-networks based on the initial error rate to obtain a plurality of optimized third voice recognition sub-networks; a processing module further configured to determine an error rate of the optimized third speech recognition subnetwork; and the processing module is also configured to determine the optimized third voice recognition sub-networks with the error rates meeting the preset conditions in the optimized third voice recognition sub-networks as the target voice recognition sub-networks.
Optionally, the processing module is further configured to input a combination of the search feature values corresponding to each layer of the network structure of the plurality of third speech recognition subnetworks into the error rate prediction model, so as to obtain an initial error rate of the plurality of third speech recognition subnetworks, where the initial error rate is used to characterize the capability of the third speech recognition subnetwork to recognize the speech sample.
Optionally, the processing module is further configured to perform a parameter adjustment operation on the plurality of third speech recognition sub-networks to obtain a plurality of optimized third speech recognition sub-networks, where the parameter adjustment operation includes: determining update directions of the parameters in the plurality of third voice recognition sub-networks according to the initial error rates of the third voice recognition sub-networks and a preset acquisition function; and updating the parameters in the plurality of third voice recognition sub-networks according to the update directions.
Optionally, the processing module is further configured to process the target parameter pairs with a multivariate Gaussian distribution function so that the target parameter pairs obey a multivariate Gaussian distribution, wherein a target parameter pair comprises a value of the target parameter and the initial error rate corresponding to the target parameter, the target parameter being each parameter in each of the plurality of third speech recognition sub-networks; and the processing module is further configured to obtain, in combination with the multivariate Gaussian distribution function, the update direction of the value of the target parameter under the condition that the preset acquisition function is maximized.
Optionally, the processing module is further configured to input the voice sample into the target voice recognition sub-network, so as to obtain a recognition tag of the voice sample; a processing module further configured to determine a loss value between the recognition tag and the voice sample tag based on a loss function; and the processing module is also configured to iteratively update the parameters of the target speech recognition sub-network according to the loss value to obtain a speech recognition model.
Optionally, the processing module is further configured to acquire a speech signal to be recognized; and the processing module is also configured to input the voice signal to be recognized into the voice recognition model to obtain a voice sample label corresponding to the voice signal to be recognized.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: a processor and a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement any one of the alternative speech recognition model generation methods as described in the first aspect above.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium having instructions stored thereon, which, when executed by a processor of an electronic device, enable the electronic device to perform any one of the optional speech recognition model generation methods as described in the first aspect above.
According to a fifth aspect of the present disclosure, there is provided a computer program product containing instructions for implementing an alternative speech recognition model generation method as described in any of the first aspects above when the instructions in the computer program product are executed by a processor of an electronic device.
The technical scheme provided by the embodiment of the disclosure at least has the following beneficial effects:
in the scheme, the constructed first voice recognition super network comprises a plurality of layers of network structures, each layer of network structure corresponds to a plurality of different combinations of values of search features, and the search features comprise the number of branches, network layer dimensions and channel selection dimensions; based on voice sample data, the first voice recognition super network is trained to obtain a second voice recognition super network, a network search is performed on the second voice recognition super network to obtain a target voice recognition sub-network, and the target voice recognition sub-network is retrained to obtain a speech recognition model. The number of branches represents the number of parallel sub-networks in each layer of network structure, and different numbers of parallel sub-networks correspond to different data processing capacities; the network layer dimension determines the complexity of the learned features and directly influences the learning difficulty; the channel selection dimension determines the learning depth of each layer of the network, so that the network structure can be expressed more accurately. Therefore, the number of branches, the network layer dimension and the channel selection dimension not only affect the complexity of the network structure but are also directly related to its performance. By determining suitable, performance-related values for the number of branches, the network layer dimension and the channel selection dimension, the method establishes a first voice recognition super network with lower complexity and better performance, and a speech recognition model with better performance can then be obtained from the second voice recognition super network through network search. Moreover, the specific values of the number of branches, the network layer dimension and the channel selection dimension are suitable values determined from a plurality of possible values rather than fixed values, and suitable values yield better network performance. That is, a network structure with better performance can be obtained from the plurality of possible network structures corresponding to the plurality of possible values. In the prior art, by contrast, each layer of the network structure is a model with fixed parameters; the network structure design of the present disclosure is more reasonable and brings a better voice recognition effect.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is one of the flow diagrams illustrating a method of generating a speech recognition model according to an exemplary embodiment;
FIG. 2 is a second flowchart illustrating a method of generating a speech recognition model according to an exemplary embodiment;
FIG. 3 illustrates a schematic diagram of the determination of search characteristics;
FIG. 4 is a third flowchart illustrating a method of generating a speech recognition model according to an exemplary embodiment;
FIG. 5A is a fourth flowchart illustrating a method of generating a speech recognition model in accordance with an exemplary embodiment;
FIG. 5B is a fifth flowchart illustrating a method of generating a speech recognition model according to an exemplary embodiment;
FIG. 6A shows a schematic diagram of a random sampling application;
FIG. 6B shows a Bayesian optimization application diagram;
FIG. 6C illustrates a diagram of a target speech recognition sub-network retraining application;
FIG. 7 is a sixth flowchart illustrating a method of generating a speech recognition model according to an exemplary embodiment;
FIG. 8 shows a trend graph for different model parameters and error rates;
FIG. 9 is a seventh flowchart illustrating a method of generating a speech recognition model in accordance with an exemplary embodiment;
FIG. 10 is a block diagram illustrating the structure of a speech recognition model generation apparatus in accordance with an exemplary embodiment;
fig. 11 is a schematic structural diagram of an electronic device shown in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The data to which the present disclosure relates may be data that is authorized by a user or sufficiently authorized by parties. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, and/or components.
Based on the above background, the embodiments of the present disclosure provide a voice recognition model generation method, which includes constructing a first voice recognition super network, wherein the first voice recognition super network comprises a multi-layer network structure, each layer of network structure corresponds to a plurality of different combinations of search feature values, and the search features comprise the number of branches, network layer dimensions and channel selection dimensions; training the first voice recognition super network based on voice sample data to obtain a second voice recognition super network, performing a network search on the second voice recognition super network to obtain a target voice recognition sub-network, and retraining the target voice recognition sub-network to obtain a speech recognition model. The speech recognition model can use fewer parameters and obtain a better processing effect.
The following is an exemplary description of a speech recognition model generation method provided by the embodiments of the present disclosure:
the speech recognition model generation method provided by the disclosure can be applied to electronic equipment.
In some embodiments, the electronic device may be a server, a terminal, or other electronic devices for performing voice recognition, which is not limited in this disclosure.
The server may be a single server, or may be a server cluster including a plurality of servers. In some embodiments, the server cluster may also be a distributed cluster. The present disclosure is also not limited to a specific implementation of the server.
The terminal may be a mobile phone, a tablet computer, a desktop computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), an augmented reality (AR) device, a virtual reality (VR) device, or another device that can install and use a content community application (e.g., Kuaishou); the specific form of the electronic device is not particularly limited by the present disclosure. The terminal can perform human-computer interaction with a user through one or more of a keyboard, a touch pad, a touch screen, a remote controller, voice interaction, handwriting equipment and the like.
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments, which can be derived by one of ordinary skill in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
As shown in fig. 1, when the speech recognition model generation method is applied to an electronic device, the speech recognition model generation method may include:
and step 11, the electronic equipment acquires voice sample data.
The voice sample data comprises a voice sample and a voice sample label.
In some embodiments, the electronic device preprocesses the raw speech data using the Kaldi framework, extracts Mel-frequency cepstral coefficient (MFCC) features of the speech every 10 ms after the preprocessing, then normalizes the MFCC features using cepstral mean normalization (CMN) and removes silent segments; after silence removal, the MFCC feature sequence is partitioned into speech samples of 200 or 400 frames. Illustratively, the MFCC features may be 30-dimensional or 80-dimensional. The voice samples are various audio signals to be identified, such as speech audio and singing audio. The voice sample label is the real user corresponding to the voice sample.
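Illustratively, this preprocessing can be sketched as follows. This is a minimal sketch, not the patent's exact pipeline: it assumes the librosa library in place of the Kaldi framework named above, assumes 16 kHz audio, and omits the silence-removal step.

```python
# Minimal preprocessing sketch: 10 ms-hop MFCC extraction, cepstral mean
# normalization (CMN), and partitioning into fixed-length samples.
# Assumptions: librosa instead of Kaldi, 16 kHz audio, no silence removal.
import librosa

def preprocess(wav_path, n_mfcc=30, chunk_frames=200):
    y, sr = librosa.load(wav_path, sr=16000)             # assumed sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                hop_length=int(0.010 * sr))  # one frame per 10 ms
    mfcc = mfcc - mfcc.mean(axis=1, keepdims=True)       # CMN over the time axis
    T = mfcc.shape[1]
    return [mfcc[:, t:t + chunk_frames]                  # 200- (or 400-) frame samples
            for t in range(0, T - chunk_frames + 1, chunk_frames)]
```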
And step 12, the electronic equipment constructs a first voice recognition hyper-network.
The first voice recognition super network comprises a plurality of layers of network structures, each layer of network structure corresponds to a plurality of different combinations of values of search features, and the search features comprise the number of branches, network layer dimensions and channel selection dimensions.
In some embodiments, a search is performed in the search space based on the voice sample data, and the first voice recognition super network is constructed from the search results. The search space is the space corresponding to a target model that a user wants to search for through neural architecture search (NAS) technology. After the search space is determined, the multi-layer network structure of the first voice recognition super network is searched for in the search space in combination with the voice sample data. Each layer of network structure in the first voice recognition super network corresponds to a plurality of different combinations of search feature values, with one combination of search feature values corresponding to one node. Each layer of the network structure in the first voice recognition super network includes a plurality of nodes. The target model may be any model having speech recognition capability, which the present disclosure does not limit.
For example, the target model can be a dual attention mechanism (Dual Attention) model, ARET-25, Fast ResNet-34, TDNN, E-TDNN, F-TDNN, D-TDNN, or AutoSpeech, and the initial search space is the search space corresponding to the TDNN model.
Optionally, with reference to fig. 2, the method for generating a speech recognition model further includes:
and step 21, the electronic equipment determines the number of a plurality of branches of each layer of network structure in the first voice recognition super network, and constructs a branch module of each layer of network structure according to the number of each branch.
In some embodiments, a search is performed in the search space based on the voice sample data to determine the number of branches of the first-layer network structure in the first voice recognition super network; after the specific value of each branch number is determined, the branch modules corresponding to each branch number in the first-layer network structure are created according to that value (the number of branch modules is consistent with the number of branches). Specifically, one branch module is one branch network, and the network parameters of each branch network differ. The network parameters specifically refer to the different dilation rates of the branch modules; the dilation rate determines the size of the processing window over context data and increases the diversity of features, so that a better network structure can be trained. If the dilation rate is small, the branch can extract refined features; if the dilation rate is large, the branch can handle coarse-grained features. Therefore, more differentiated features can be extracted from the multiple branches of each layer of the network structure, which improves the discrimination capability of the network.
Illustratively, the search space is an 18-layer network structure, and the number of branches searched in the search space based on the voice sample data is b ∈ {2, 3}. Referring to fig. 3, in the search space, if the number of branches of a certain layer network structure of the first voice recognition super network is 2, that layer outputs 2 branch modules, and the dilation rates of the 2 branches are set to 1 and 3. The branch module with a dilation rate of 1 extracts refined features; the branch module with a dilation rate of 3 handles coarse-grained features. If the number of branches of a certain layer network structure of the first voice recognition super network is 3, that layer outputs 3 branch modules, and the dilation rates of the 3 branches are set to 1, 3 and 5. The branch module with a dilation rate of 1 extracts refined features; the branch module with a dilation rate of 3 handles coarser-grained features, and the branch module with a dilation rate of 5 handles the coarsest-grained features.
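Illustratively, such a multi-branch layer can be sketched in PyTorch as follows. This is a minimal sketch, not the patent's exact architecture: each branch is assumed to be a 1-D dilated convolution over frame features with the dilation rates 1/3/5 of the example above, and the channel sizes are free parameters.

```python
import torch.nn as nn

class BranchLayer(nn.Module):
    """One super-network layer with b parallel dilated-convolution branches."""
    def __init__(self, in_dim, out_dim, num_branches):
        super().__init__()
        dilations = [1, 3, 5][:num_branches]   # b=2 -> (1, 3); b=3 -> (1, 3, 5)
        self.branches = nn.ModuleList(
            nn.Conv1d(in_dim, out_dim, kernel_size=3, dilation=d, padding=d)
            for d in dilations                 # padding=d keeps the sequence length
        )

    def forward(self, h):                      # h: (batch, in_dim, T)
        return [branch(h) for branch in self.branches]
```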
And step 22, the electronic equipment performs first data processing on the characteristics of the branch modules of each layer of the network structure to obtain the network layer dimension of each layer of the network structure.
Wherein the first data processing includes at least one of a merge processing, a multi-stage processing, or a concatenation processing.
In some embodiments, the network layer dimension is the dimension of a core feature in each layer of the network structure in the first voice recognition super network, which determines the learning capability of that layer and the discrimination capability of the features. The larger the network layer dimension, the higher the complexity of feature learning: more complex features can be learned, but the learning difficulty is higher. The smaller the network layer dimension, the lower the learning difficulty, and the simpler the learned features. By performing the first data processing on the plurality of branch modules of each layer of the network structure in the first voice recognition super network, the plurality of network layer dimensions of each layer of the network structure can be obtained.
After the number of the multiple branches of each layer of the network structure in the first voice recognition super network is determined, for a given layer, the following operation is performed on the branch modules corresponding to each branch number among the multiple branch numbers of that layer, so as to obtain the network layer dimension corresponding to each branch number. Specifically, the features of all the branch modules corresponding to one branch number are merged, as shown in formula (1) below; after the merging, the feature mean, feature variance, third-order feature and fourth-order feature of all the branch modules are calculated; finally, the feature mean, feature variance, third-order feature and fourth-order feature of all the branch modules are concatenated, and the network layer feature corresponding to the branch is computed with the fully connected layer in that layer's network structure, where the dimension corresponding to the network layer feature shown in formula (6) is the network layer dimension c. Illustratively, the network layer dimension c ∈ {64, 96, 128, 192}. The calculation of the network layer dimension can be realized by the following expressions:
$$\hat{h}_t = \frac{1}{b}\sum_{i=1}^{b}\hat{h}_t^{\,i}, \qquad \hat{h}_t^{\,i} = \mathrm{TDNN}_i(h)_t \qquad \text{formula (1)}$$

$$\mu = \frac{1}{T}\sum_{t=1}^{T}\hat{h}_t \qquad \text{formula (2)}$$

$$\sigma = \frac{1}{T}\sum_{t=1}^{T}\left(\hat{h}_t-\mu\right)^2 \qquad \text{formula (3)}$$

$$s = \frac{1}{T}\sum_{t=1}^{T}\left(\frac{\hat{h}_t-\mu}{\sqrt{\sigma}}\right)^3 \qquad \text{formula (4)}$$

$$k = \frac{1}{T}\sum_{t=1}^{T}\left(\frac{\hat{h}_t-\mu}{\sqrt{\sigma}}\right)^4 \qquad \text{formula (5)}$$

$$z = f([\mu, \sigma, s, k]) \qquad \text{formula (6)}$$

In the above expressions, h is the feature corresponding to the channel selection dimension of the previous layer, or, for the first layer, the voice sample data in the initial search space; b is the number of branches; TDNN_i is the time-delay neural network (TDNN) of the i-th branch in the initial search space; T is the total length of the current input voice sample data; \hat{h}_t^{\,i} is the TDNN feature of the i-th branch of this layer at time t; \hat{h}_t is the merged TDNN feature of this layer at time t; \mu and \sigma are the mean feature and variance feature of this layer's TDNN, respectively; s is the third-order feature of this layer's TDNN; k is the fourth-order feature of this layer's TDNN; z represents the network layer feature; and f is the fully connected layer in this layer's network structure, whose dimension is the network layer dimension of this layer.
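Illustratively, formulas (1) to (6) can be sketched in PyTorch as follows. This is one plausible reading of the formulas, not the patent's exact implementation: it assumes the variance of formula (3) is used directly and that the third- and fourth-order features are standardized moments.

```python
import torch
import torch.nn as nn

class StatsPool(nn.Module):
    """Merge branch features, pool mean/variance/3rd-/4th-order statistics."""
    def __init__(self, feat_dim, c):
        super().__init__()
        self.fc = nn.Linear(4 * feat_dim, c)        # f in formula (6); output dim = c

    def forward(self, branch_feats):                # list of (batch, feat_dim, T)
        h = torch.stack(branch_feats).mean(dim=0)   # formula (1): merge b branches
        mu = h.mean(dim=2)                          # formula (2): mean over time
        sigma = h.var(dim=2)                        # formula (3): variance over time
        norm = (h - mu.unsqueeze(2)) / (sigma.unsqueeze(2).sqrt() + 1e-8)
        s = norm.pow(3).mean(dim=2)                 # formula (4): third-order feature
        k = norm.pow(4).mean(dim=2)                 # formula (5): fourth-order feature
        return self.fc(torch.cat([mu, sigma, s, k], dim=1))  # formula (6): z
```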
The technical scheme provided by this embodiment can at least bring the following beneficial effects: steps 21 to 22 provide a method for calculating the number of branches and the network layer dimension. Because the number of branches and the size of the network layer dimension directly influence the complexity and learning ability of the constructed speech recognition super network, they also indirectly influence the performance of the final speech recognition model. Compared with the prior art, in which the number of branches and the network layer dimension are fixed values, the method determines a suitable number of branches and network layer dimension based on the voice sample data, thereby providing a data basis for subsequently determining a speech recognition model with better performance.
Optionally, with reference to fig. 2, the method for generating a speech recognition model further includes:
and step 21, the electronic equipment determines the number of a plurality of branches of each layer of network structure in the first voice recognition super network, and constructs a branch module of each layer of network structure according to the number of each branch.
And step 23, the electronic device performs second data processing on the channel selection layer of the branch module of each layer of the network structure to obtain the channel selection dimension of each layer of the network structure, wherein the second data processing includes matrix processing and/or normalization processing.
In some embodiments, the channel selection dimension is used to learn the discrimination capability of channel selection, i.e., to strengthen the decision on the channel features most significant for speech recognition. Illustratively, after the multiple branch numbers of each layer of the network structure are determined, for a given layer, the channel selection layer features of all branch modules corresponding to the i-th branch number among the multiple branch numbers of that layer are input into the fully connected layer, and the channel selection evaluation of those branch modules is calculated, as shown in formula (7) below; the channel selection evaluations of all branch modules are then normalized, as shown in formula (8) below, to obtain the channel selection matrix; finally, matrix processing is performed on the channel selection matrix and the TDNN feature of the i-th branch at time t to obtain the feature of this layer of the densely connected time-delay neural network, as shown in formula (9) below. The dimension corresponding to that feature is the channel selection dimension d. Illustratively, the channel selection dimension d ∈ {32, 64}. The calculation of the channel selection dimension can be realized by the following expressions:
$$t_i = f'_i(z) \qquad \text{formula (7)}$$

$$u_i = \mathrm{softmax}(t_i) \qquad \text{formula (8)}$$

$$h'_t = \sum_{i=1}^{b} u_i \odot \hat{h}_t^{\,i} \qquad \text{formula (9)}$$

In the above expressions, z is the network layer feature extracted by this layer; f'_i is the fully connected layer of this layer, whose dimension is the channel selection dimension d; t_i is the channel selection evaluation of the i-th branch; u_i is the channel selection matrix of the i-th branch; \hat{h}_t^{\,i} is the TDNN feature of the i-th branch of this layer at time t; and h'_t is the feature of this layer of the densely connected Time-Delay Neural Network (D-TDNN).
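Illustratively, formulas (7) to (9) can be sketched in PyTorch as follows. This is one plausible reading, not the patent's exact implementation: u_i is treated as a per-channel weight vector applied to the i-th branch feature, and the branch features are assumed to already have the channel selection dimension d.

```python
import torch
import torch.nn as nn

class ChannelSelect(nn.Module):
    """Score channels from the network layer feature z and reweight branches."""
    def __init__(self, c, d, num_branches):
        super().__init__()
        # f'_i in formula (7); output size is the channel selection dimension d
        self.fcs = nn.ModuleList(nn.Linear(c, d) for _ in range(num_branches))

    def forward(self, z, branch_feats):         # z: (batch, c); feats: (batch, d, T)
        out = 0.0
        for fc, h_i in zip(self.fcs, branch_feats):
            t_i = fc(z)                         # formula (7): selection evaluation
            u_i = torch.softmax(t_i, dim=1)     # formula (8): normalized weights
            out = out + u_i.unsqueeze(2) * h_i  # formula (9): reweighted combination
        return out                              # h'_t: (batch, d, T)
```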
With reference to fig. 3, voice sample data is input into the search space for searching, and the multiple branch numbers of the first layer of the first voice recognition super network are searched out; the branch modules corresponding to each branch number are then output in the first layer, and first data processing is performed on the branch modules to obtain each network layer dimension corresponding to each branch number; meanwhile, second data processing is performed on the channel selection layers of the branch modules to obtain each channel selection dimension corresponding to each branch number. Then, the plurality of data features corresponding to the plurality of channel selection dimensions of the first layer are input into the second-layer network structure to obtain the multiple branch numbers, network layer dimensions and channel selection dimensions of the second-layer network structure; these steps are repeated until the multiple branch numbers, network layer dimensions and channel selection dimensions of the last layer are obtained. In this way, the multi-layer network structure of the first voice recognition super network and all combinations of the search feature values corresponding to each layer of the network structure are obtained.
The technical scheme provided by this embodiment can at least bring the following beneficial effects: steps 21 and 23 provide a method for calculating the channel selection dimension. Since the channel selection dimension is directly related to the discrimination capability of the channel features, it indirectly affects the performance of the speech recognition model. Compared with the prior art, in which the channel selection dimension is a fixed value, the method determines a suitable channel selection dimension, providing a basis for subsequently obtaining a speech recognition model with better performance.
And step 13, the electronic equipment executes training operation on the first voice recognition super network based on the voice sample and the voice sample label to obtain a second voice recognition super network.
And the voice sample label is used as an expected identification value corresponding to the voice sample.
In some embodiments, a training operation is performed on the first speech recognition hyper-network with speech samples and speech sample labels to update parameters in the first speech recognition hyper-network, resulting in a second speech recognition hyper-network with higher speech recognition accuracy.
Optionally, with reference to fig. 4, in step 13, based on the voice sample and the voice sample tag, a training operation is performed on the first voice recognition super network to obtain a second voice recognition super network, where the training operation includes:
step 41: the electronic equipment randomly samples a plurality of different combinations of the search characteristic values corresponding to each layer of network structure in the first voice recognition super network to obtain one combination of the search characteristic values corresponding to each layer of network structure; and obtaining a first voice recognition sub-network based on a combination of the search feature values corresponding to each layer of the network structure.
Step 41 is for performing step a in the present disclosure.
In some embodiments, with reference to fig. 6A, the electronic device performs random sampling on the plurality of different combinations of search feature values corresponding to the first-layer network structure in the first voice recognition super network by using a stochastic gradient descent technique, obtaining one combination of search feature values for the first-layer network structure. That is, the plurality of nodes corresponding to the first-layer network structure in the first voice recognition super network are randomly sampled to obtain one node in the first-layer network structure. The second-layer network structure is randomly sampled by the same method, and so on until the last-layer network structure has been sampled; the sampled combinations of search feature values of all layers are then connected according to the hierarchical structure of the first voice recognition super network to obtain a first voice recognition sub-network. That is, the nodes sampled in all layers are connected according to the hierarchical structure of the first voice recognition super network to obtain the first voice recognition sub-network. Illustratively, the electronic device performs the random sampling according to a single-path uniform sampling method. Single-path uniform sampling guarantees that each operator has uniform training opportunities and reduces the coupling between weights. Illustratively, single-path uniform sampling satisfies the following expression:
$$W_A = \mathop{\arg\min}_{W}\; \mathbb{E}_{a \sim U(A)}\left[L_{\mathrm{train}}\big(N(a, W(a))\big)\right] \qquad \text{formula (10)}$$

In the above expression, A represents the initial search space, U(A) is the uniform distribution over A, W represents the weights of the super network, L_train is the training loss, and N(a, W(a)) represents the sub-network a encoded in the super network with its inherited weights W(a).
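Illustratively, single-path uniform sampling of one sub-network configuration can be sketched as follows, using the example value sets given above; the 18-layer depth is the example depth from the search-space description, and the dictionary keys are assumptions for illustration.

```python
import random

BRANCH_CHOICES = [2, 3]                  # number of branches b
LAYER_DIM_CHOICES = [64, 96, 128, 192]   # network layer dimension c
CHANNEL_DIM_CHOICES = [32, 64]           # channel selection dimension d

def sample_subnetwork(num_layers=18):
    """Draw one combination of search feature values per layer, uniformly."""
    return [
        {"branches": random.choice(BRANCH_CHOICES),
         "layer_dim": random.choice(LAYER_DIM_CHOICES),
         "channel_dim": random.choice(CHANNEL_DIM_CHOICES)}
        for _ in range(num_layers)
    ]
```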
And step 42, the electronic equipment trains the first voice recognition sub-network according to the voice samples and the voice sample labels to obtain a second voice recognition sub-network.
Step 42 is for performing step B in the present disclosure.
In some embodiments, the electronic device trains the first speech recognition sub-network with the speech samples and the speech sample labels, specifically updating the parameters in the first speech recognition sub-network, thereby obtaining the second speech recognition sub-network.
Further, referring to fig. 4, the voice sample includes a plurality of groups of voice subsamples, and step 42, in which the electronic device trains the first voice recognition sub-network according to the voice sample and the voice sample label to obtain a second voice recognition sub-network, includes:
step 421, the electronic device trains the first voice recognition sub-networks in multiple batches according to the voice samples and the voice sample labels to obtain multiple second voice recognition sub-networks.
Wherein each batch of training employs a set of speech subsamples.
In some embodiments, the speech samples include a plurality of sets of speech subsamples, and the electronic device divides training of the first speech recognition subnetwork into batches according to each set of speech subsamples in a round of training. In the first training, inputting a first group of voice subsamples into a first voice recognition sub-network to obtain recognition labels of the first group of voice subsamples; updating parameters in the first voice recognition sub-network based on the recognition labels of the first group of voice sub-samples and the voice sample labels of the first group of voice sub-samples to obtain a second voice recognition sub-network; and then carrying out a plurality of training batches on the voice subsamples of other groups in the same way until each group of voice subsamples is trained once to obtain a plurality of second voice recognition sub-networks. Specifically, when a first speech recognition subnetwork is trained by using a speech sample and a speech sample label, parameters of the speech recognition subnetwork are updated by using a cross entropy loss function, so that a plurality of second speech recognition subnetworks are obtained. Wherein the cross entropy loss function satisfies the following expression:
$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{w_{y_i}^{T} g_i}}{\sum_{c=1}^{C} e^{w_{c}^{T} g_i}} \qquad \text{formula (11)}$$

In the above expression, N is the number of voice samples, w is a learnable parameter, T denotes transposition, g_i is the output feature of the i-th voice sample, y_i is the voice sample label, and C is the number of users corresponding to all voice samples.
Exemplarily, 100000 voice sample data are acquired, each comprising a voice sample and a voice sample label, and grouped into 3125 groups of 32 voice samples each. The first group of 32 samples is input into the first voice recognition sub-network, which outputs recognition labels for the 32 voice samples; a loss is calculated from the 32 recognition labels and the voice sample labels corresponding to the 32 voice samples, and the parameters of the first voice recognition sub-network are updated according to the loss result and an error back-propagation algorithm, thereby obtaining a second voice recognition sub-network. This operation is repeated until all 3125 groups of voice samples have been used, finally yielding 3125 second voice recognition sub-networks.
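Illustratively, this batched training can be sketched as follows; the model and optimizer construction are assumed, and PyTorch's built-in cross-entropy loss is used in place of an explicit implementation of formula (11).

```python
import torch.nn as nn

def train_one_round(subnet, loader, optimizer):
    """One pass over 3125 batches of 32 samples; each step yields an updated
    (i.e. 'second') speech recognition sub-network."""
    criterion = nn.CrossEntropyLoss()           # cross entropy per formula (11)
    for samples, labels in loader:              # loader yields batches of 32
        logits = subnet(samples)                # recognition label scores
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()                         # error back-propagation
        optimizer.step()                        # update sub-network parameters
```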
The technical scheme provided by this embodiment can at least bring the following beneficial effects: in step 421, the electronic device batches the voice samples, trains the first voice recognition sub-network with one batch of data at a time, and obtains second voice recognition sub-networks with better performance through multiple batches of training.
Step 43, the electronic device synchronizes the parameters in the second speech recognition sub-network into the first speech recognition super network.
Step 43 is for performing step C in the present disclosure.
In some embodiments, after the second speech recognition sub-network is trained, the second speech recognition sub-network automatically shares its parameters with the first speech recognition super network based on the parameter value sharing rule of the sub-networks, so that the first speech recognition super network with updated parameters is obtained.
And step 44, the electronic equipment iteratively executes the steps A to C to obtain a second voice recognition hyper-network.
In some embodiments, after the electronic device obtains a first speech recognition sub-network through step a, the electronic device performs multiple batches of training on the first speech recognition sub-network through step B to obtain a plurality of second speech recognition sub-networks, then updates parameters of the plurality of second speech recognition sub-networks into the first speech recognition super-network based on step C, and finally, in the first speech recognition super-network after the parameters are updated, performs steps a to C again, and iterates multiple times of such operations to obtain a second speech recognition super-network.
Illustratively, a first speech recognition sub-network is obtained by random sampling in the first speech recognition super network; based on the 100000 voice sample data, the first speech recognition sub-network is trained 3125 times with batches of 32 voice samples, yielding 3125 second speech recognition sub-networks; the parameters of the 3125 second speech recognition sub-networks are synchronized into the first speech recognition super network; random sampling then continues in the first speech recognition super network with updated parameters. This is repeated for 30 rounds, for 93750 training steps in total. In this way, the parameters in the first voice recognition super network can be fully updated, finally yielding the second voice recognition super network.
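Illustratively, the iterated steps A to C can be sketched as follows, reusing the sampling and training sketches above. Here build_subnet is a hypothetical helper that materializes a sampled configuration over weights shared with the super network, so step C (parameter synchronization) happens implicitly through weight sharing.

```python
def train_supernet(supernet, loader, optimizer, build_subnet, rounds=30):
    """rounds=30 passes of 3125 batches each, i.e. 93750 training steps."""
    for _ in range(rounds):
        config = sample_subnetwork()                # step A: random sampling
        subnet = build_subnet(supernet, config)     # hypothetical: shares weights
        train_one_round(subnet, loader, optimizer)  # step B: batched training
        # step C: shared parameters are already synchronized to the supernet
```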
The technical scheme provided by the embodiment can at least bring the following beneficial effects: steps 41 to 44 provide a method for training the first speech recognition super network. Random sampling is performed in the first speech recognition super network; the sampled first speech recognition sub-network is trained to obtain a second speech recognition sub-network; and the parameters of the second speech recognition sub-network are then updated into the first speech recognition super network. Through multiple iterations, the second speech recognition super network obtained after training performs better, and a speech recognition model with stronger recognition capability is obtained based on it.
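Illustratively, the iteration of steps A to C can be sketched as follows (Python; build_subnet, train_one_round and sync_weights are hypothetical helpers standing for the weight-sharing sub-network construction of step A, the batch training of step B, and the parameter synchronization of step C):

    import random

    def train_super_network(layer_choices, build_subnet, train_one_round,
                            sync_weights, rounds=30):
        # layer_choices[i] holds the candidate search-feature combinations
        # of layer i of the super network.
        for _ in range(rounds):
            arch = [random.choice(c) for c in layer_choices]  # step A: sample
            subnet = build_subnet(arch)    # shares weights with the super net
            train_one_round(subnet)        # step B: e.g. 3125 batch updates
            sync_weights(subnet)           # step C: write parameters back
        # after all rounds the super network is the "second" super network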
And step 14, the electronic equipment carries out network search on the second voice recognition super network to obtain a target voice recognition sub network.
The target voice recognition sub-network comprises a multi-layer network structure, and each layer of network structure corresponds to one combination of search characteristic values.
In some embodiments, after the second speech recognition super-network is obtained, a network search is performed on the second speech recognition super-network to obtain the target speech recognition sub-network.
Optionally, with reference to fig. 1-2, as shown in fig. 5A, step 14 performs a network search on the second speech recognition super network to obtain a target speech recognition sub-network, including:
And step 51, the electronic device performs sampling processing on the second voice recognition super network multiple times to obtain a plurality of third voice recognition sub-networks.
The sampling processing comprises randomly sampling a plurality of different combinations of the values of the corresponding search features of each layer of network structure in the second voice recognition super network, and obtaining a third voice recognition sub-network according to one combination of the values of the corresponding search features of each layer of network structure obtained by random sampling.
In some embodiments, in combination with step 41, the electronic device randomly selects, from the plurality of different combinations of search feature values corresponding to each layer of the network structure in the second speech recognition super network, one combination per layer; combines the selected combinations of all layers according to the hierarchical structure of the network to form a third speech recognition sub-network; and repeats these operations multiple times to obtain a plurality of third speech recognition sub-networks.
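Illustratively, the sampling of step 51 can be sketched as follows (Python; the candidate value sets in SEARCH_SPACE are assumed example values, not those of the disclosure):

    import random

    SEARCH_SPACE = {
        "branches":  [1, 2, 3],        # number of branches per layer
        "layer_dim": [128, 256, 512],  # network layer dimension
        "channels":  [64, 128, 256],   # channel selection dimension
    }

    def sample_third_subnetwork(num_layers):
        # one combination of search feature values for every layer
        return [{k: random.choice(v) for k, v in SEARCH_SPACE.items()}
                for _ in range(num_layers)]

    # e.g. 64 candidate third speech recognition sub-networks
    candidates = [sample_third_subnetwork(num_layers=6) for _ in range(64)]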
Step 52, the electronic device determines error rates for a plurality of third speech recognition subnets.
Specifically, the error rates of the plurality of third speech recognition sub-networks may be calculated by an error rate prediction model. After the error rate of each third speech recognition sub-network is obtained, the third speech recognition sub-networks are screened according to their error rates, providing a reference basis for subsequently determining the target speech recognition sub-network.
Optionally, with reference to fig. 1-2, as shown in fig. 5A, step 52, in which the electronic device determines the error rates of the plurality of third speech recognition sub-networks, includes:
step 521, the electronic device determines initial error rates for a plurality of third speech recognition subnetworks.
In some embodiments, in conjunction with step 52, an initial error rate for the plurality of third speech recognition subnetworks is calculated based on the error rate prediction model.
Optionally, as shown in fig. 5B, the step 521 the electronic device determining the initial error rates of the plurality of third speech recognition subnetworks includes:
Step 5211, the electronic device inputs the combination of search feature values corresponding to each layer of the network structure of the plurality of third voice recognition sub-networks into the error rate prediction model to obtain the initial error rates of the plurality of third voice recognition sub-networks.
Wherein the error rate is used to characterize the ability of the third speech recognition subnetwork to recognize speech samples.
In some embodiments, the second speech recognition super network is sampled multiple times to obtain a plurality of third speech recognition sub-networks, and the combination of search feature values corresponding to each layer of the network structure in each third speech recognition sub-network is then input into the error rate prediction model to obtain the initial error rate of each third speech recognition sub-network. The error rate prediction model may be one of a variety of regression models. Illustratively, the electronic device randomly samples 64 times in the second speech recognition super network to obtain 64 third speech recognition sub-networks, then inputs the combination of search feature values corresponding to each layer of the network structure of the 64 sampled sub-networks into a Gaussian regression model, and the Gaussian regression model outputs the error rate of each third speech recognition sub-network; these error rates serve as a reference for subsequently searching out the optimal network. The Gaussian regression process determines the equal error rate (EER) using maximum log-likelihood estimation, and the Gaussian regression model satisfies the following expression:
p | a ~ N(μ, K),    EER | p, σ² ~ N(p, σ²I)    formula (12)
In the above expression, p is a latent variable, a denotes a trained speech recognition sub-network, N denotes the Gaussian distribution, K is the Hamming kernel function, and EER is the equal error rate of recognizing the corresponding speaker.
When the Gaussian regression model is used to predict the error rates of the third speech recognition sub-networks, it first needs to be trained. The training process is as follows: the electronic device obtains an initial network structure and a preset optimization function; determines a new network structure a1 based on the preset optimization function and the initial network structure; evaluates the error rate of the new network structure; repeats this for multiple rounds to obtain a plurality of new network structures ai and their error rates; and finally updates the parameters of the initial Gaussian regression model based on the new network structures ai and their error rates, obtaining a Gaussian regression model that can be used to evaluate the error rate of a third speech recognition sub-network.
The technical scheme provided by the embodiment can at least bring the following beneficial effects: after the plurality of third speech recognition sub-networks are obtained, they need to be screened by performance. The performance of each third speech recognition sub-network is estimated by the Gaussian regression model, and screening is carried out according to the result. Compared with other regression models, the Gaussian regression model gives a more accurate estimate, so the finally screened target speech recognition sub-network also performs better.
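Illustratively, the Gaussian regression predictor of formula (12) can be sketched as follows (Python/NumPy; the integer architecture encodings, the kernel decay rho and the noise level are assumptions made for the sketch):

    import numpy as np

    def hamming_kernel(A, B, rho=0.9):
        # K[i, j] = rho ** (number of positions where the two architecture
        # encodings differ); identical encodings give correlation 1.
        diff = (A[:, None, :] != B[None, :, :]).sum(axis=2)
        return rho ** diff

    def gp_predict_eer(arch_codes, eers, query, noise=1e-3):
        # Standard Gaussian-process posterior mean and variance over EERs.
        K = hamming_kernel(arch_codes, arch_codes) + noise * np.eye(len(eers))
        k_star = hamming_kernel(query, arch_codes)
        mean = k_star @ np.linalg.solve(K, eers)
        var = 1.0 - np.einsum('ij,ji->i', k_star,
                              np.linalg.solve(K, k_star.T))
        return mean, var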
Step 522, the electronic device adjusts parameters in the plurality of third speech recognition sub-networks based on the initial error rate to obtain a plurality of optimized third speech recognition sub-networks.
In some embodiments, after the initial error rates of the plurality of third speech recognition sub-networks are obtained, if an initial error rate is too high, that is, exceeds a preset threshold, the parameters in the plurality of third speech recognition sub-networks need to be adjusted, so as to obtain optimized third speech recognition sub-networks. There are various existing parameter-tuning methods, such as Grid Search, Random Search, and Bayesian Optimization.
Among them, grid search is the most common: for each parameter, several candidate values are determined, and all combinations of parameter values are then traversed like a grid. Its advantage is simplicity; if the full traversal is feasible, the results are reliable. Its disadvantage is that it is too time-consuming; for neural networks in particular, it is usually impossible to try every parameter combination. Random search is more efficient than grid search: in practice, all candidate parameters are generated by the grid-search method, and training is then performed on candidates selected at random. Bayesian optimization, which takes into account the experimental results of previously tried parameters, saves further time.
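Illustratively, the difference between grid search and random search can be sketched as follows (Python; param_grid and the evaluate callable, which returns an error rate for one parameter combination, are assumptions made for the sketch):

    import itertools
    import random

    param_grid = {"lr": [0.1, 0.01, 0.001], "momentum": [0.8, 0.9]}

    def all_combinations():
        keys = list(param_grid)
        return [dict(zip(keys, vals))
                for vals in itertools.product(*param_grid.values())]

    def grid_search(evaluate):
        # exhaustive: reliable but slow when the grid is large
        return min(all_combinations(), key=evaluate)

    def random_search(evaluate, trials=3):
        # try only a random subset of the same candidates
        return min(random.sample(all_combinations(), trials), key=evaluate)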
Optionally, as shown in fig. 5B, in step 522, based on the initial error rate, parameters in a plurality of third speech recognition sub-networks are adjusted, so as to obtain a plurality of optimized third speech recognition sub-networks, including:
5221, the electronic device performs parameter tuning on the plurality of third voice recognition sub-networks to obtain a plurality of optimized third voice recognition sub-networks.
Wherein the parameter-tuning operation comprises: determining the update directions of the parameters in the third voice recognition sub-networks according to the initial error rates of the third voice recognition sub-networks and a preset acquisition function; and updating the parameters in the third voice recognition sub-networks according to the update directions.
In some embodiments, in combination with step 522, there are multiple ways to tune the parameters of a network model, and Bayesian optimization is comparatively efficient. With reference to fig. 6B, after the initial error rates of the plurality of third speech recognition sub-networks are obtained, the adjustment direction of the next parameter can be determined with reference to the preset acquisition function of Bayesian optimization; the next parameter value is then determined according to that direction, and multiple rounds of adjustment are performed based on its performance, thereby determining the final parameters of the network model. For example, the acquisition function may be a probability-of-feasibility (PoF) function, and the preset number of optimization rounds may be 100.
The technical scheme provided by the embodiment can at least bring the following beneficial effects: the present disclosure provides a parameter-tuning method based on Bayesian optimization, in which the update directions of the parameters in the third voice recognition sub-networks are determined according to their initial error rates and a preset acquisition function, and the parameters are adjusted accordingly. Because each Bayesian-optimization adjustment builds on the previous tuning result, the parameters can be tuned quickly.
Optionally, as shown in fig. 5B, the step 5221 of determining the update directions of the parameters in the plurality of third speech recognition sub-networks according to the initial error rates of the plurality of third speech recognition sub-networks and the preset acquisition function includes:
step 52211, the electronic device processes the target parameter pair using a multivariate gaussian distribution function such that the target parameter pair follows the multivariate gaussian distribution.
Wherein the target parameter pair comprises a value of the target parameter and an initial error rate corresponding to the target parameter, the target parameter being each parameter in each of the plurality of third speech recognition subnetworks.
In some embodiments, to tune the parameters in the third voice recognition sub-networks, the electronic device first needs to obtain a plurality of target parameter pairs for each target parameter, and make these target parameter pairs obey a multivariate Gaussian distribution.
Step 52212, the electronic device, in combination with the multivariate gaussian distribution function, obtains the update direction of the target parameter under the condition of maximizing the preset collection function.
In some embodiments, with the plurality of target parameter pairs obeying the multivariate Gaussian distribution function, the electronic device determines the update direction of the target parameter under the condition that the preset acquisition function is maximized, combining the value of the target parameter and the corresponding initial error rate in each target parameter pair, and determines the next sampling point of the target parameter based on this update direction.
The technical scheme provided by the embodiment can at least bring the following beneficial effects: steps 52211-52212 provide a method for determining the update direction of a parameter, in which the plurality of target parameter pairs provide the basis for the adjustment direction, so the adjustment is fast and more accurate.
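Illustratively, the update-direction selection of steps 52211-52212 can be sketched as follows (Python/NumPy, reusing gp_predict_eer from the earlier sketch; expected improvement is used here as a stand-in for the disclosure's acquisition function, which is an assumption):

    import math
    import numpy as np

    def expected_improvement(mean, var, best_eer):
        # how much each candidate is expected to improve on the best EER so far
        std = np.sqrt(np.maximum(var, 1e-12))
        z = (best_eer - mean) / std
        cdf = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2))) for v in z])
        pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
        return (best_eer - mean) * cdf + std * pdf

    def next_parameter(candidates, tried_codes, tried_eers):
        # the move from the current value toward the returned candidate is
        # the "update direction" of the target parameter
        mean, var = gp_predict_eer(tried_codes, tried_eers, candidates)
        scores = expected_improvement(mean, var, tried_eers.min())
        return candidates[int(np.argmax(scores))]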
Step 523, the electronic device determines an error rate of the optimized third speech recognition sub-network.
In some embodiments, in conjunction with step 5211, the electronic device determines an error rate of the optimized third speech recognition subnetwork based on the error rate prediction model.
And step 53, the electronic equipment determines the third voice recognition sub-network with the error rate meeting the preset condition as the target voice recognition sub-network in the plurality of third voice recognition sub-networks.
In some embodiments, the electronic device ranks the error rates of the plurality of third speech recognition sub-networks after deriving them, and the third speech recognition sub-network whose error rate meets the preset condition is determined as the final target speech recognition sub-network. The preset condition may be that the error rate ranks lowest, or that the error rate is less than or equal to a target threshold. The present disclosure does not limit the preset condition, as long as an optimal target speech recognition sub-network is selected.
The technical scheme provided by the embodiment can at least bring the following beneficial effects: as shown in steps 51 to 53, the second voice recognition super network is sampled multiple times to obtain a plurality of third voice recognition sub-networks, and a target voice recognition sub-network meeting the requirement is selected based on their error rates. This implementation is not only easy to operate; because the screening is based on the error rates of the third voice recognition sub-networks, the selected target voice recognition sub-network also has a lower error rate.
Optionally, as shown in fig. 5A, in the case of steps 521-523, the step 53 of determining, among the plurality of third speech recognition sub-networks, the third speech recognition sub-network whose error rate satisfies the preset condition as the target speech recognition sub-network includes:
and 531, determining the optimized third voice recognition sub-networks with the error rates meeting preset conditions in the optimized third voice recognition sub-networks as target voice recognition sub-networks.
In some embodiments, in combination with step 53, if the third speech recognition subnetwork is optimized, a target speech recognition subnetwork corresponding to an error rate meeting a preset requirement needs to be selected from the error rates corresponding to the optimized third speech recognition subnetwork.
The technical scheme provided by the embodiment can at least bring the following beneficial effects: as can be seen from the above, after the error rates of the third speech recognition sub-networks are obtained, the third speech recognition sub-networks are optimized according to these error rates; the optimized third speech recognition sub-networks have lower error rates, and the target speech recognition sub-network selected from them therefore has a lower error rate as well.
And step 15, retraining the target voice recognition sub-network by the electronic equipment to obtain a voice recognition model.
In some embodiments, after the target speech recognition subnetwork is obtained, the target speech recognition subnetwork is retrained to obtain a final speech recognition model.
The technical scheme provided by the embodiment can at least bring the following beneficial effects: the constructed first voice recognition super network comprises a multi-layer network structure, each layer of which corresponds to a plurality of different combinations of search feature values, the search features comprising the number of branches, the network layer dimension and the channel selection dimension. Based on the voice sample data, the first voice recognition super network is trained to obtain the second voice recognition super network; a network search is performed on the second voice recognition super network to obtain the target voice recognition sub-network; and the target voice recognition sub-network is retrained to obtain the speech recognition model.

The number of branches represents the number of parallel sub-networks in each layer of the network structure, and different numbers of parallel sub-networks correspond to different data processing capacities. The network layer dimension determines the complexity of the learned features and directly influences the learning difficulty. The channel selection dimension determines the learning depth of each network layer, so that the network structure can be expressed more accurately. The number of branches, the network layer dimension and the channel selection dimension therefore not only affect the complexity of the network structure but are also directly related to its performance.

According to the present disclosure, a first voice recognition super network with lower complexity and better performance is established by determining suitable, performance-related values for the number of branches, the network layer dimension and the channel selection dimension, and a speech recognition model with better performance can then be obtained from the corresponding second voice recognition super network through network search. Moreover, the specific values of the number of branches, the network layer dimension and the channel selection dimension are suitable values determined from a plurality of possible values rather than fixed ones, so a network structure with better performance can be obtained from the many possible network structures corresponding to those values. Compared with the prior art, in which each layer of the network structure is a model with fixed parameters, the network structure design of the present disclosure is more reasonable and brings a better voice recognition effect.
Optionally, as shown in fig. 7, step 15 retrains the target speech recognition subnetwork to obtain a speech recognition model, including:
and step 71, the electronic equipment inputs the voice sample into the target voice recognition sub-network to obtain the recognition label of the voice sample.
Step 72, the electronic device determines a loss value between the recognition label and the voice sample label based on a loss function.
In some embodiments, the electronic device may calculate the difference (i.e., the loss value) between the recognition label of each voice sample and the voice sample label of that sample. Illustratively, the loss function is a hybrid loss based on an additive angular margin loss and minimum hyperspherical energy. Specifically, the loss function can be calculated by the following expression:
L = -(1/N) Σᵢ log( e^(s·cos(θ_{yi} + m)) / ( e^(s·cos(θ_{yi} + m)) + Σ_{j≠yi} e^(s·cos θⱼ) ) ) + λ · Σ_{i<j} 1 / ||wᵢ − wⱼ||
in the above expression, s is generally set to 30, m is generally set to 0.2, λ is set to 0.01, θ is the angle between the learning parameter w and the output feature g, gi is the output feature of the voice sample data, yi is the voice sample label, N is the number of voice samples, and C is the number of users corresponding to all voice samples.
And 73, the electronic equipment iteratively updates the parameters of the target speech recognition sub-network according to the loss value to obtain a speech recognition model.
In some embodiments, the electronic device obtains a loss value according to the difference between the recognition label of a voice sample and the corresponding voice sample label, and adjusts the parameters of the target voice recognition sub-network according to the loss value; the adjusted target voice recognition sub-network is then trained with the recognition label and voice sample label of the next voice sample, until the loss value is smaller than a preset threshold, yielding the final speech recognition model. With reference to fig. 6C, the electronic device creates a voice recognition super network according to the voice sample data, then obtains a target voice recognition sub-network based on the voice recognition super network, and finally retrains the target voice recognition sub-network to obtain the speech recognition model.
The technical scheme provided by the embodiment can at least bring the following beneficial effects: steps 71-73 provide a method for training the target speech recognition sub-network into a speech recognition model based on the hybrid loss of additive angular margin loss and minimum hyperspherical energy; with this loss function, the target speech recognition sub-network can be trained more thoroughly, yielding a speech recognition model with a lower error rate.
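Illustratively, the hybrid loss of step 72 can be sketched as follows (Python/NumPy; the exact form of the minimum hyperspherical energy term is rendered as an image in the source, so the inverse-distance energy used here is an assumption):

    import numpy as np

    def aam_mhe_loss(g, y, W, s=30.0, m=0.2, lam=0.01):
        # g: (N, d) output features, y: (N,) voice sample labels,
        # W: (C, d) learning parameters, one row per user.
        g_n = g / np.linalg.norm(g, axis=1, keepdims=True)
        W_n = W / np.linalg.norm(W, axis=1, keepdims=True)
        cos = g_n @ W_n.T                               # cos(theta), shape (N, C)
        idx = np.arange(len(y))
        theta_y = np.arccos(np.clip(cos[idx, y], -1.0, 1.0))
        logits = s * cos
        logits[idx, y] = s * np.cos(theta_y + m)        # additive angular margin
        logits -= logits.max(axis=1, keepdims=True)     # numerical stability
        log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        aam = -log_p[idx, y].mean()
        # minimum hyperspherical energy: push the user weights apart
        dist = np.linalg.norm(W_n[:, None, :] - W_n[None, :, :], axis=2)
        iu = np.triu_indices(len(W), k=1)
        mhe = (1.0 / (dist[iu] + 1e-6)).mean()
        return aam + lam * mhe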
Illustratively, using the above method for generating a speech recognition model, the 3 different speech recognition models shown in Table 1 were trained, each layer of the network structure in each speech recognition model corresponding to one combination of search feature values.
TABLE 1
(Table 1 is reproduced as an image in the source; for each of the 3 searched speech recognition models it lists the combination of search feature values, i.e. the number of branches, network layer dimension and channel selection dimension, of each layer of the network structure.)
The 3 speech recognition models in Table 1 are compared with other existing speech recognition models (e.g., Dual attention, ARET-25, Fast ResNet-34, TDNN, E-TDNN, F-TDNN, D-TDNN-SS 512, D-TDNN-SS 128 and AutoSpeech). As can be seen from Table 2 in combination with FIG. 8, the speech recognition models searched by the present disclosure (SpeechNAS-1, SpeechNAS-2, SpeechNAS-3, SpeechNAS-4 and SpeechNAS-5) achieve higher accuracy with fewer parameters. With 3.1M parameters, the speech recognition model of the present disclosure achieves a lower error rate (e.g., 1.14% vs. 1.22%); with slightly more parameters (4.3M), it reduces the error rate further (e.g., 1.02% vs. 1.22%). Meanwhile, the speech recognition model of the present disclosure has very good generalization performance and still obtains very good accuracy on other data sets.
TABLE 2
(Table 2 is reproduced as an image in the source; it compares the parameter counts and error rates of the SpeechNAS models with those of the existing models listed above.)
Optionally, as shown in fig. 9, the method for generating a speech recognition model further includes:
step 91, the electronic device obtains a voice signal to be recognized.
And step 92, the electronic device inputs the speech signal to be recognized into the speech recognition model to obtain the voice tag corresponding to the speech signal.
In some embodiments, after the speech recognition model is obtained by the above generation method, it may be used for recognizing speech signals; specifically, the speech signal to be recognized is input into the speech recognition model, and the speech recognition model directly outputs the voice tag corresponding to the speech signal to be recognized.
The technical scheme provided by the embodiment can at least bring the following beneficial effects: steps 91-92 provide a method for performing speech recognition on a speech signal to be recognized based on the speech recognition model. Because the speech recognition model is created with a suitable number of branches, network layer dimension and channel selection dimension, which are directly related to the recognition error rate, the results recognized by the speech recognition model of the present disclosure have a lower error rate.
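Illustratively, steps 91-92 can be sketched as follows (Python/NumPy; extract_features and model_forward are hypothetical stand-ins for the feature front end and the trained speech recognition model):

    import numpy as np

    def recognize(model_forward, extract_features, signal):
        feats = extract_features(signal)   # step 91: signal to be recognized
        scores = model_forward(feats)      # step 92: one score per user
        return int(np.argmax(scores))      # the voice tag of the signal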
Illustratively, the present disclosure may be implemented by the following pseudocode:
Algorithm: the SpeechNAS algorithm
Input: dataset D = Dtrain = {(Xi, yi) | i = 1, …, n}, search space A, the number of epochs n1 and the number of candidates n2 in BO, and the training hyper-parameters // Input: data set D = {voice sample data Xi, voice sample label yi}, initial search space, number of Bayesian optimization rounds and candidate networks, and training hyper-parameters
Output: optimal architecture a with low EER on the evaluation set // Output: optimal, well-performing network structure a on the verification set
// Super-network training
Construct a super network based on the search space A; // construct a super network according to the search space
Train the super network using equation (5) and the loss function in equation (4) with SGD; // train the super network with stochastic gradient descent, cross-entropy loss, and single-path uniform sampling
// Architecture search
Randomly explore n2 candidates a0; // randomly sample candidate networks a0
Evaluate the EERs of a0 with weight sharing; // evaluate the error rates of a0 using the shared weights
Add a0 and the EERs into queue Q; // add a0 and the error rates to queue Q
Learn the GP based on equation (7); // learn the Gaussian regression model
For i = 1, 2, …, n1 do
    Select a new architecture ai by maximizing the acquisition function α: ai = argmax_a α(a; Q); // acquire a new network architecture ai based on the acquisition function
    Evaluate the EER of ai; // evaluate the error rate of ai
    Append ai and its EER to queue Q; // append ai and its error rate to queue Q
End
// Target speech recognition sub-network retraining
For each selected architecture a do
    Train network a with SGD and equation (6); // train sub-network a using stochastic gradient descent
    Save the best trained model and evaluate its EER; // save the trained model and evaluate its error rate
End
Return the optimal architecture a with the lowest EER and the trained model. // return the optimal structure a with the lower error rate and the trained model
The method provided by the embodiments of the present disclosure is described in detail above with reference to figs. 1 to 9. To realize the above functions, the speech recognition model generation apparatus includes hardware structures and/or software modules for performing the respective functions. Those of skill in the art will readily appreciate that the present disclosure can be implemented in hardware, or in a combination of hardware and computer software, for implementing the exemplary algorithm steps described in connection with the embodiments disclosed herein. Whether a function is performed by hardware or by computer software driving hardware depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The speech recognition model generation device may be divided into functional modules according to the above method examples, for example, the speech recognition model generation device may divide each functional module corresponding to each function, or may integrate two or more functions into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, the division of the modules in the embodiments of the present disclosure is illustrative, and is only one division of logic functions, and there may be another division in actual implementation.
Hereinafter, a speech recognition model generation apparatus according to an embodiment of the present disclosure will be described in detail with reference to fig. 10. It should be understood that the description of the apparatus embodiments corresponds to the description of the method embodiments; therefore, for brevity, details not described here may be found in the above method embodiments.
Fig. 10 is a block diagram illustrating a logical structure of a speech recognition model generation apparatus according to an exemplary embodiment. Referring to fig. 10, the speech recognition model generation apparatus includes: an acquisition module 1010 and a processing module 1020. An obtaining module 1010 configured to obtain voice sample data; the voice sample data comprises a voice sample and a voice sample label; for example, in conjunction with fig. 1, the obtaining module 1010 may be configured to perform step 11. A processing module 1020 configured to construct a first speech recognition super-network, the first speech recognition super-network comprising a plurality of layers of network structures, each layer of network structures corresponding to a plurality of different combinations of search feature values, the search features comprising a number of branches, a network layer dimension, and a channel selection dimension; for example, in conjunction with fig. 1, processing module 1020 may be used to perform step 12. A processing module 1020 further configured to perform a training operation on the first speech recognition super-network based on the speech sample and the speech sample label, resulting in a second speech recognition super-network; the voice sample label is used as an expected identification value corresponding to the voice sample; for example, in conjunction with fig. 1, processing module 1020 may be used to perform step 13. A processing module 1020 further configured to perform a network search on the second speech recognition super-network to obtain a target speech recognition sub-network; the target voice recognition sub-network comprises a multi-layer network structure, and each layer of network structure corresponds to a combination of search characteristic values; for example, in conjunction with fig. 1, processing module 1020 may be used to perform step 14. The processing module 1020 is further configured to retrain the target speech recognition subnetwork to obtain the speech recognition model. For example, in conjunction with fig. 1, processing module 1020 may be used to perform step 15.
Optionally, the processing module 1020 is further configured to determine a number of multiple branches of each layer of the network structure in the first voice recognition super network, and construct a branch module of each layer of the network structure according to the number of each branch; for example, in conjunction with fig. 2, processing module 1020 may be used to perform step 21. The processing module 1020 is further configured to perform a first data processing on the characteristics of the branch modules of each layer of the network structure, so as to obtain a plurality of network layer dimensions of each layer of the network structure, where the first data processing includes at least one of merging processing, multi-order processing, or splicing processing. For example, in conjunction with fig. 2, processing module 1020 may be used to perform step 22.
Optionally, the processing module 1020 is further configured to determine the number of multiple branches of each layer of network structure in the first voice recognition super network, and construct multiple branch modules of each layer of network structure according to the number of each branch; for example, in conjunction with fig. 2, processing module 1020 may be used to perform step 21. The processing module 1020 is further configured to perform a second data processing on the channel selection layer of the branch module of each layer of the network structure, so as to obtain a plurality of channel selection dimensions of each layer of the network structure, where the second data processing includes full connection processing and/or matrix processing. For example, in conjunction with fig. 2, processing module 1020 may be used to perform step 23.
Optionally, the processing module 1020 is further configured to perform step a, in which multiple different combinations of the search feature values corresponding to each layer of the network structure in the first speech recognition super network are randomly sampled to obtain one combination of the search feature values corresponding to each layer of the network structure; obtaining a first voice recognition sub-network based on a combination of the corresponding search feature values of each layer of network structure; for example, in conjunction with fig. 4, processing module 1020 may be used to perform step 41. The processing module 1020 is further configured to perform step B of training the first speech recognition sub-network according to the speech samples and the speech sample labels, resulting in a second speech recognition sub-network; for example, in conjunction with fig. 4, processing module 1020 may be used to perform step 42. A processing module 1020 further configured to perform step C of synchronizing parameters in the second speech recognition subnetwork into the first speech recognition subnetwork; for example, in conjunction with fig. 4, processing module 1020 may be used to perform step 43. A processing module 1020 further configured to perform iterative execution of steps a-C to obtain a second speech recognition hyper-network; for example, in conjunction with fig. 4, processing module 1020 may be used to perform step 44.
Optionally, the processing module 1020 is further configured to train the first speech recognition sub-networks in multiple batches according to the speech samples and the speech sample labels, so as to obtain multiple second speech recognition sub-networks; wherein each batch of training employs a set of speech subsamples. For example, in conjunction with fig. 1, processing module 1020 may be used to perform step 421.
Optionally, the processing module 1020 is further configured to perform sampling processing on the second voice recognition super network for multiple times to obtain multiple third voice recognition subnetworks, where the sampling processing includes performing random sampling on multiple different combinations of the search feature values corresponding to each layer of the network structure in the second voice recognition super network, and obtaining the third voice recognition subnetwork according to one combination of the search feature values corresponding to each layer of the network structure obtained by the random sampling; for example, in conjunction with fig. 5A, processing module 1020 may be used to perform step 51. A processing module 1020 further configured to determine error rates for a plurality of third speech recognition subnetworks; for example, in conjunction with fig. 5A, processing module 1020 may be used to perform step 52. The processing module 1020 is further configured to determine, as the target speech recognition subnetwork, a third speech recognition subnetwork of the plurality of third speech recognition subnetworks whose error rate satisfies a preset condition. For example, in conjunction with fig. 5A, processing module 1020 may be used to perform step 53.
Optionally, the processing module 1020 is further configured to determine an initial error rate of a plurality of third voice recognition subnetworks; for example, in connection with 5A, processing module 1020 may be used to perform step 521. A processing module 1020 further configured to adjust parameters in the plurality of third speech recognition sub-networks based on the initial error rate to obtain a plurality of optimized third speech recognition sub-networks; for example, in conjunction with fig. 5A, processing module 1020 may be used to perform step 522. A processing module 1020 further configured to determine an error rate of the optimized third speech recognition subnetwork; for example, in conjunction with fig. 5A, processing module 1020 may be configured to perform step 523. The processing module 1020 is further configured to determine, as the target speech recognition subnetwork, a plurality of optimized third speech recognition subnetworks of the plurality of optimized third speech recognition subnetworks, which have an error rate satisfying a preset condition. For example, in conjunction with fig. 5A, the processing module 1020 may be used to perform step 531.
Optionally, the processing module 1020 is further configured to input a combination of the search feature values corresponding to each layer of the network structures of the plurality of third speech recognition sub-networks into the error rate prediction model, so as to obtain initial error rates of the plurality of third speech recognition sub-networks, where the initial error rates are used to characterize the capability of the third speech recognition sub-networks to recognize the speech samples. For example, in conjunction with fig. 5B, processing module 1020 may be used to perform step 5211.
Optionally, the processing module 1020 is further configured to perform parameter adjustment operation on the plurality of third speech recognition sub-networks to obtain a plurality of optimized third speech recognition sub-networks, where the parameter adjustment operation includes: determining updating directions of parameters in the third voice recognition sub-networks according to the initial error rates of the third voice recognition sub-networks and a preset collection function; the parameters in the plurality of third speech recognition subnetworks are updated according to the update direction. For example, in conjunction with fig. 5B, processing module 1020 may be used to perform step 5221.
Optionally, the processing module 1020 is further configured to process the target parameter pair by using a multivariate gaussian distribution function, so that the target parameter pair follows the multivariate gaussian distribution; the target parameter pair comprises a value of the target parameter and an initial error rate corresponding to the target parameter, the target parameter being each parameter in each of the plurality of third speech recognition sub-networks; for example, in conjunction with fig. 5B, processing module 1020 may be used to perform step 52211. The processing module 1020 is further configured to derive an update direction of the value of the target parameter in combination with the multivariate gaussian distribution function in case the preset acquisition function is maximized. For example, in conjunction with fig. 5B, processing module 1020 may be used to perform step 52212.
Optionally, the processing module 1020 is further configured to input the voice sample into the target voice recognition subnetwork, so as to obtain a recognition tag of the voice sample; for example, in conjunction with fig. 7, processing module 1020 may be used to perform step 71. A processing module 1020 further configured to determine a loss value between the identification tag and the voice sample tag based on a loss function; for example, in conjunction with fig. 7, processing module 1020 may be used to perform step 72. The processing module 1020 is further configured to iteratively update parameters of the target speech recognition subnetwork based on the loss value to obtain a speech recognition model. For example, in conjunction with fig. 7, processing module 1020 may be used to perform step 73.
Optionally, the processing module 1020 is further configured to obtain a speech signal to be recognized; for example, in conjunction with fig. 9, the processing module 1020 may be used to perform step 91. The processing module 1020 is further configured to input the speech signal to be recognized into the speech recognition model generated as described above, to obtain the voice sample label corresponding to the speech signal to be recognized. For example, in conjunction with fig. 9, the processing module 1020 may be used to perform step 92.
Of course, the speech recognition model generation apparatus provided in the embodiments of the present disclosure includes, but is not limited to, the above modules; for example, it may further include a storage module 1030. The storage module 1030 may be configured to store the program code of the speech recognition model generation apparatus, and may also be configured to store data generated by the speech recognition model generation apparatus during operation.
Fig. 11 shows a schematic diagram of a possible structure of the electronic device involved in the above embodiment. As shown in fig. 11, the electronic device 110 includes a processor 1101 and a memory 1102.
It is understood that the electronic device 110 shown in FIG. 11 can implement all of the functions of the speech recognition model generation method described above. The functions of the respective modules in the speech recognition model generation apparatus described above may be implemented in the processor 1101 of the electronic device 110. The storage module of the speech recognition model generation apparatus corresponds to the memory 1102 of the electronic device 110.
Processor 1101 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1101 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
Memory 1102 may include one or more computer-readable storage media, which may be non-transitory. Memory 1102 can also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in memory 1102 is used to store at least one instruction for execution by processor 1101 to implement the speech recognition model generation methods provided by the disclosed method embodiments.
In some embodiments, the electronic device 110 may further optionally include: a peripheral interface 1103 and at least one peripheral. The processor 1101, memory 1102 and peripheral interface 1103 may be connected by a bus or signal lines, and various peripheral devices may be connected to the peripheral interface 1103 by buses, signal lines, or circuit boards. Specifically, the peripherals include at least one of: radio frequency circuitry 1104, touch display screen 1105, camera 1106, audio circuitry 1107, positioning component 1108, and power supply 1109.
The peripheral interface 1103 may be used to connect at least one peripheral associated with I/O (Input/Output) to the processor 1101 and the memory 1102. In some embodiments, the processor 1101, memory 1102, and peripheral interface 1103 are integrated on the same chip or circuit board; in some other embodiments, any one or both of the processor 1101, the memory 1102, and the peripheral device interface 1103 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 1104 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1104 communicates with communication networks and other communication devices via electromagnetic signals, converting an electric signal into an electromagnetic signal for transmission, or converting a received electromagnetic signal into an electric signal. Optionally, the radio frequency circuit 1104 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1104 may communicate with other speech recognition model generation devices via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or Wi-Fi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1104 may also include NFC (Near Field Communication) related circuits, which is not limited by this disclosure.
The display screen 1105 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1105 is a touch display screen, it also has the ability to capture touch signals on or over its surface; the touch signal may be input to the processor 1101 as a control signal for processing. At this point, the display screen 1105 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1105, provided on the front panel of the electronic device 110. The display screen 1105 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or the like.
Camera assembly 1106 is used to capture images or video. Optionally, camera assembly 1106 includes a front camera and a rear camera; in general, the front camera is provided on the front panel of the device and the rear camera on its rear panel. The audio circuitry 1107 may include a microphone and a speaker. The microphone is used to collect sound waves of the user and the environment, convert them into electric signals, and input the electric signals to the processor 1101 for processing or to the radio frequency circuit 1104 for voice communication. For stereo capture or noise reduction purposes, there may be multiple microphones, disposed at different locations of the electronic device 110; the microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electric signals from the processor 1101 or the radio frequency circuit 1104 into sound waves. The speaker can be a traditional diaphragm speaker or a piezoelectric ceramic speaker; a piezoelectric ceramic speaker can convert an electric signal into sound waves audible to humans, or into sound waves inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuitry 1107 may also include a headphone jack.
The positioning component 1108 is used to locate the current geographic location of the electronic device 110 to implement navigation or LBS (Location Based Service). The positioning component 1108 may be a positioning component based on the United States GPS (Global Positioning System), the Chinese BeiDou system, the Russian GLONASS system, or the European Union Galileo system.
The power supply 1109 is used to supply power to the various components in the electronic device 110. The power supply 1109 may be alternating current, direct current, disposable, or rechargeable. When the power supply 1109 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging, and may also support fast-charge technology.
In some embodiments, the electronic device 110 also includes one or more sensors 1110. The one or more sensors 1110 include, but are not limited to: acceleration sensors, gyroscope sensors, pressure sensors, fingerprint sensors, optical sensors, and proximity sensors.
The acceleration sensor may detect the magnitude of acceleration on the three coordinate axes of a coordinate system established with the electronic device 110. The gyroscope sensor may detect the body direction and rotation angle of the electronic device 110, and may cooperate with the acceleration sensor to acquire the 3D motion of the user on the electronic device 110. The pressure sensors may be disposed on a side bezel of the electronic device 110 and/or underneath the touch display screen 1105. When the pressure sensor is disposed on the side frame of the electronic device 110, a holding signal of the user on the electronic device 110 may be detected. The fingerprint sensor is used to collect the user's fingerprint. The optical sensor is used to collect the intensity of ambient light. The proximity sensor, also known as a distance sensor, is typically provided on the front panel of the electronic device 110 and is used to capture the distance between the user and the front of the electronic device 110.
The present disclosure also provides a computer-readable storage medium having instructions stored thereon, which, when executed by a processor of a speech recognition model generation apparatus, enable the speech recognition model generation apparatus to perform the speech recognition model generation method provided by the present disclosure.
The embodiment of the present disclosure also provides a computer program product containing instructions, which when run on a speech recognition model generation apparatus, causes the speech recognition model generation apparatus to execute the speech recognition model generation method provided by the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for generating a speech recognition model, comprising:
acquiring voice sample data; the voice sample data comprises a voice sample and a voice sample label;
constructing a first voice recognition super network, wherein the first voice recognition super network comprises a plurality of layers of network structures, each layer of network structure corresponds to a plurality of different combinations of values of search features, and the search features comprise the number of branches, network layer dimensions and channel selection dimensions;
based on the voice sample and the voice sample label, training operation is carried out on the first voice recognition hyper-network to obtain a second voice recognition hyper-network; the voice sample label is used as an expected identification value corresponding to the voice sample;
performing network search on the second voice recognition super network to obtain a target voice recognition sub network; the target speech recognition sub-network comprises the multilayer network structure, and each layer of network structure corresponds to a combination of the search characteristic values;
and retraining the target speech recognition sub-network to obtain a speech recognition model.
2. The method of claim 1, further comprising:
determining the number of a plurality of branches of each layer of network structure in the first voice recognition hyper network, and constructing a branch module of each layer of network structure according to the number of each branch;
and performing first data processing on the characteristics of the branch modules of each layer of the network structure to obtain a plurality of network layer dimensions of each layer of the network structure, wherein the first data processing comprises at least one of merging processing, multi-order processing or splicing processing.
3. The method of claim 1, further comprising:
determining the number of a plurality of branches of each layer of network structure in the first voice recognition hyper network, and constructing a plurality of branch modules of each layer of network structure according to the number of each branch;
and performing second data processing on the channel selection layer of the branch module of each layer of the network structure to obtain a plurality of channel selection dimensions of each layer of the network structure, wherein the second data processing comprises full connection processing and/or matrix processing.
4. The method of claim 1, wherein said performing a training operation on the first speech recognition super-network based on the speech samples and the speech sample labels to obtain a second speech recognition super-network comprises:
step A: randomly sampling a plurality of different combinations of the search feature values corresponding to each layer of the network structure in the first voice recognition super network to obtain a combination of the search feature values corresponding to each layer of the network structure; obtaining the first voice recognition sub-network based on a combination of the search feature values corresponding to each layer of the network structure;
and B: training the first voice recognition sub-network according to the voice sample and the voice sample label to obtain a second voice recognition sub-network;
and C: synchronizing parameters in the second speech recognition subnetwork into the first speech recognition subnetwork;
and (D) iteratively executing the step A to the step C to obtain the second voice recognition hyper-network.
5. The method of claim 4, wherein the speech samples comprise a plurality of groups of speech subsamples, and wherein step B comprises:
and training the first voice recognition sub-network in multiple batches according to the voice samples and the voice sample labels to obtain multiple second voice recognition sub-networks, wherein each batch of training adopts a group of voice sub-samples.
6. The method of claim 1, wherein said performing a network search on the second speech recognition super-network to obtain a target speech recognition sub-network comprises:
sampling the second speech recognition super-network a plurality of times to obtain a plurality of third speech recognition sub-networks, wherein each sampling comprises randomly sampling one of the plurality of different combinations of search feature values corresponding to each layer of the network structure in the second speech recognition super-network, and obtaining a third speech recognition sub-network according to the combination sampled for each layer;
determining an error rate of each of the plurality of third speech recognition sub-networks;
and determining, from the plurality of third speech recognition sub-networks, a third speech recognition sub-network whose error rate satisfies a preset condition as the target speech recognition sub-network.
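A hedged sketch of this search step: draw many candidate paths from the trained super-network, score each by an error rate, and keep the one satisfying the preset condition (here taken to be the lowest error). `evaluate_error_rate` is a hypothetical stand-in for, e.g., a word-error-rate measurement on held-out speech.

```python
import random

# Hypothetical scorer: deterministic fake error rate per path, standing in
# for a real evaluation of the sub-network on validation speech.
def evaluate_error_rate(path):
    rng = random.Random(hash(tuple(path)))
    return rng.uniform(0.05, 0.30)

num_layers, num_branches, num_samples = 4, 3, 50
candidates = [
    [random.randrange(num_branches) for _ in range(num_layers)]
    for _ in range(num_samples)
]
best_path = min(candidates, key=evaluate_error_rate)  # preset condition: min error
print("selected sub-network path:", best_path)
```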
7. A speech recognition model generation apparatus, comprising:
an acquisition module configured to acquire speech sample data, the speech sample data comprising the speech samples and the speech sample labels;
a processing module configured to construct a first speech recognition super-network, the first speech recognition super-network comprising a plurality of layers of network structures, each layer of network structure corresponding to a plurality of different combinations of search feature values, the search features comprising a number of branches, a network layer dimension, and a channel selection dimension;
the processing module is further configured to perform a training operation on the first speech recognition super-network based on the speech samples and the speech sample labels to obtain a second speech recognition super-network, the speech sample labels serving as expected recognition values corresponding to the speech samples;
the processing module is further configured to perform a network search on the second speech recognition super-network to obtain a target speech recognition sub-network, the target speech recognition sub-network comprising the multi-layer network structure, each layer of the network structure corresponding to one combination of the search feature values;
the processing module is further configured to retrain the target speech recognition sub-network to obtain a speech recognition model.
8. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the speech recognition model generation method of any one of claims 1-6.
9. A computer-readable storage medium having instructions stored thereon, wherein the instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the speech recognition model generation method of any of claims 1-6.
10. A computer program product comprising computer instructions which, when executed by an electronic device, cause the electronic device to implement the speech recognition model generation method of any one of claims 1-6.
CN202111095442.5A 2021-09-17 2021-09-17 Speech recognition model generation method, device, equipment and storage medium Pending CN113744729A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111095442.5A CN113744729A (en) 2021-09-17 2021-09-17 Speech recognition model generation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113744729A true CN113744729A (en) 2021-12-03

Family

ID=78739756

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111095442.5A Pending CN113744729A (en) 2021-09-17 2021-09-17 Speech recognition model generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113744729A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110751944A (en) * 2019-09-19 2020-02-04 平安科技(深圳)有限公司 Method, device, equipment and storage medium for constructing voice recognition model
CN111582453A (en) * 2020-05-09 2020-08-25 北京百度网讯科技有限公司 Method and device for generating neural network model
CN111968635A (en) * 2020-08-07 2020-11-20 北京小米松果电子有限公司 Speech recognition method, device and storage medium
CN112100466A (en) * 2020-09-25 2020-12-18 北京百度网讯科技有限公司 Method, device and equipment for generating search space and storage medium
CN112633471A (en) * 2020-12-17 2021-04-09 苏州浪潮智能科技有限公司 Method, system, device and medium for constructing neural network architecture search framework

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115083421A (en) * 2022-07-21 2022-09-20 中国科学院自动化研究所 Method and device for constructing automatic parameter-searching speech identification model
CN115083421B (en) * 2022-07-21 2022-11-15 中国科学院自动化研究所 Method and device for constructing automatic parameter-searching speech identification model

Similar Documents

Publication Publication Date Title
CN110364144B (en) Speech recognition model training method and device
US11657799B2 (en) Pre-training with alignments for recurrent neural network transducer based end-to-end speech recognition
US10468032B2 (en) Method and system of speaker recognition using context aware confidence modeling
CN109299315B (en) Multimedia resource classification method and device, computer equipment and storage medium
US9818431B2 (en) Multi-speaker speech separation
US20220172737A1 (en) Speech signal processing method and speech separation method
CN110853618A (en) Language identification method, model training method, device and equipment
CN110163380B (en) Data analysis method, model training method, device, equipment and storage medium
WO2016037311A1 (en) Variable-component deep neural network for robust speech recognition
US20160307565A1 (en) Deep neural support vector machines
CN110853617B (en) Model training method, language identification method, device and equipment
CN111816159B (en) Language identification method and related device
CN112185352A (en) Voice recognition method and device and electronic equipment
CN112329826A (en) Training method of image recognition model, image recognition method and device
CN113763933B (en) Speech recognition method, training method, device and equipment of speech recognition model
CN110114765B (en) Electronic device performing translation by sharing context of utterance and operating method thereof
CN111816162A (en) Voice change information detection method, model training method and related device
CN113744729A (en) Speech recognition model generation method, device, equipment and storage medium
CN115910062A (en) Audio recognition method, device, equipment and storage medium
CN113851113A (en) Model training method and device and voice awakening method and device
CN117063229A (en) Interactive voice signal processing method, related equipment and system
CN114360528B (en) Speech recognition method, device, computer equipment and storage medium
CN116386647B (en) Audio verification method, related device, storage medium and program product
CN116700684B (en) Code generation method and terminal
CN114495938B (en) Audio identification method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination