CN115983372A - Neural network training method and device, computing equipment and storage medium - Google Patents

Neural network training method and device, computing equipment and storage medium Download PDF

Info

Publication number
CN115983372A
Authority
CN
China
Prior art keywords
network
neural network
data
sub
target neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211517866.0A
Other languages
Chinese (zh)
Inventor
赵娟萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zeku Technology Shanghai Corp Ltd
Original Assignee
Zeku Technology Shanghai Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zeku Technology Shanghai Corp Ltd filed Critical Zeku Technology Shanghai Corp Ltd
Priority to CN202211517866.0A priority Critical patent/CN115983372A/en
Publication of CN115983372A publication Critical patent/CN115983372A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

The application relates to a neural network training method and apparatus, a computing device, and a storage medium, wherein the method includes: acquiring a first neural network model and a second neural network model; and performing, according to the first neural network model and the second neural network model, joint training of dual-network knowledge distillation on the super network at the sub-network dimension and at the network-layer dimension of the sub-networks, to obtain a target neural network; wherein the target neural network is a trained super network. With the present application, network performance is improved through dual-network joint knowledge distillation.

Description

Neural network training method and device, computing equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a neural network training method, apparatus, computing device, and storage medium.
Background
Knowledge distillation is a common model-compression method: a small, lightweight model is constructed and trained with the supervision information of a larger model with better performance, so that the small model approaches that performance and accuracy.
Disclosure of Invention
The application provides a neural network training method, a network searching method, a data identification method, a neural network training device, a network searching device, a data identification device, a computing device and a storage medium.
According to an aspect of the present application, there is provided a neural network training method, including:
acquiring a first neural network model and a second neural network model;
performing, according to the first neural network model and the second neural network model, joint training of dual-network knowledge distillation on the super network at the sub-network dimension and at the network-layer dimension of the sub-networks, to obtain a target neural network; wherein the target neural network is a trained super network.
According to another aspect of the present application, there is provided a network search method including:
initiating a search request, wherein the search request represents an operation request for searching a target neural network under a computing-power constraint, and the target neural network is a trained super network obtained by any one of the above methods;
and responding to the search request to obtain a sub-network of the target neural network that satisfies the computing-power constraint.
According to an aspect of the present application, there is provided a data identification method, including:
inputting data into a target neural network, wherein the target neural network is a trained hyper-network obtained by adopting any one of the above;
identifying the data according to the target neural network to obtain target data;
wherein the data includes: at least one of image data, video data, text data, and voice data.
According to an aspect of the present application, there is provided a data identification method, including:
inputting data into a target neural network, wherein the target neural network is a trained hyper-network obtained by adopting any one of the above;
determining a sub-network of the target neural network that satisfies the computing-power constraint;
identifying the data according to the sub-network to obtain target data;
wherein the data includes: at least one of image data, video data, text data, and voice data.
According to another aspect of the present application, there is provided a neural network training apparatus, including:
the acquisition unit is used for acquiring a first neural network model and a second neural network model;
the joint training unit is used for performing, according to the first neural network model and the second neural network model, joint training of dual-network knowledge distillation on the super network at the sub-network dimension and at the network-layer dimension of the sub-networks, to obtain a target neural network; wherein the target neural network is a trained super network.
According to another aspect of the present application, there is provided a network search apparatus, including:
the searching unit is used for initiating a search request, wherein the search request represents an operation request for searching a target neural network under a computing-power constraint, and the target neural network is a trained super network obtained by any one of the above methods;
and the response unit is used for responding to the search request to obtain a sub-network of the target neural network that satisfies the computing-power constraint.
According to another aspect of the present application, there is provided a data recognition apparatus including:
the first input unit is used for inputting data into a target neural network, wherein the target neural network is a trained hyper-network obtained by adopting any one of the above steps;
the first identification unit is used for identifying the data according to the target neural network to obtain target data;
wherein the data includes: at least one of image data, video data, text data, and voice data.
According to another aspect of the present application, there is provided a data recognition apparatus including:
the second input unit is used for inputting data into a target neural network, wherein the target neural network is a trained super network obtained by adopting any one of the above methods;
a determining unit, configured to determine a sub-network of the target neural network that satisfies the computing-power constraint;
the second identification unit is used for identifying the data according to the sub-network to obtain target data;
wherein the data includes: at least one of image data, video data, text data, and voice data.
According to another aspect of the present application, there is provided a computing device comprising: and the processor is used for calling and running the computer program from the memory so as to enable the computing equipment to execute the method provided by any embodiment of the application.
According to another aspect of the present application, there is provided a computer-readable storage medium storing a computer program which, when executed by an apparatus, causes the apparatus to perform the method provided by any embodiment of the present application.
With this method, a first neural network model and a second neural network model can be acquired, and joint training of dual-network knowledge distillation is performed, according to the first neural network model and the second neural network model, on the super network at the sub-network dimension and at the network-layer dimension of the sub-networks, so as to obtain a target neural network; wherein the target neural network is a trained super network. Because the two networks are combined for the knowledge distillation, the result has the performance advantages of both networks, and network performance is therefore improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIGS. 1-3 are schematic diagrams of an example of Once-For-All in the related art;
FIG. 4 is a schematic diagram of an Once-For-All application deployment for searching;
FIG. 5 is a schematic diagram of Once-For-All hyper-network training in the related art;
FIG. 6 is a schematic diagram of a distributed cluster processing scenario according to an embodiment of the present application;
FIG. 7 is a schematic flow chart diagram of a neural network training method according to an embodiment of the present application;
FIG. 8 is a flowchart illustrating a network searching method according to an embodiment of the present application;
FIG. 9 is a schematic flow chart diagram of a data identification method according to an embodiment of the present application;
FIG. 10 is a schematic flow chart diagram of a data identification method according to an embodiment of the present application;
FIG. 11 is a schematic diagram of the knowledge distillation of a dual-network combination in an application example according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a hybrid computing unit in an application example according to an embodiment of the present application;
FIGS. 13-14 are schematic diagrams of the composition of a sub-network in an application example according to an embodiment of the present application;
FIG. 15 is a schematic diagram of a component structure of a neural network training device according to an embodiment of the present application;
fig. 16 is a schematic structural diagram of a network search apparatus according to an embodiment of the present application;
FIG. 17 is a schematic diagram of a component structure of a data recognition device according to an embodiment of the present application;
FIG. 18 is a block diagram of a data recognition device according to an embodiment of the present application;
fig. 19 is a block diagram of an electronic device for implementing the neural network training method/network searching method/data recognition method according to an embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The term "at least one" herein means any one of, or any combination of at least two of, a plurality of items; for example, "including at least one of A, B and C" may mean including any one or more elements selected from the set consisting of A, B and C. The terms "first" and "second" herein are used to refer to and distinguish between similar objects, and do not necessarily imply an order or sequence, nor do they limit the number to two; a "first" item and a "second" item may each be one or more.
In order to facilitate understanding of the technical solutions of the embodiments of the present application, technical terms and basic concepts related to the embodiments of the present application will be briefly described below.
1. Knowledge distillation: a common model-compression method. Unlike pruning and quantization, which are other forms of model compression, knowledge distillation trains a small, lightweight model (i.e., a student model) with the supervision information of a larger model with better performance (i.e., a teacher model), so that the small model achieves better performance and accuracy. The small model may also be called a small network and, correspondingly, the student model may also be called a student network; the large model may also be called a large network and, correspondingly, the teacher model may also be called a teacher network. The process by which the student model learns and inherits the supervision information from the teacher model is referred to as knowledge distillation (a minimal code sketch of this supervision follows these term definitions).
2. Super network (SuperNet): in neural network architecture search, the search space can be represented by a super network. The super network includes a plurality of blocks, which may be connected sequentially in a single chain, that is, a block may be connected to one upstream block and one downstream block. A block may in turn include multiple layers. At least one layer that the target neural network may include can be obtained by searching in the super network, so that an optimal target neural network is obtained.
3. Sub-networks: if the super network is referred to as a large network, the large network is divided into a plurality of small networks, each of which is a sub-network in the large network, in other words, a large network (i.e., a super network) can be obtained by combining the plurality of small networks.
4. Loss function: a function that measures the degree of prediction error. To evaluate the training effect during model (or neural network model) training, a loss function needs to be defined in advance; whether the training effect is optimal is judged by this function, derivatives are taken during back-propagation of the loss, and the gradients are used to keep updating and optimizing the parameters. The goal is to minimize the loss function so that the model training effect becomes optimal.
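To make the above definitions concrete, the following is a minimal sketch (PyTorch is assumed as the framework) of a knowledge-distillation loss in which a student is supervised by a teacher's softened outputs; the temperature T and the weighting alpha are illustrative assumptions, not values prescribed by this application.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft-target term: the student mimics the teacher's softened output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary supervised loss against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    # The combined value is the loss function minimized during student training.
    return alpha * soft + (1.0 - alpha) * hard
```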
Once-For-All (OFA) is a neural network search solution proposed from the perspective of convenient neural network deployment. The scheme designs an Once-For-All network (also called an Once-For-All super network or OFA super network) that can be deployed directly under different architecture configurations so that the training cost is shared. Inference can be performed by selecting only a portion of the sub-networks in the Once-For-All super network, and the OFA super network can flexibly support different depths, widths, kernel sizes and resolutions without retraining. A simple OFA example is shown in FIG. 1: after an OFA network is trained once, multiple specialized sub-nets can be deployed, for example a sub-net for cloud AI, a sub-net for mobile AI, a sub-net for micro AI, and so on. These specialized sub-nets can be deployed directly (direct deployment) without retraining (no retrain). As shown in FIG. 2, the design cost of using OFA is essentially unchanged as the number of deployment scenarios grows; compared with the common train-and-deploy scheme, the design cost is significantly reduced. As shown in FIG. 3, the horizontal axis is the measured latency of the searched network on a certain NPU, and the vertical axis is the classification accuracy on the ImageNet data set; the closer the searched network is to the upper-left corner, the better its performance. With OFA, training once yields many networks (train once, get many), whereas a conventional training method such as MobileNetV3 needs as many training runs as networks, for example training four times to obtain four networks (train four times, get four). In this comparison, OFA performs much better.
An exemplary OFA flow is shown in FIG. 4. First, a corresponding dynamic network is constructed from the original static network (S401); the constructed dynamic network may vary the number of channels per layer, the convolution kernel size, the network depth, the input image resolution, and so on. Then, a super network (SuperNet) is trained by inputting the training data in the training set together with the ground truth (GT) corresponding to that data (S402); the super network carries information such as the maximum channel number and the maximum convolution kernel size. After the super network is trained, sub-networks are randomly sampled from it and encoded to obtain sub-network structure encodings (i.e., sub-network structure configurations), and the accuracy value corresponding to each sub-network is obtained, which generates the samples for an accuracy predictor (S403). Based on the accuracy-predictor training data (i.e., the samples) collected in the above step, a simple accuracy predictor model (or accuracy prediction model), such as a Multi-Layer Perceptron (MLP), is constructed (S404) and trained until convergence (S405). Then, under a given computing-power constraint, for example a given number of floating-point operations (FLOPs), the sub-network configurations satisfying the constraint can be searched (S406). Finally, according to the searched sub-network structure configuration, the corresponding sub-network structure and weights are obtained (S407), i.e., the optimal configuration is output.
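The flow S401-S407 can be summarized in the following Python sketch; all helper names (build_dynamic_network, train_supernet, sample_subnets, encode, evaluate, build_mlp_predictor, train_predictor, search, extract_subnet) are hypothetical placeholders used only to show the ordering of the steps.

```python
def ofa_pipeline(static_net, train_loader, flops_budget):
    # S401: build a dynamic network that can vary depth, width, kernel size and resolution.
    dynamic_net = build_dynamic_network(static_net)
    # S402: train the super network with the training data and its ground truth.
    supernet = train_supernet(dynamic_net, train_loader)
    # S403: sample sub-networks, encode their structures and measure their accuracy.
    samples = [(encode(sub), evaluate(sub)) for sub in sample_subnets(supernet, n=5000)]
    # S404/S405: build a simple MLP accuracy predictor and train it to convergence.
    predictor = build_mlp_predictor()
    train_predictor(predictor, samples)
    # S406: search a sub-network configuration that satisfies the FLOPs constraint.
    best_code = search(predictor, constraint=flops_budget)
    # S407: extract the corresponding sub-network structure and weights.
    return extract_subnet(supernet, best_code)
```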
During OFA training of the super network, a sub-network structure is randomly sampled for training at each step. Training all sub-networks to perform as well as the large network, within the same large-network model framework, is a significant challenge. FIG. 5 shows the training process of an OFA super network; typically, the performance of the small networks degrades to varying degrees compared with the performance of the large network. The training process may include: first, the iteration count is set to 1 (S501). A batch of the data set required for training is acquired (S502). An active sub-network is randomly sampled (S503) and trained with that data (S504). It is then judged whether all the data (the training data set) have been covered (S505), where covering all the data means that each training epoch must traverse all the data in the training set, i.e., all data must be propagated through the network forward and backward at least once during training. If not all the data have been covered, the process returns to acquire the next batch (S502). If all the data have been covered, it is judged whether the iteration count is less than the maximum value (S506); if so, the iteration count is incremented by 1 and training continues, otherwise the process ends. In addition, estimating power consumption from the toggle rates of the internal circuit signals of an NPU requires collecting a large number of circuit-signal toggle rates for different network structures under different input sizes, which limits the design efficiency of the software-side AI network.
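The loop S501-S506 maps onto a short training sketch like the one below (PyTorch assumed); sample_active_subnet() is a hypothetical method standing in for the random sub-network sampling of S503.

```python
import torch.nn.functional as F

def train_supernet(supernet, train_loader, optimizer, max_iters):
    for iteration in range(1, max_iters + 1):                # S501 / S506: iteration counter
        for images, targets in train_loader:                 # S502 / S505: traverse the whole training set
            subnet = supernet.sample_active_subnet()         # S503: randomly activate one sub-network (hypothetical API)
            loss = F.cross_entropy(subnet(images), targets)  # S504: train the active sub-network
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```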
In summary, most Neural Architecture Search (NAS) methods tend to search only for a specific device or for a platform with specific resource constraints; for a different device, training usually has to start from scratch on that device. Such methods scale poorly and their computation cost is too high. From this perspective, the OFA super network decouples the training and searching processes, so that a single OFA super network (SuperNet) supporting different architecture configurations can be trained, and a specialized sub-network can be obtained, without additional training, simply by selecting a sub-network from the OFA super network. However, in the training process of the original OFA super network, a sub-network structure is randomly sampled for training at each step, and it is challenging to train all sub-networks to perform as well as the large network within the same large-network model framework; generally, the performance of the sub-networks degrades to varying degrees compared with the performance of the large network.
The embodiment of the present application provides a neural network training method that makes full use of multiple teacher models. At least one of the teacher models is a dynamic network including a hybrid computing unit, and the other teacher models may use the original base network. The multiple teacher models boost the performance of the sub-networks, and dynamic knowledge distillation is realized in two dimensions, namely the sub-network dimension and each network layer within the sub-network.
Fig. 6 is a schematic diagram of a distributed cluster processing scenario according to an embodiment of the present application. The distributed cluster system is one example of a cluster system and illustrates that data processing can be performed with a distributed cluster system; the present application is not limited to knowledge distillation on a single machine or on multiple machines, and the accuracy of knowledge distillation can be further improved by distributed processing. As shown in fig. 6, the distributed cluster system 600 includes a plurality of nodes (e.g., a server cluster 601, a server 602, a server cluster 603, a server 604, and a server 605, where the server 605 may further connect to electronic devices such as a handset 6051 and a desktop 6052), and the plurality of nodes together with the connected electronic devices may jointly perform one or more data processing tasks. Optionally, the plurality of nodes in the distributed cluster system may perform the model training related to knowledge distillation in a data-parallel manner, in which case the nodes follow the same model training procedure; if the nodes use a model-parallel manner, they may perform the knowledge-distillation-related model training with different training procedures. Optionally, after each round of model training is completed, data exchange (e.g., data synchronization) may be performed between the nodes.
According to an embodiment of the present application, a neural network training method is provided, and fig. 7 is a schematic flow chart of the neural network training method according to the embodiment of the present application, which may be applied to a neural network training device, for example, the device may be deployed in an electronic device (such as a terminal or a server) or other processing equipment in a single machine, multiple machines or a cluster system, and may implement processes such as knowledge distillation. The terminal may be a User Equipment (UE), a mobile device, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the method may also be implemented by a processor calling computer readable instructions stored in a memory. As shown in fig. 7, the method is applied to any node or electronic device, such as a mobile phone or a desktop, in the cluster system shown in fig. 6, and includes:
s701, obtaining a first neural network model and a second neural network model.
In some examples, the first neural network model may be a static network, and the second neural network model may be a dynamic network including a hybrid computing unit. Specifically, the first neural network model and the second neural network model may both be teacher models; the first neural network model is not limited to a single teacher model and may be a plurality of teacher models, and likewise the second neural network model is not limited to a single teacher model and may be a plurality of teacher models. It is noted that at least one teacher model is a dynamic network including a hybrid computing unit, and the other teacher models may use the original base network, i.e., a static network.
S702, performing, according to the first neural network model and the second neural network model, joint training of dual-network knowledge distillation on the super network at the sub-network dimension and at the network-layer dimension of the sub-networks, to obtain a target neural network; wherein the target neural network is a trained super network.
In some examples, a first loss function may be obtained in the forward propagation of the first neural network model, a second loss function may be obtained in the forward propagation of the second neural network model, and a third loss function corresponding to the joint training for dual-network knowledge distillation may be obtained from the first loss function and the second loss function. Joint training of dual-network knowledge distillation is then performed on the super network at the sub-network dimension and at the network-layer dimension of the sub-networks according to the back propagation of the third loss function, so as to obtain the target neural network.
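A minimal sketch of one such joint training step is shown below (PyTorch assumed). How L1 and L2 compare the sub-network with each teacher is not fixed by this description, so the mean-squared error on the outputs and the fixed coefficients a1 and a2 are assumptions for illustration; the per-layer coefficients are discussed later in the application example.

```python
import torch
import torch.nn.functional as F

def joint_distillation_step(subnet, teacher1, teacher2, images, targets, optimizer, a1=0.5, a2=0.5):
    with torch.no_grad():               # the teachers only provide supervision signals
        t1 = teacher1(images)           # forward propagation of the first (static) teacher
        t2 = teacher2(images)           # forward propagation of the second (mixed-cell) teacher
    s = subnet(images)
    base = F.cross_entropy(s, targets)  # original task loss L
    l1 = F.mse_loss(s, t1)              # first loss term L1 (MSE is an assumed choice)
    l2 = F.mse_loss(s, t2)              # second loss term L2
    loss = base + a1 * l1 + a2 * l2     # third loss used for back propagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```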
With the embodiment of the present application, multiple teacher models are fully utilized: at least one of the teacher models is a dynamic network including a hybrid computing unit, and the other teacher models may use the original base network. The multiple teacher models boost the performance of the sub-networks, and dynamic knowledge distillation is realized in the sub-network dimension and in the network-layer dimension of the sub-networks, so that degradation of sub-network performance is avoided.
In one possible implementation, each sub-network in the target neural network has the same processing performance as the target neural network, so that in application scenarios such as data recognition, the sub-networks achieve the processing performance of the target neural network at a higher processing speed.
In one possible implementation, the hybrid computing unit is a hybrid unit including at least two computation sub-layers. Feature data are input into the hybrid computing unit, fusion processing is performed in the hybrid computing unit by the at least two computation sub-layers (which may include convolution layers that use different convolution kernels or different convolution operation modes), a fusion processing result is obtained, and the fusion processing result is used as the output of the hybrid computing unit. Because the fusion processing result retains more features, a more accurate target neural network can be trained.
Fig. 8 is a schematic flowchart of a network searching method according to an embodiment of the present application, where the method includes:
S801, initiating a search request, wherein the search request represents an operation request for searching a target neural network under a computing-power constraint, and the target neural network is a trained super network obtained by any one of the above methods.
S802, responding to the search request to obtain a sub-network of the target neural network that satisfies the computing-power constraint.
In some examples, sub-network configuration information corresponding to the sub-network may be obtained, and the sub-network may be adapted to different hardware processing platforms according to the sub-network configuration information. The sub-network configuration information includes the sub-network structure and the sub-network weights, where the sub-network structure may include at least one of the number of channels, the convolution kernel size, and the network depth of each network layer constituting the sub-network.
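For illustration, a sub-network configuration record might look like the following sketch; the field names, the layer counts and the weight path are assumptions, while the structure fields mirror the channel number, kernel size and depth listed above.

```python
# Hypothetical sub-network configuration used to adapt the sub-network to a hardware platform.
subnet_config = {
    "structure": {
        "depth": 12,                                                    # number of active network layers
        "channels": [16, 24, 24, 40, 40, 80, 80, 96, 96, 192, 192, 320],
        "kernel_sizes": [3, 3, 5, 3, 5, 5, 3, 3, 5, 3, 5, 3],
    },
    "weights": "extracted_subnet_weights.pt",                           # weights sliced from the trained super network
}
```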
With the embodiment of the present application, a sub-network satisfying the computing-power constraint can be searched from the target neural network, and the sub-network offers a better balance of accuracy and computing power, which makes device-side deployment easier.
Fig. 9 is a schematic flowchart of a data identification method according to an embodiment of the present application, where the method may be applied to a data identification apparatus, and the method includes:
S901, inputting data into a target neural network, wherein the target neural network is a trained super network obtained by any one of the above.
S902, identifying the data according to the target neural network to obtain target data; wherein the data includes: at least one of image data, video data, text data, and voice data.
With the embodiment of the present application, the target neural network is deployed on the device side and the data are recognized by the target neural network, so that more accurate target data can be obtained; the application scenarios include at least one of image data recognition, video data recognition, text data recognition, and voice data recognition.
Fig. 10 is a schematic flowchart of a data identification method according to an embodiment of the present application, where the method may be applied to a data identification apparatus, and the method includes:
S1001, inputting data into a target neural network, wherein the target neural network is a trained super network obtained by any one of the above;
S1002, determining a sub-network of the target neural network that satisfies the computing-power constraint;
S1003, identifying the data according to the sub-network to obtain target data; wherein the data includes: at least one of image data, video data, text data, and voice data.
With the embodiment of the present application, a sub-network determined from the target neural network is deployed on the device side, and the data are recognized by the sub-network, so that more accurate target data can be obtained.
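The two recognition variants above can be sketched together as follows (PyTorch assumed); select_subnet() is a hypothetical helper that wraps the search for a sub-network under the computing-power constraint.

```python
import torch

def recognize(target_network, data, flops_budget=None):
    # If a budget is given, first pick a sub-network that satisfies it; otherwise use the full network.
    model = target_network if flops_budget is None else select_subnet(target_network, flops_budget)
    model.eval()
    with torch.no_grad():
        return model(data)   # target data, e.g. class scores for a batch of images
```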
In an application example adopting the inventive concept of the above embodiments, the problem of sub-network performance degradation is addressed by dual-network joint knowledge distillation, which uses two teacher networks to assist the performance improvement of the sub-networks. One teacher network is the original static network model; the other teacher network uses a network model based on a hybrid computing unit. Considering that different teacher networks contribute differently to a sub-network at different network layers, in the dual-network joint knowledge distillation the contribution rate of each teacher model to the different network layers of the current sub-network can be further obtained from the gradient difference, combining the sub-network dimension and the network-layer dimension.
In a network model based on hybrid computing units, the super network can achieve relatively high accuracy, but if a large number of hybrid computing units were used in the super network itself, the computational cost of the network would be very high. Therefore, the network model based on the hybrid computing unit is used as an additional teacher network, together with the original static network, to improve the performance of the sub-networks. This ensures the accuracy of the sub-networks while avoiding excessive computational cost; in other words, the accuracy of the sub-networks can be significantly improved while preserving their computing-power advantage.
Fig. 11 is a schematic diagram of the dual-network joint knowledge distillation in an application example according to an embodiment of the present application. As shown in fig. 11, the dual-network combination mainly includes a first neural network model 1101 and a second neural network model 1102, and a target neural network (a trained super network) is obtained by performing joint training of dual-network knowledge distillation, through the first neural network model 1101 and the second neural network model 1102, on the sub-network dimension of the super network. The first neural network model 1101 may be a first teacher network model based on a static base model, and the second neural network model 1102 may be a teacher model based on a hybrid computing unit (Mixed-Cell-based Teacher Model), denoted as the second teacher network model to distinguish it from the first. The first teacher network model is the original static network model and needs no further description here. Compared with the original static network model, the second teacher network model mainly replaces the down-sampling modules of the original static network model with hybrid computing units; compared with a single down-sampling module, the hybrid computing unit retains more semantic information of the image while still reducing the size of the output feature map, which improves the accuracy of the sub-networks in the target neural network and thereby avoids degradation of sub-network performance.
A search method for a sub-network in the super network, in which the two teacher network models, namely the first teacher network model and the second teacher network model, assist the performance improvement of the sub-networks during super-network training, includes: acquiring training data, constructing a baseline model, training a once-trained multi-deployment super network (one-for-all SuperNet), searching under the computing-power constraint (or limitation), retraining the searched model, and using the final model for inference.
For the hybrid computing unit, fig. 12 is a schematic diagram of a hybrid computing unit in an application example according to an embodiment of the present application. As shown in fig. 12, the hybrid computing unit is a unit including at least two computation sub-layers. Specifically, fusion processing may be performed in the hybrid computing unit by the at least two computation sub-layers (which may include convolution layers that use different convolution kernels or different convolution operation modes). For example, several optional operators are applied between the input and the output of the hybrid computing unit and their outputs are fused (one optional fusion mode is a weighted sum of the output feature maps of the optional operators to obtain the final output); the fusion processing result is finally obtained and used as the output of the hybrid computing unit.
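A minimal PyTorch sketch of such a hybrid computing unit is given below. The concrete branch set (3x3, 5x5 and dilated 3x3 convolutions) and the learnable softmax fusion weights are assumptions; the point being illustrated is the weighted-sum fusion of several optional operators between one input and one output.

```python
import torch
import torch.nn as nn

class MixedCell(nn.Module):
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        # Candidate operators between the input and the output of the unit (illustrative choices).
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),              # 3x3 convolution
            nn.Conv2d(in_ch, out_ch, 5, stride=stride, padding=2),              # 5x5 convolution
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=2, dilation=2),  # dilated 3x3 convolution
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.branches)))  # learnable fusion weights

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=0)
        # Weighted sum of the branch feature maps is the fused output of the unit.
        return sum(wi * branch(x) for wi, branch in zip(w, self.branches))
```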
Based on the schematic diagram of the above-mentioned dual-network joint knowledge distillation shown in fig. 11, the process of implementing the knowledge distillation may include the following:
1) Determining dynamic search variables, such as the number of channels per layer in the super network;
2) Constructing a dynamic super network and determining the sampling mode of the sub-networks in the super network;
3) Training the two teacher network models, namely the first teacher network model and the second teacher network model, so that the two teacher network models assist the performance improvement of the sub-networks during super-network training;
in some examples, the input picture may be marked as I, the loss functions obtained through the forward propagation calculation of the first teacher network model and the second teacher network model are L1 and L2, respectively, and the regularization term of the loss function of the knowledge distillation is a1 × L1+ a2 × L2. The a1 and the a2 are regularization term coefficients corresponding to the first teacher network model and the second teacher network model respectively, and the regularization term coefficients can be set through practical experience.
Further, as illustrated in figs. 13 to 14, the sub-network index and the network-layer index may be represented by two dimensions i and j, respectively. For the first teacher network model, the contribution of its regularization term to the loss function at the j-th layer of the i-th sub-network, a1(i, j), can be expressed by formula (1); for the second teacher network model, the corresponding contribution a2(i, j) can be expressed by formula (2):
a1(i,j)=cos(dL1(i,j)/dw(i,j),dL(i,j)/dw(i,j)) (1)
a2(i,j)=cos(dL2(i,j)/dw(i,j),dL(i,j)/dw(i,j)) (2)
In formulas (1)-(2), w(i, j) denotes the trainable parameters, such as the weights of the convolution layer; L is the original loss function; d() denotes taking the gradient; and cos() denotes the cosine. The cosine thus measures the difference between the two gradient-optimization vectors and evaluates the degree to which each teacher contributes to the optimization of the sub-network.
Specifically, in the knowledge-distillation process, when training the Once-For-All super network, the optimization target of the i-th sub-network is: min Σ_j [ L(i, j) + a1(i, j) × L1(i, j) + a2(i, j) × L2(i, j) ], where L(i, j) is the loss function of the j-th network layer of the i-th sub-network in the Once-For-All super network, a1(i, j) is the regularization-term coefficient of the first teacher network model for the j-th network layer of the i-th sub-network, L1(i, j) is the loss function of the first teacher network model for the j-th network layer of the i-th sub-network, a2(i, j) is the regularization-term coefficient of the second teacher network model for the j-th network layer of the i-th sub-network, and L2(i, j) is the loss function of the second teacher network model for the j-th network layer of the i-th sub-network.
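Formulas (1)-(2) can be computed directly from per-layer gradients, as in the following sketch (PyTorch assumed); how the loss terms L, L1 and L2 are formed is as in the earlier joint-training sketch, and retain_graph is needed because several gradients are taken from one forward pass.

```python
import torch
import torch.nn.functional as F

def layer_coefficients(layer_weight, base_loss, t1_loss, t2_loss):
    # Gradients of L, L1 and L2 with respect to the weights w(i, j) of this layer.
    g  = torch.autograd.grad(base_loss, layer_weight, retain_graph=True)[0].flatten()
    g1 = torch.autograd.grad(t1_loss, layer_weight, retain_graph=True)[0].flatten()
    g2 = torch.autograd.grad(t2_loss, layer_weight, retain_graph=True)[0].flatten()
    # Formulas (1)-(2): cosine between each teacher's gradient and the original gradient.
    a1 = F.cosine_similarity(g1, g, dim=0)
    a2 = F.cosine_similarity(g2, g, dim=0)
    return a1.detach(), a2.detach()
```

The per-layer objective min Σ_j [ L(i, j) + a1(i, j) × L1(i, j) + a2(i, j) × L2(i, j) ] then reuses these coefficients when the combined loss is back-propagated.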
4) Sampling the data required for training the accuracy predictor, for example selecting 5000 data pairs of the form [sub-network structure encoding, test accuracy];
5) Training the accuracy predictor, for which a Multi-Layer Perceptron (MLP) can be selected;
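A minimal accuracy-predictor sketch consistent with steps 4) and 5) is shown below (PyTorch assumed); the hidden-layer widths and the MSE training objective are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AccuracyPredictor(nn.Module):
    def __init__(self, encoding_dim):
        super().__init__()
        # Input: a sub-network structure encoding; output: the predicted test accuracy.
        self.net = nn.Sequential(
            nn.Linear(encoding_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, encoding):
        return self.net(encoding).squeeze(-1)

def train_predictor(predictor, encodings, accuracies, epochs=100, lr=1e-3):
    # encodings / accuracies come from the sampled [structure encoding, test accuracy] pairs.
    opt = torch.optim.Adam(predictor.parameters(), lr=lr)
    for _ in range(epochs):
        loss = F.mse_loss(predictor(encodings), accuracies)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return predictor
```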
6) Under a given computing-power constraint, for example a given amount of computation (FLOPs) and/or a given number of parameters (Params), searching for the optimal network structure configuration, thereby generating the optimal sub-network and extracting the weights of that sub-network. FLOPs mainly characterize how long the computation takes, while Params mainly characterize the space complexity, such as the amount of memory occupied.
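Step 6) can be sketched as a simple random search under the FLOPs constraint; sample_encoding() and count_flops() are hypothetical helpers, and an evolutionary search could equally be substituted.

```python
def search_under_constraint(predictor, sample_encoding, count_flops, flops_budget, num_candidates=1000):
    best_code, best_acc = None, float("-inf")
    for _ in range(num_candidates):
        code = sample_encoding()                 # random sub-network structure encoding
        if count_flops(code) > flops_budget:     # enforce the computing-power constraint
            continue
        acc = float(predictor(code))             # predicted accuracy, no real training or test needed
        if acc > best_acc:
            best_code, best_acc = code, acc
    return best_code                             # optimal configuration within the budget
```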
With this application example, the accuracy of the sub-networks in the Once-For-All framework can be improved, and sub-networks with a better balance of accuracy and computing power can be generated; the accuracy of a sub-network can be significantly improved while its computing-power advantage is preserved, which facilitates device-side deployment and improves both the recognition accuracy and the recognition efficiency of data-recognition scenarios.
It should be noted that the above examples may be combined with various possibilities in the embodiments of the present application, and are not described herein again.
According to an embodiment of the present application, there is provided a neural network training apparatus. Fig. 15 is a schematic structural diagram of the neural network training apparatus according to the embodiment of the present application. As shown in fig. 15, the neural network training apparatus includes: an obtaining unit 1501, configured to obtain a first neural network model and a second neural network model; and a joint training unit 1502, configured to perform, according to the first neural network model and the second neural network model, joint training of dual-network knowledge distillation on the super network at the sub-network dimension and at the network-layer dimension of the sub-networks, so as to obtain a target neural network; wherein the target neural network is a trained super network.
In one possible implementation, the first neural network model is a static network and the second neural network model is a dynamic network comprising hybrid computational units.
In one possible implementation, the joint training unit 1502 includes: a first loss subunit, configured to obtain a first loss function in the forward propagation of the first neural network model; a second loss subunit, configured to obtain a second loss function in the forward propagation of the second neural network model; a third loss subunit, configured to obtain, from the first loss function and the second loss function, a third loss function corresponding to the joint training for dual-network knowledge distillation; and a training subunit, configured to perform, according to the back propagation of the third loss function, joint training of dual-network knowledge distillation on the super network at the sub-network dimension and at the network-layer dimension of the sub-networks, to obtain the target neural network.
In one possible implementation, each sub-network in the target neural network is a sub-network having the same processing performance as the target neural network.
In one possible implementation, the hybrid computing unit is a hybrid unit including at least two computation sub-layers; the feature data are input into the hybrid computing unit, fusion processing is performed in the hybrid computing unit by the at least two computation sub-layers to obtain a fusion processing result, and the fusion processing result is used as the output of the hybrid computing unit.
In one possible implementation, the at least two computation layers include: convolution layers with different convolution kernels or different convolution operation modes are adopted.
According to an embodiment of the present application, there is provided a network search apparatus. Fig. 16 is a schematic structural diagram of the network search apparatus according to the embodiment of the present application. As shown in fig. 16, the network search apparatus includes: a search unit 1601, configured to initiate a search request, wherein the search request represents an operation request for searching a target neural network under a computing-power constraint, and the target neural network is a trained super network obtained by any one of the above methods; and a response unit 1602, configured to obtain, in response to the search request, a sub-network of the target neural network that satisfies the computing-power constraint.
In one possible implementation, the apparatus further includes: a configuration acquiring unit, configured to acquire sub-network configuration information corresponding to the sub-network; and the adapting unit is used for adapting the sub-network for different hardware processing platforms according to the sub-network configuration information.
In one possible implementation, the subnet configuration information includes: subnet structures and subnet weights; wherein the sub-network structure comprises: at least one of the number of channels, convolution kernel size, and network depth of each network layer constituting the sub-network.
According to an embodiment of the present application, there is provided a data recognition apparatus, and fig. 17 is a schematic diagram of a composition structure of the data recognition apparatus according to the embodiment of the present application, and as shown in fig. 17, the data recognition apparatus includes: a first input unit 1701 for inputting data into a target neural network, wherein the target neural network is a trained hyper-network obtained by any one of the above; a first identifying unit 1702, configured to identify the data according to the target neural network to obtain target data; wherein the data includes: at least one of image data, video data, text data, and voice data.
According to an embodiment of the present application, there is provided a data recognition apparatus, fig. 18 is a schematic diagram of a composition structure of the data recognition apparatus according to the embodiment of the present application, and as shown in fig. 18, the data recognition apparatus includes: a second input unit 1801, configured to input data into a target neural network, where the target neural network is a trained super-network obtained by using any one of the foregoing; a determining unit 1802, configured to determine sub-networks in the target neural network that satisfy a computational power constraint; a second identifying unit 1803, configured to identify the data according to the sub-network, so as to obtain target data; wherein the data includes: at least one of image data, video data, text data, and voice data.
FIG. 19 shows a schematic block diagram of an example electronic device 1900 that may be used to implement embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 19, the electronic apparatus 1900 includes a computing unit 1901, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1902 or a computer program loaded from a storage unit 1908 into a Random Access Memory (RAM) 1903. In the RAM 1903, various programs and data necessary for the operation of the electronic apparatus 1900 can also be stored. The calculation unit 1901, ROM 1902, and RAM 1903 are connected to each other via a bus 1904. An input/output (I/O) interface 1905 is also connected to bus 1904.
A number of components in electronic device 1900 are connected to I/O interface 1905, including: an input unit 1906 such as a keyboard, a mouse, and the like; an output unit 1907 such as various types of displays, speakers, and the like; a storage unit 1908 such as a magnetic disk, optical disk, or the like; and a communication unit 1909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 1909 allows the electronic device 1900 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 1901 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computation unit 1901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computation chips, various computation units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 1901 performs the respective methods and processes described above, such as the neural network training method/the network search method/the data recognition method. For example, in some embodiments, the neural network training method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 1908. In some embodiments, part or all of the computer program can be loaded and/or installed onto the electronic device 1900 via the ROM 1902 and/or the communication unit 1909. When the computer program is loaded into RAM 1903 and executed by computing unit 1901, one or more steps of the neural network training method/network searching method/data recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 1901 may be configured by any other suitable means (e.g., by means of firmware) to perform a neural network training method/network searching method/data recognition method.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present application may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solution of the present application can be achieved, and the present invention is not limited thereto.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (17)

1. A neural network training method, the method comprising:
acquiring a first neural network model and a second neural network model;
according to the first neural network model and the second neural network model, performing joint training of double-network knowledge distillation on the super network in the sub-network dimension of the super network and the network-layer dimension of the sub-networks, to obtain a target neural network; wherein the target neural network is the trained super network.
2. The method of claim 1, wherein the first neural network model is a static network and the second neural network model is a dynamic network comprising hybrid computing units.
3. The method of claim 2, wherein the performing joint training of double-network knowledge distillation on the super network in the sub-network dimension of the super network and the network-layer dimension of the sub-networks according to the first neural network model and the second neural network model to obtain the target neural network comprises:
obtaining a first loss function in forward propagation of the first neural network model;
obtaining a second loss function in forward propagation of the second neural network model;
obtaining, according to the first loss function and the second loss function, a third loss function corresponding to the joint training of double-network knowledge distillation;
and performing joint training of double-network knowledge distillation on the super network in the sub-network dimension of the super network and the network-layer dimension of the sub-networks through back propagation of the third loss function, to obtain the target neural network.
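The claims do not fix the exact form of the third loss function, so the following is only a minimal PyTorch-style sketch of one training step in the spirit of claim 3, assuming the third loss is a weighted sum of the two task losses plus a softened KL-divergence distillation term between the two networks' outputs; the weighting coefficient alpha, the temperature, and the single optimizer holding both networks' parameters are illustrative assumptions.

import torch
import torch.nn.functional as F

def joint_distillation_step(static_net, dynamic_net, x, labels, optimizer,
                            alpha=0.5, temperature=4.0):
    # Forward propagation of the first (static) and second (dynamic) models.
    logits_a = static_net(x)
    logits_b = dynamic_net(x)

    # First and second loss functions (task losses of the two models).
    loss_a = F.cross_entropy(logits_a, labels)
    loss_b = F.cross_entropy(logits_b, labels)

    # Assumed distillation term between the two networks' softened outputs.
    kl = F.kl_div(F.log_softmax(logits_b / temperature, dim=1),
                  F.softmax(logits_a / temperature, dim=1),
                  reduction="batchmean") * temperature ** 2

    # Third loss: combination of the first and second losses (plus the
    # assumed KL term); its back propagation drives the joint training.
    loss = alpha * (loss_a + loss_b) + (1.0 - alpha) * kl

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()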
4. The method of claim 3, wherein each sub-network in the target neural network has the same processing performance as the target neural network.
5. The method of claim 2, wherein the hybrid computing unit is a hybrid unit comprising at least two operator layers;
wherein feature data is input into the hybrid computing unit, fusion processing is performed in the hybrid computing unit through the at least two operator layers to obtain a fusion processing result, and the fusion processing result is used as the output of the hybrid computing unit.
6. The method of claim 5, wherein the at least two operator layers comprise: convolution layers with different convolution kernels or different convolution operation modes.
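For illustration only, the sketch below shows one possible hybrid computing unit in the sense of claims 5 and 6: two convolution operator layers with different kernel sizes process the same feature data and their outputs are fused into a single result. The choice of element-wise summation as the fusion operation and the specific kernel sizes are assumptions, not taken from the claims.

import torch
import torch.nn as nn

class HybridComputingUnit(nn.Module):
    # A hybrid unit comprising at least two operator layers whose results
    # are fused; summation is an assumed fusion operation.
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.branch3x3 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.branch5x5 = nn.Conv2d(in_channels, out_channels, kernel_size=5, padding=2)

    def forward(self, features):
        # Each operator layer receives the same input feature data.
        out_a = self.branch3x3(features)
        out_b = self.branch5x5(features)
        # Fusion processing; the fused result is the unit's output.
        return out_a + out_b

# Example: fuse a 3x3 and a 5x5 convolution over a 16-channel feature map.
unit = HybridComputingUnit(16, 32)
output = unit(torch.randn(1, 16, 28, 28))   # shape: (1, 32, 28, 28)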
7. A method for network searching, comprising:
initiating a search request, wherein the search request represents an operation request for searching a target neural network under a computing power constraint, and the target neural network is a trained super network obtained by the method of any one of claims 1 to 6;
and responding to the search request to obtain a sub-network in the target neural network that satisfies the computing power constraint.
8. The method of claim 7, further comprising:
acquiring sub-network configuration information corresponding to the sub-network;
adapting the sub-network for different hardware processing platforms according to the sub-network configuration information.
9. The method of claim 8, wherein the sub-network configuration information comprises: a sub-network structure and sub-network weights;
wherein the sub-network structure comprises: at least one of the number of channels, the convolution kernel size, and the network depth of each network layer constituting the sub-network.
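As a rough sketch of the search described in claims 7 to 9, the code below samples candidate sub-network structures (number of channels, convolution kernel size, network depth) from an assumed search space and keeps the largest candidate whose estimated cost stays within the computing power constraint. The search space, the cost proxy, and random sampling are all illustrative assumptions; the claims do not prescribe a particular search strategy.

import random

# Assumed search space of sub-network structures within the super network.
SEARCH_SPACE = {
    "channels":    [32, 64, 96, 128],
    "kernel_size": [3, 5, 7],
    "depth":       [2, 3, 4],
}

def estimate_cost(cfg):
    # Very rough compute-cost proxy, not a real FLOPs counter.
    return cfg["channels"] ** 2 * cfg["kernel_size"] ** 2 * cfg["depth"]

def search_subnet(max_cost, num_samples=1000, seed=0):
    # Respond to a search request: return a sub-network structure that
    # satisfies the computing power constraint, preferring the largest found.
    rng = random.Random(seed)
    best = None
    for _ in range(num_samples):
        cfg = {key: rng.choice(values) for key, values in SEARCH_SPACE.items()}
        cost = estimate_cost(cfg)
        if cost <= max_cost and (best is None or cost > best[1]):
            best = (cfg, cost)
    return best  # (sub-network structure, estimated cost) or None

print(search_subnet(max_cost=1_000_000))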
10. A method of data identification, the method comprising:
inputting data into a target neural network, wherein the target neural network is a trained super network obtained by the method of any one of claims 1 to 6;
identifying the data according to the target neural network to obtain target data;
wherein the data comprises: at least one of image data, video data, text data, and voice data.
11. A method of data identification, the method comprising:
inputting data into a target neural network, wherein the target neural network is a trained super network obtained by the method of any one of claims 1 to 6;
determining a sub-network in the target neural network that satisfies a computing power constraint;
identifying the data according to the sub-network to obtain target data;
wherein the data comprises: at least one of image data, video data, text data, and voice data.
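The claims leave open how the selected sub-network is materialised and executed. The snippet below is a hypothetical sketch of the identification flow of claim 11: supernet.activate_subnet is an assumed, made-up interface for obtaining a sub-network that satisfies the computing power constraint, and the final argmax assumes a classification-style identification task over image, video, text, or voice data already converted to a tensor.

import torch

@torch.no_grad()
def identify(data, supernet, subnet_cfg):
    # activate_subnet is a hypothetical method, not a real library call:
    # it stands for whatever mechanism extracts the constrained sub-network.
    subnet = supernet.activate_subnet(subnet_cfg)
    subnet.eval()
    logits = subnet(data)          # data: pre-processed input tensor
    return logits.argmax(dim=1)    # target data, e.g. predicted class ids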
12. An apparatus for neural network training, the apparatus comprising:
an acquisition unit, configured to acquire a first neural network model and a second neural network model;
a joint training unit, configured to perform joint training of double-network knowledge distillation on the super network in the sub-network dimension of the super network and the network-layer dimension of the sub-networks according to the first neural network model and the second neural network model, to obtain a target neural network; wherein the target neural network is the trained super network.
13. A network search apparatus, comprising:
a search unit, configured to initiate a search request, wherein the search request represents an operation request for searching a target neural network under a computing power constraint, and the target neural network is a trained super network obtained by the method of any one of claims 1 to 6;
and a response unit, configured to obtain, in response to the search request, a sub-network in the target neural network that satisfies the computing power constraint.
14. A data identification apparatus, the apparatus comprising:
a first input unit, configured to input data into a target neural network, wherein the target neural network is a trained super network obtained by the method of any one of claims 1 to 6;
a first identification unit, configured to identify the data according to the target neural network to obtain target data;
wherein the data comprises: at least one of image data, video data, text data, and voice data.
15. A data identification apparatus, the apparatus comprising:
a second input unit, configured to input data into a target neural network, wherein the target neural network is a trained super network obtained by the method of any one of claims 1 to 6;
a determining unit, configured to determine a sub-network in the target neural network that satisfies a computing power constraint;
a second identification unit, configured to identify the data according to the sub-network to obtain target data;
wherein the data comprises: at least one of image data, video data, text data, and voice data.
16. A computing device, comprising: a processor, configured to invoke and execute a computer program from a memory, so that the computing device performs the method of any one of claims 1 to 6.
17. A computer-readable storage medium storing a computer program which, when executed by an apparatus, causes the apparatus to perform the method of any one of claims 1 to 6.
CN202211517866.0A 2022-11-30 2022-11-30 Neural network training method and device, computing equipment and storage medium Pending CN115983372A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211517866.0A CN115983372A (en) 2022-11-30 2022-11-30 Neural network training method and device, computing equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211517866.0A CN115983372A (en) 2022-11-30 2022-11-30 Neural network training method and device, computing equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115983372A true CN115983372A (en) 2023-04-18

Family

ID=85961896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211517866.0A Pending CN115983372A (en) 2022-11-30 2022-11-30 Neural network training method and device, computing equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115983372A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116595999A (en) * 2023-07-17 2023-08-15 深圳须弥云图空间科技有限公司 Machine translation model training method and device
CN116595999B (en) * 2023-07-17 2024-04-16 深圳须弥云图空间科技有限公司 Machine translation model training method and device

Similar Documents

Publication Publication Date Title
US11449744B2 (en) End-to-end memory networks for contextual language understanding
WO2018102240A1 (en) Joint language understanding and dialogue management
CN111382868A (en) Neural network structure search method and neural network structure search device
CN111709493B (en) Object classification method, training device, object classification equipment and storage medium
CN113505883A (en) Neural network training method and device
US20230401833A1 (en) Method, computer device, and storage medium, for feature fusion model training and sample retrieval
CN111597825B (en) Voice translation method and device, readable medium and electronic equipment
CN112214677B (en) Point of interest recommendation method and device, electronic equipment and storage medium
CN112580733B (en) Classification model training method, device, equipment and storage medium
CN112270200B (en) Text information translation method and device, electronic equipment and storage medium
CN111428854A (en) Structure searching method and structure searching device
CN114582329A (en) Voice recognition method and device, computer readable medium and electronic equipment
CN113190872B (en) Data protection method, network structure training method, device, medium and equipment
CN114187459A (en) Training method and device of target detection model, electronic equipment and storage medium
CN115439449B (en) Full-field histological image processing method, device, medium and electronic equipment
CN112786069A (en) Voice extraction method and device and electronic equipment
CN115983372A (en) Neural network training method and device, computing equipment and storage medium
CN117371508A (en) Model compression method, device, electronic equipment and storage medium
CN117290477A (en) Generating type building knowledge question-answering method based on secondary retrieval enhancement
CN115660116A (en) Sparse adapter-based federated learning method and system
CN114781499A (en) Method for constructing ViT model-based intensive prediction task adapter
CN112052865A (en) Method and apparatus for generating neural network model
CN113766633A (en) Data processing method, data processing device, electronic equipment and storage medium
CN112966592A (en) Hand key point detection method, device, equipment and medium
CN115457365B (en) Model interpretation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination