CN116468072A - Deep learning model deployment method, device and equipment

Deep learning model deployment method, device and equipment

Info

Publication number
CN116468072A
Authority
CN
China
Prior art keywords
deep learning
learning model
network
layer
model
Prior art date
Legal status
Pending
Application number
CN202310449351.XA
Other languages
Chinese (zh)
Inventor
张长浩
申书恒
傅欣艺
王维强
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202310449351.XA
Publication of CN116468072A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiments of this specification disclose a deep learning model deployment method, apparatus, and device. The method defines, downward from an original deep learning model to be compressed, a search space for neural architecture search (NAS); it then searches that space for a target subnetwork to serve as the compressed model of the original deep learning model and deploys the compressed model to a target device.

Description

Deep learning model deployment method, device and equipment
Technical Field
The present document relates to the field of computer technologies, and in particular, to a method, an apparatus, and a device for deploying a deep learning model.
Background
As deep learning models grow in size, their structures carry substantial redundancy. When a model serves a high load of queries per second (QPS), a large model is expensive to run and wastes considerable computing resources.
To reduce inference overhead as much as possible before a deep learning model goes online, the model must be compressed. Traditional compression approaches include quantization and pruning as well as knowledge distillation, but both depend heavily on expert experience and offer limited room to shrink the model, so the compression effect is often unsatisfactory and serious waste of computing resources persists after the model goes online.
Disclosure of Invention
The embodiments of this specification provide a deep learning model deployment method, apparatus, and device to address the serious waste of computing resources after a model goes online that results from unsatisfactory model compression in the related art.
In order to solve the above technical problems, the embodiments of this specification are implemented as follows.
In a first aspect, a deep learning model deployment method is provided, including:
acquiring an original deep learning model to be compressed;
defining a search space for neural architecture search (NAS) based on the original deep learning model, wherein, in the search space, the largest structure candidate corresponding to a layer is the structure of that layer in the original deep learning model;
searching the search space for a target subnetwork to serve as a compressed model of the original deep learning model; and
deploying the compressed model to a target device.
In a second aspect, a deep learning model deployment apparatus is provided, including:
an original model acquisition module that acquires an original deep learning model to be compressed;
a search space definition module that defines a search space for neural architecture search (NAS) based on the original deep learning model, wherein, in the search space, the largest structure candidate corresponding to a layer is the structure of that layer in the original deep learning model;
a model compression module that searches the search space for a target subnetwork to serve as a compressed model of the original deep learning model; and
a model deployment module that deploys the compressed model to a target device.
In a third aspect, an electronic device is provided, including:
a processor; and
a memory arranged to store computer-executable instructions that, when executed, cause the processor to:
acquire an original deep learning model to be compressed;
define a search space for neural architecture search (NAS) based on the original deep learning model, wherein, in the search space, the largest structure candidate corresponding to a layer is the structure of that layer in the original deep learning model;
search the search space for a target subnetwork to serve as a compressed model of the original deep learning model; and
deploy the compressed model to a target device.
In a fourth aspect, a computer-readable storage medium is provided, storing one or more programs that, when executed by an electronic device including a plurality of application programs, cause the electronic device to:
acquire an original deep learning model to be compressed;
define a search space for neural architecture search (NAS) based on the original deep learning model, wherein, in the search space, the largest structure candidate corresponding to a layer is the structure of that layer in the original deep learning model;
search the search space for a target subnetwork to serve as a compressed model of the original deep learning model; and
deploy the compressed model to a target device.
In at least one technical solution provided by the embodiments of this specification, the NAS search space is defined downward from the original deep learning model to be compressed: in the search space, the largest structure candidate corresponding to a layer is the structure of that layer in the original deep learning model, and the remaining candidates for that layer are smaller than that structure. Searching the space for a target subnetwork as the compressed model therefore completes optimal, maximal compression of the original deep learning model automatically; deploying the compressed model to the target device naturally minimizes the waste of computing resources (such as memory and run time).
Drawings
The accompanying drawings, which are included to provide a further understanding of this specification, illustrate exemplary embodiments of the specification and, together with their description, explain the specification without unduly limiting it. In the drawings:
fig. 1 is a schematic flow chart of a deep learning model deployment method according to an embodiment of the present disclosure.
Fig. 2 is a schematic diagram of a search space defined by a deep learning model deployment method according to an embodiment of the present disclosure.
Fig. 3 is a schematic diagram of a bottom-up early-stop training mechanism of structural parameters in a deep learning model deployment method according to an embodiment of the present disclosure.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Fig. 5 is a schematic structural diagram of a deep learning model deployment device according to an embodiment of the present disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
To address the related-art problem that model compression depends heavily on expert experience and leaves limited room to shrink the model, so that the compression effect is unsatisfactory and computing resources are still seriously wasted after the model goes online, the embodiments of this specification provide a deep learning model deployment method and apparatus, which can be executed by an electronic device or by software or hardware installed in the electronic device. The electronic device here may include, but is not limited to, a terminal device or a server; the terminal device may be, for example, a personal computer (PC) or a notebook computer, and the server includes, but is not limited to, any one of a single server, multiple servers, a server cluster, or a cloud server.
It should be noted that the deep learning model in the embodiments of this specification may be any deep learning model implementing a corresponding function, for example a model implementing any of network-transaction risk identification (including network transfers, with risks such as money laundering or fraud), two-dimensional-code recognition, or face recognition. Deep learning models implementing different functions (applied to different scenarios) may use different training data (training sets). For example, for a model for network-transaction risk identification, the training data may include, but is not limited to, historical transaction data accumulated by a user on a network transaction platform, operation sequences, event sequences, NLP dialogues, the user's basic information (such as gender, address, and age), website status, transaction address information, IP address, transaction device information, whether the transaction device has been attacked, the transaction device's address, statistical features related to the user, and so on. For a face recognition model, the training data may be face images; for a two-dimensional-code recognition model, the training data may be the two-dimensional codes themselves or pictures containing them. That is, the deep learning model deployment method provided by the embodiments of this specification is suitable for deep learning model deployment in different application scenarios, and addresses the unsatisfactory compression and serious resource waste common to such deployments, without being limited to a specific application scenario.
The following describes a deep learning model deployment method provided in the embodiments of the present disclosure.
As shown in fig. 1, a deep learning model deployment method according to an embodiment of the present disclosure may include:
step 102, obtaining an original deep learning model to be compressed.
The original deep learning model may be a trained model that is relatively large and structurally redundant, and therefore needs to be compressed. In step 102, the original deep learning model may be read from its storage location.
Step 104, defining a search space for neural architecture search (NAS) based on the original deep learning model, wherein, in the search space, the largest structure candidate corresponding to a layer is the structure of that layer in the original deep learning model.
The purpose of NAS is to provide an algorithm or framework that automatically finds the best neural architecture on demand. In general, NAS comprises three major parts: the search space, the search strategy, and the performance estimation strategy. The search space is the candidate set of network structures to be searched (hereinafter indicated by the structure candidates corresponding to each layer); equivalently, it is everything that can be adjusted when selecting a neural architecture, such as the convolution kernel size, the channel size, the convolution type, and the number of layers. The search strategy is the way the best neural architecture is sought within a given search space, for example grid search and random search, which are best known from hyperparameter search, or genetic (evolutionary) algorithms. The performance estimation strategy is how a neural architecture selected from the search space is evaluated; for example, the architecture can actually be trained to obtain its real accuracy.
The embodiments of this specification aim to design the search space downward, with the structure of each layer in the original deep learning model as an upper limit, so that the structure candidates corresponding to any layer of the search space are subsets of that layer's structure in the original model: the largest candidate for a layer is the layer's original structure, and the remaining candidates for that layer are smaller. By searching this space for a target subnetwork as the compressed model, optimal, maximal compression of the original deep learning model can be completed automatically.
Fig. 2 is a schematic diagram of a search space defined by a deep learning model deployment method according to an embodiment of the present disclosure. Suppose the search space defined based on the original deep learning model includes m layers, of which layers i-1, i, and i+1 are shown in Fig. 2. As shown, layer i corresponds to candidates F1, F2, F3, …, Fn, where candidate F1 is a 3×3 convolution, candidate F2 is a 5×5 convolution, candidate F3 is identity (no operation), and candidate Fn is 3×3 pooling. Note that m and n are integers greater than zero.
Optionally, after defining the structure candidates in the search space, the embodiments of this specification further define corresponding structural parameters for those candidates; the structural parameters may include, but are not limited to, the selection probabilities of the structure candidates. For example, as shown in Fig. 2, for the structure candidates F1, F2, F3, …, Fn corresponding to layer i, the structural parameters may be defined as ω1, ω2, ω3, …, ωn. It will be appreciated that the greater a candidate's selection probability, the more likely that candidate is to be chosen when the target subnetwork is later selected. That is, the selection of structure candidates for each layer in the search space can be indicated by defining structural parameters (e.g., selection probabilities) for that layer's candidates.
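As an illustration of such a layer, the following is a minimal PyTorch-style sketch (an assumption for illustration only: the patent prescribes no framework, and the class name MixedLayer and the logits-plus-softmax parameterization of the selection probabilities are hypothetical; the patent defines the probabilities ω1…ωn directly):

import torch
import torch.nn as nn

class MixedLayer(nn.Module):
    # One searchable layer: the largest candidate is the layer's original
    # structure; the other candidates are strictly smaller. Assumes
    # in_ch == out_ch so every candidate preserves the tensor shape.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.candidates = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),  # F1: 3x3 convolution
            nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2),  # F2: 5x5 convolution
            nn.Identity(),                                       # F3: identity (no operation)
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),    # Fn: 3x3 pooling
        ])
        # Structural parameters: one logit per candidate; softmax turns
        # them into the selection probabilities omega_1 ... omega_n.
        self.alpha = nn.Parameter(torch.zeros(len(self.candidates)))

    def forward(self, x):
        omega = torch.softmax(self.alpha, dim=0)
        # A weighted sum over candidates keeps the layer differentiable
        # with respect to the structural parameters.
        return sum(w * op(x) for w, op in zip(omega, self.candidates))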
And step 106, searching a target sub-network from the search space to serve as a compression model of the original deep learning model.
In step 106, any existing search strategy may be used to find a target subnetwork whose model effect is comparable to that of the original deep learning model but whose size is smaller, to serve as the compressed model of the original deep learning model.
It can be appreciated that searching the search space for a subnetwork with a search strategy is in fact a network training process. Because conventional NAS training is time- and labor-consuming, one-shot NAS was later proposed: it folds all neural architectures in the search space into one super network (supernet), which greatly reduces training time. Thus, as an example, the NAS adopted in the embodiments of this specification may be one-shot NAS. On this basis, step 106 may include:
constructing a super network based on the search space;
training the super network based on a training set;
and selecting a target subnetwork from the converged super network as the compressed model of the original deep learning model.
Likewise, in the super network, the largest structure candidate corresponding to a layer is the structure of that layer in the original deep learning model, and the remaining candidates for that layer are smaller; that is, the structure candidates corresponding to any layer of the super network are subsets of that layer's structure in the original deep learning model.
Optionally, so that the compressed model has a model effect comparable to the original deep learning model, in the embodiments of this specification the target subnetwork has the same number of layers as the original deep learning model, but the structure of at least one layer of the target subnetwork is smaller than that layer's structure in the original deep learning model.
The goal in training the supernet differs from conventional NAS, which searches for a better-performing architecture; here the idea of NAS is applied to an existing original deep learning model with an existing effect, in order to optimize model inference performance to the greatest extent. The output also differs: compression yields a single model, whereas conventional one-shot NAS outputs thousands of submodels, which is inconsistent with the goal of compression. To serve that goal, the embodiments of this specification add training of structural parameters alongside the non-structural parameters (also called network parameters), so that the model structure itself becomes a parameter that can converge; different structural parameters correspond to different models, and once the structural parameters converge, the model structure is determined.
Based on this, training the super network based on the training set may include: and training the structural parameters and the non-structural parameters of the super network alternately based on the training set. Alternatively, the training set for training the super-network and the training set for training the original deep learning model may be the same training set.
It can be understood that the structural parameters and the non-structural parameters of the super network are alternately trained, so that the structural convergence and the network parameter convergence can be separated, and the training difficulty is reduced.
Specifically, the alternate training of the structural parameters and the non-structural parameters of the super network based on the training set may include: alternately performing a first specifying step and a second specifying step until both the structural parameters and the non-structural parameters converge.
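To make the alternation concrete, here is a schematic Python sketch (a stand-in under assumptions: first_step and second_step are the two specifying steps described below, the helper names snapshot and train_supernet are hypothetical, and a simple parameter-change test stands in for the convergence criteria, which the text above does not specify):

import torch

def snapshot(params):
    # Flatten and copy a parameter group for a simple change-based test.
    return torch.cat([p.detach().flatten().clone() for p in params])

def train_supernet(supernet, loader, arch_params, net_params,
                   first_step, second_step, rounds=100, tol=1e-4):
    prev_a, prev_w = snapshot(arch_params), snapshot(net_params)
    for _ in range(rounds):
        first_step(supernet, loader)    # train structural parameters of sampled subnetwork(s)
        second_step(supernet, loader)   # train non-structural parameters of sampled subnetworks
        cur_a, cur_w = snapshot(arch_params), snapshot(net_params)
        if (cur_a - prev_a).abs().max() < tol and (cur_w - prev_w).abs().max() < tol:
            break                       # both parameter groups have converged
        prev_a, prev_w = cur_a, cur_w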
Wherein the first specifying step may include:
sampling at least one sub-network from the super-network according to a first sampling rule;
the structural parameters of the at least one sub-network are trained based on the training set.
The first sampling rule may be a fair sampling rule, i.e. the probability of being sampled is the same for different sub-networks, e.g. a random sampling rule. It can be understood that the fair sampling rule is adopted for different sub-networks, so that the structural parameters of different structural alternatives can be given a fair training opportunity, and the structural parameters obtained by final training have better robustness.
Optionally, the first specifying step may further include:
determining a loss value of the at least one subnetwork based on a first loss function, wherein the first loss function includes the expectation of a preset operating parameter of the at least one subnetwork when running on the target device;
Determining a gradient of the at least one subnetwork based on the loss value;
the structural parameters of the at least one sub-network are updated based on the gradient.
The preset operating parameter may include, but is not limited to, at least one of run time, memory usage, energy consumption, and computation speed (floating-point operations per second, FLOPS).
In the embodiments of this specification, the expectation of the subnetwork's preset operating parameter on the target device is introduced into the loss function used to drive convergence of the structural parameters (the first loss function) so that the resulting compressed model performs better on the target device's hardware: for example, it runs in less time, occupies less memory, and computes faster on the target device. The compression scheme provided by the embodiments of this specification therefore has hardware-awareness: it not only compresses the model but also makes the compressed model perform better on hardware.
As an example, the first loss function may be:
Loss = Loss_CE + λ1×‖w‖ + λ2×E[X]
where Loss_CE denotes the cross-entropy loss, ‖w‖ denotes the L2 norm of the non-structural parameters, X denotes the preset operating parameter, E[X] denotes the expectation of the preset operating parameter, and λ1 and λ2 are harmonic coefficients.
Further, when the structural parameters include the selection probabilities of the structure candidates, E[X] is calculated as:
E[X] = ω1×F1 + ω2×F2 + … + ωi×Fi + … + ωn×Fn
where Fi denotes the preset operating parameter of the i-th structure candidate corresponding to a layer, ωi denotes the selection probability of the i-th structure candidate corresponding to that layer, and i = 1, 2, …, n.
As an example, if the preset operating parameter is run time, the first loss function may be expressed as:
Loss = Loss_CE + λ1×‖w‖ + λ2×E[Latency]
where E[Latency] denotes the expectation of run time, i.e., a penalty on hardware time added to the loss function, calculated as:
E[Latency] = ω1×F1 + ω2×F2 + … + ωi×Fi + … + ωn×Fn
Taking layer i in Fig. 2 as an example, E[Latency] may specifically be calculated as:
E[Latency] = ω1×F(CONV_3×3) + ω2×F(CONV_5×5) + ω3×F(identity) + … + ωn×F(POOL_3×3)
in specific implementation, the structure alternatives corresponding to each layer in the search space can be modularized, corresponding preset operation parameter predictors are customized for different operators (ops) in the structure alternatives, and the preset operation parameters of the different operators are obtained by testing on different target devices, for example, time consumption predictors are customized for the different operators, and the operation time consumption of the different operators is obtained by testing on different target devices. And adding the preset operation parameters of the operators contained in the different structure alternatives to obtain the preset operation parameters of the structure alternatives.
In addition, the embodiments of this specification use the gradient as the evaluation index for updating the subnetwork's structural parameters. On one hand, making the structural parameters differentiable lets them be updated from both the subnetwork's model effect (reflected in the first two summation terms of the first loss function) and the hardware regularization information (reflected in the third summation term); together with fair sampling, this yields more excellent and more robust structural parameters, completes the selection of structure candidates during supernet training, and finally produces the model with the best hardware-aware performance. On the other hand, gradient-based optimization is faster, so the convergence speed of the network is improved.
Furthermore, the embodiments of this specification use the gradient of the structural parameters to judge the relative merit of different structure candidates: a layer generally contains multiple candidates, and the larger the gradient corresponding to a candidate, the better that candidate and the higher its probability of being selected.
Of course, besides using the gradient as the evaluation index for updating the structural parameters of the subnetwork, the structural parameters may also be updated by Bayesian optimization or reinforcement learning; the embodiments of this specification do not limit the specific form of the evaluation index.
Optionally, in order to increase the convergence speed of the structure of the super network, training the structural parameter of the at least one sub-network based on the training set may include:
training the structural parameters of each layer in at least one sub-network according to the sequence from bottom to top, wherein in one sub-network, the layer closer to the output layer is the lower layer, and the layer closer to the input layer is the upper layer;
and under the condition that the structural parameters of the lower layer in the at least one sub-network are converged, fixing the structural parameters of the lower layer, and continuing training the structural parameters of the upper layer of the lower layer.
For example, as shown in Fig. 3, for a subnetwork, the structural parameters may be trained gradually from bottom to top: once the structural parameters of layer i+1 converge, they are fixed, and then the structural parameters of layer i, the layer above layer i+1, are trained and fixed in turn. Accordingly, the convergence state of layer i+1's structural parameters is "fixed", that of layer i's is "currently to be fixed", and that of layer i-1's is "unfixed"; the other layers are handled likewise and are not repeated here.
This training mode can be regarded as a greedy early-stop mechanism, which greatly reduces the training workload and thus further improves the network convergence speed.
It is easy to see that the embodiments of this specification combine gradient-based and greedy approaches with the introduction of structural parameters, so that the optimal structural parameters are found during training and the structure search can be completed quickly.
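A minimal sketch of this early-stop bookkeeping follows (assumptions: layers are stored bottom-first as in Fig. 3, each holds its structural logits in alpha as in the earlier sketch, the helper name advance_freeze_pointer is hypothetical, and a gradient-magnitude threshold stands in for the convergence test, which the text does not specify):

def advance_freeze_pointer(layers, ptr, tol=1e-3):
    # `layers` is ordered from the output side (bottom) to the input side
    # (top). Once the structural logits of the current layer stop moving,
    # fix them and shift training effort to the layer above.
    if ptr < len(layers):
        alpha = layers[ptr].alpha
        if alpha.grad is not None and alpha.grad.abs().max() < tol:
            alpha.requires_grad_(False)   # "fixed": the lower layer is frozen
            ptr += 1                      # the next layer up is "currently to be fixed"
    return ptr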
Wherein the second specifying step may include:
sampling a plurality of sub-networks from the super-network according to a second sampling rule;
the non-structural parameters of the plurality of subnetworks are trained based on the training set.
The second sampling rule may be a fair sampling rule or an unfair sampling rule; an unfair rule may be designed according to the relevant needs, for example so that larger subnetworks get more training iterations. For details, reference may be made to the related art, which is not repeated here.
Optionally, the second specifying step may further include:
determining a loss value for the plurality of subnetworks based on a second loss function;
and updating the non-structural parameters of the plurality of subnetworks based on the loss values.
The second loss function may be:
Loss = Loss_CE + λ1×‖w‖
where Loss_CE denotes the cross-entropy loss and ‖w‖ denotes the L2 norm of the non-structural parameters.
It should be noted that, for the specific process of training the non-structural parameters and determining their convergence, reference may be made to the related art; a detailed description is omitted here.
When the structural parameters include the selection probabilities of the structure candidates, selecting a target subnetwork from the converged super network as the compressed model of the original deep learning model may include: selecting, from each layer of the converged super network, the structure candidate with the highest selection probability to form the target subnetwork; and taking the target subnetwork as the compressed model of the original deep learning model.
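Continuing the earlier sketches, extraction of the target subnetwork can look as follows (assuming the super network is a sequence of MixedLayer modules as sketched above; the helper name extract_target_subnetwork is hypothetical, while the per-layer argmax over selection probabilities is taken from the text):

import torch
import torch.nn as nn

def extract_target_subnetwork(supernet_layers):
    # Keep, for every layer, only the candidate with the highest selection
    # probability; the resulting network is the compressed model.
    chosen = []
    for layer in supernet_layers:
        probs = torch.softmax(layer.alpha.detach(), dim=0)
        chosen.append(layer.candidates[int(torch.argmax(probs))])
    return nn.Sequential(*chosen)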
It can be seen that, with the search space defined downward from the original deep learning model, the embodiments of this specification complete optimal, maximal compression by setting an optimization target for the structural parameters (driven by the loss function during training).
Step 108, deploying the compressed model to the target device.
The target device may be a physical device (such as a terminal device) or a virtual device (such as a cloud server with limited resources).
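As one possible deployment path (an assumption: the patent prescribes no serialization format or runtime, and the variable supernet_layers and the input shape below are illustrative, carried over from the earlier sketches), the compressed model could be exported for the target device like this:

import torch

compressed = extract_target_subnetwork(supernet_layers)  # from the sketch above
compressed.eval()
example = torch.randn(1, 16, 32, 32)                     # assumed input shape
torch.onnx.export(compressed, example, "compressed_model.onnx",
                  input_names=["input"], output_names=["output"])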
In the deep learning model deployment method provided by the embodiments of this specification, the NAS search space can be defined downward from the original deep learning model to be compressed: the largest structure candidate corresponding to a layer of the search space is the structure of that layer in the original deep learning model, and the remaining candidates for that layer are smaller. Searching the space for a target subnetwork as the compressed model therefore completes optimal, maximal compression of the original deep learning model automatically; deploying the compressed model to the target device naturally minimizes the waste of computing resources (such as memory and run time).
In summary, what makes compression possible in the deep learning model deployment method provided by the embodiments of this specification is, in essence, the downward definition of the search space.
In addition, compared with traditional model compression schemes, the deep learning model deployment method provided by the embodiments of this specification: indicates the selection of each layer's structure candidates by defining structural parameters for them, so as to find the optimal target subnetwork; updates the subnetwork's structural parameters in a gradient-based way, making the updates more accurate and objective and accelerating their convergence, which the bottom-up greedy strategy accelerates further; and adds hardware-performance awareness to the compression process by introducing a regularization term about hardware information into the loss function, so that the penalty on hardware information is reflected in the structural parameters and the searched target subnetwork has better hardware-aware performance, i.e., better matches the performance of the hardware.
The method provided by the present specification is described above, and the electronic device provided by the present specification is described below.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. Referring to Fig. 4, at the hardware level the electronic device includes a processor and, optionally, an internal bus, a network interface, and a memory. The memory may include internal memory, such as random-access memory (RAM), and may further include non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.
The processor, the network interface, and the memory may be interconnected by the internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The buses may be classified into address buses, data buses, control buses, and so on. For ease of illustration, only one bidirectional arrow is shown in Fig. 4, but this does not mean there is only one bus or one type of bus.
The memory is used to store the program. Specifically, the program may include program code, and the program code includes computer operation instructions. The memory may include internal memory and non-volatile storage, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs the computer program to form the deep learning model deployment device on a logic level. The processor is used for executing the programs stored in the memory and is specifically used for executing the following operations:
acquiring an original deep learning model to be compressed;
defining a search space for neural architecture search (NAS) based on the original deep learning model, wherein, in the search space, the largest structure candidate corresponding to a layer is the structure of that layer in the original deep learning model;
searching the search space for a target subnetwork to serve as a compressed model of the original deep learning model; and
deploying the compressed model to a target device.
The method disclosed in the embodiment shown in Fig. 1 of the present specification can be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps, and logic blocks disclosed in one or more embodiments of this specification may be implemented or executed accordingly. A general-purpose processor may be a microprocessor, or any conventional processor. The steps of a method disclosed in connection with one or more embodiments of this specification may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as random-access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
The electronic device may also execute the method provided in the embodiment shown in fig. 1, which is not described in detail in this specification.
Of course, in addition to the software implementation, the electronic device in this specification does not exclude other implementations, such as a logic device or a combination of software and hardware, that is, the execution subject of the following process is not limited to each logic unit, but may also be hardware or a logic device.
The present description also proposes a computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a portable electronic device comprising a plurality of application programs, enable the portable electronic device to perform the method of the embodiment of fig. 1, and in particular to perform the operations of:
acquiring an original deep learning model to be compressed;
defining a search space for neural architecture search (NAS) based on the original deep learning model, wherein, in the search space, the largest structure candidate corresponding to a layer is the structure of that layer in the original deep learning model;
searching the search space for a target subnetwork to serve as a compressed model of the original deep learning model; and
deploying the compressed model to a target device.
The apparatus provided in the embodiments of the present specification will be described below.
As shown in fig. 5, one embodiment of the present description provides a deep learning model deployment apparatus 500, and in a software implementation, the apparatus 500 may include: an original model acquisition module 501, a search space definition module 502, a model compression module 503, and a model deployment module 504.
The original model acquisition module 501 acquires an original deep learning model to be compressed.
The original deep learning model may be a trained deep learning model, but the deep learning model is relatively large, and has redundant parts in the structure, which needs to be compressed. The original model acquisition module 501 may read the original deep learning model from its storage location.
The search space definition module 502 defines a search space for neural architecture search (NAS) based on the original deep learning model, wherein, in the search space, the largest structure candidate corresponding to a layer is the structure of that layer in the original deep learning model.
The embodiments of this specification aim to design the search space downward, with the structure of each layer in the original deep learning model as an upper limit, so that the structure candidates corresponding to any layer of the search space are subsets of that layer's structure in the original model: the largest candidate for a layer is the layer's original structure, and the remaining candidates for that layer are smaller. By searching this space for a target subnetwork as the compressed model, optimal, maximal compression of the original deep learning model can be completed automatically.
Optionally, after defining the structure candidates in the search space, the embodiments of this specification further define corresponding structural parameters for those candidates; the structural parameters may include, but are not limited to, the selection probabilities of the structure candidates. It will be appreciated that the greater a candidate's selection probability, the more likely that candidate is to be chosen when the target subnetwork is later selected. That is, the selection of structure candidates for each layer in the search space can be indicated by defining structural parameters for that layer's candidates.
The model compression module 503 searches the search space for a target subnetwork as the compressed model of the original deep learning model.
Specifically, the model compression module 503 may use any existing search strategy to find a target subnetwork whose model effect is comparable to that of the original deep learning model but whose size is smaller, to serve as the compressed model of the original deep learning model.
It can be appreciated that searching the search space for a subnetwork with a search strategy is in fact a network training process. Because conventional NAS training is time- and labor-consuming, one-shot NAS was later proposed: it folds all neural architectures in the search space into one super network (supernet), which greatly reduces training time. Thus, as one example, the model compression module 503 may be used for:
constructing a super network based on the search space;
training the super network based on a training set; and
selecting a target subnetwork from the converged super network as the compressed model of the original deep learning model.
Likewise, in the super network, the largest structure candidate corresponding to a layer is the structure of that layer in the original deep learning model, and the remaining candidates for that layer are smaller; that is, the structure candidates corresponding to any layer of the super network are subsets of that layer's structure in the original deep learning model.
Optionally, so that the compressed model has a model effect comparable to the original deep learning model, in the embodiments of this specification the target subnetwork has the same number of layers as the original deep learning model, but the structure of at least one layer of the target subnetwork is smaller than that layer's structure in the original deep learning model.
Training the super network based on a training set may include: and training the structural parameters and the non-structural parameters of the super network alternately based on the training set. Alternatively, the training set for training the super-network and the training set for training the original deep learning model may be the same training set.
It can be understood that the structural parameters and the non-structural parameters of the super network are alternately trained, so that the structural convergence and the network parameter convergence can be separated, and the training difficulty is reduced.
Specifically, the training of the structural parameters and the non-structural parameters of the super network based on the training set may include: the first specifying step and the second specifying step are alternately performed until both the structural parameter and the non-structural parameter reach convergence.
Wherein the first specifying step may include:
sampling at least one sub-network from the super-network according to a first sampling rule;
the structural parameters of the at least one sub-network are trained based on the training set.
The first sampling rule may be a fair sampling rule, i.e. the probability of being sampled is the same for different sub-networks, e.g. a random sampling rule. It can be understood that the fair sampling rule is adopted for different sub-networks, so that the structural parameters of different structural alternatives can be given a fair training opportunity, and the structural parameters obtained by final training have better robustness.
Optionally, the first specifying step may further include:
determining a loss value of the at least one subnetwork based on a first loss function, wherein the first loss function includes the expectation of a preset operating parameter of the at least one subnetwork when running on the target device;
Determining a gradient of the at least one subnetwork based on the loss value;
the structural parameters of the at least one sub-network are updated based on the gradient.
The preset operating parameter may include, but is not limited to, at least one of run time, memory usage, energy consumption, and computation speed (floating-point operations per second, FLOPS).
As an example, the first loss function may be:
Loss = Loss_CE + λ1×‖w‖ + λ2×E[X]
where Loss_CE denotes the cross-entropy loss, ‖w‖ denotes the L2 norm of the non-structural parameters, X denotes the preset operating parameter, E[X] denotes the expectation of the preset operating parameter, and λ1 and λ2 are harmonic coefficients.
Further, when the structural parameters include the selection probabilities of the structure candidates, E[X] is calculated as:
E[X] = ω1×F1 + ω2×F2 + … + ωi×Fi + … + ωn×Fn
where Fi denotes the preset operating parameter of the i-th structure candidate corresponding to a layer, ωi denotes the selection probability of the i-th structure candidate corresponding to that layer, and i = 1, 2, …, n.
As an example, if the preset operating parameter is run time, the first loss function may be expressed as:
Loss = Loss_CE + λ1×‖w‖ + λ2×E[Latency]
where E[Latency] denotes the expectation of run time, i.e., a penalty on hardware time added to the loss function, calculated as:
E[Latency] = ω1×F1 + ω2×F2 + … + ωi×Fi + … + ωn×Fn
Optionally, in order to increase the convergence speed of the structure of the super network, training the structural parameter of the at least one sub-network based on the training set may include:
training the structural parameters of each layer in at least one sub-network according to the sequence from bottom to top, wherein in one sub-network, the layer closer to the output layer is the lower layer, and the layer closer to the input layer is the upper layer;
and under the condition that the structural parameters of the lower layer in the at least one sub-network are converged, fixing the structural parameters of the lower layer, and continuing training the structural parameters of the upper layer of the lower layer.
Wherein the second specifying step may include:
sampling a plurality of sub-networks from the super-network according to a second sampling rule;
the non-structural parameters of the plurality of subnetworks are trained based on the training set.
The second sampling rule may be a fair sampling rule or an unfair sampling rule; an unfair rule may be designed according to the relevant needs, for example so that larger subnetworks get more training iterations. For details, reference may be made to the related art, which is not repeated here.
Optionally, the second specifying step may further include:
Determining a loss value for the plurality of subnetworks based on a second loss function;
and updating the non-structural parameters of the plurality of subnetworks based on the loss values.
The second loss function may be:
Loss = Loss_CE + λ1×‖w‖
where Loss_CE denotes the cross-entropy loss and ‖w‖ denotes the L2 norm of the non-structural parameters.
It should be noted that, for the specific process of training the non-structural parameters and determining their convergence, reference may be made to the related art; a detailed description is omitted here.
When the structural parameters include the selection probabilities of the structure candidates, selecting a target subnetwork from the converged super network as the compressed model of the original deep learning model may include: selecting, from each layer of the converged super network, the structure candidate with the highest selection probability to form the target subnetwork; and taking the target subnetwork as the compressed model of the original deep learning model.
It can be seen that, with the search space defined downward from the original deep learning model, the embodiments of this specification complete optimal, maximal compression by setting an optimization target for the structural parameters (driven by the loss function during training).
The model deployment module 504 deploys the compressed model to the target device.
The target device may be a physical device (such as a terminal device) or a virtual device (such as a cloud server with limited resources).
In the deep learning model deployment apparatus provided by the embodiments of this specification, the NAS search space can be defined downward from the original deep learning model to be compressed: the largest structure candidate corresponding to a layer of the search space is the structure of that layer in the original deep learning model, and the remaining candidates for that layer are smaller. Searching the space for a target subnetwork as the compressed model therefore completes optimal, maximal compression of the original deep learning model automatically; deploying the compressed model to the target device naturally minimizes the waste of computing resources (such as memory and run time).
It should be noted that, the deep learning model deployment device 500 can implement a deep learning model deployment method provided in fig. 1, and can achieve the same technical effects, and details may refer to the description of the method embodiment section above, and will not be repeated.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments in part.
In summary, the foregoing description is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, or the like, which is within the spirit and principles of one or more embodiments of the present disclosure, is intended to be included within the scope of one or more embodiments of the present disclosure.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should be noted that the terms "first," "second," and the like in the description and in the claims are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the application are capable of operation in sequences other than those illustrated or otherwise described herein, and that the terms "first" and "second" are generally intended to be used in a generic sense and not to limit the number of objects, for example, the first character may be one or more.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.

Claims (16)

1. A deep learning model deployment method, comprising:
acquiring an original deep learning model to be compressed;
defining a search space for a neural architecture search (NAS) based on the original deep learning model, wherein, in the search space, the largest structure candidate corresponding to a layer is the structure of that layer in the original deep learning model;
searching the search space for a target sub-network to serve as a compressed model of the original deep learning model; and
deploying the compressed model to a target device.
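For orientation, the flow of claim 1 can be condensed into a short sketch. The following is a minimal, non-authoritative Python sketch; `define_space`, `search`, and `deploy` are hypothetical callables standing in for the steps elaborated in claims 2 to 11 and are not names used in this specification.

```python
# Minimal sketch of the claim 1 flow; the three callables are hypothetical
# placeholders for the steps detailed in the dependent claims.
def compress_and_deploy(original_model, train_set, target_device,
                        define_space, search, deploy):
    # Define the NAS search space "downward" from the original model: the
    # largest candidate of each layer is that layer's original structure.
    space = define_space(original_model)
    # Search the space for a target sub-network, i.e. the compressed model.
    compressed = search(space, train_set)
    # Deploy the compressed model to the target device.
    deploy(compressed, target_device)
    return compressed
```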
2. The method of claim 1, wherein searching the search space for a target sub-network to serve as a compressed model of the original deep learning model comprises:
constructing a super network based on the search space, wherein the largest structure candidate corresponding to a layer of the super network is the structure of that layer in the original deep learning model;
training the super network based on a training set; and
selecting a target sub-network from the converged super network to serve as the compressed model of the original deep learning model.
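By way of illustration, such a super network could be organized as below. This is a minimal sketch assuming PyTorch; the class name, the three width candidates, and the zero-padding used to keep layers composable are assumptions made for the example, not details disclosed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SuperLayer(nn.Module):
    """One searchable layer whose widest candidate equals the original layer."""
    def __init__(self, in_dim, original_out_dim):
        super().__init__()
        # Candidates are defined "downward": no width exceeds the width of
        # the corresponding layer in the original deep learning model.
        self.widths = [original_out_dim // 4, original_out_dim // 2,
                       original_out_dim]
        self.candidates = nn.ModuleList(nn.Linear(in_dim, w)
                                        for w in self.widths)
        # One structural logit per candidate; softmax(alpha) yields the
        # selection probabilities used during the search.
        self.alpha = nn.Parameter(torch.zeros(len(self.widths)))

    def forward(self, x, idx):
        y = self.candidates[idx](x)
        # Zero-pad narrower candidates so stacked layers stay composable.
        return F.pad(y, (0, self.widths[-1] - y.shape[-1]))
```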
3. The method of claim 2,
wherein the target sub-network has the same number of layers as the original deep learning model, and the structure of at least one layer of the target sub-network is smaller than the structure of that layer in the original deep learning model.
4. The method of claim 2, wherein training the super network based on the training set comprises:
alternately training structural parameters and non-structural parameters of the super network based on the training set.
5. The method of claim 4, wherein alternately training the structural parameters and the non-structural parameters of the super network based on the training set comprises:
alternately performing a first specifying step and a second specifying step;
wherein the first specifying step comprises:
sampling at least one sub-network from the super network according to a first sampling rule; and
training structural parameters of the at least one sub-network based on the training set;
and wherein the second specifying step comprises:
sampling a plurality of sub-networks from the super network according to a second sampling rule; and
training non-structural parameters of the plurality of sub-networks based on the training set.
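As a rough illustration of the alternation in claims 4 and 5, the search loop might look like the sketch below. PyTorch-style training is assumed; `sample_one` and `sample_many` are hypothetical stand-ins for the first and second sampling rules, which the claims leave open, and the super network is assumed to accept a sampled architecture as a second argument.

```python
def alternate_train(supernet, loader, steps, sample_one, sample_many,
                    struct_opt, weight_opt, loss_fn):
    batches = iter(loader)
    for _ in range(steps):
        # First specifying step: update the structural parameters.
        x, y = next(batches)
        arch = sample_one(supernet)            # first sampling rule
        struct_opt.zero_grad()
        loss_fn(supernet(x, arch), y).backward()
        struct_opt.step()

        # Second specifying step: update the non-structural parameters of
        # a plurality of sampled sub-networks.
        for arch in sample_many(supernet):     # second sampling rule
            x, y = next(batches)
            weight_opt.zero_grad()
            loss_fn(supernet(x, arch), y).backward()
            weight_opt.step()
```

Here `struct_opt` would hold only the structural logits (e.g. the `alpha` tensors of the earlier sketch) and `weight_opt` only the ordinary weights, so each step touches one parameter group.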
6. The method of claim 5, wherein training the structural parameters of the at least one sub-network based on the training set comprises:
training the structural parameters of each layer of the at least one sub-network in bottom-to-top order, wherein, within a sub-network, a layer closer to the output layer is a lower layer and a layer closer to the input layer is an upper layer; and
when the structural parameters of a lower layer of the at least one sub-network have converged, fixing the structural parameters of that lower layer and continuing to train the structural parameters of the layer above it.
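One possible reading of this schedule is sketched below. Note the claim's inverted convention: "lower" means closer to the output layer, so training proceeds from the output side toward the input side. The `converged` test and the per-layer update are hypothetical callables, and layers are assumed to expose their structural logits as `alpha` as in the earlier sketch.

```python
def train_structural_bottom_up(layers, converged, train_layer_step):
    # Walk from the output side toward the input side ("bottom to top" in
    # the claim's sense), freezing each layer's alpha once it converges.
    for layer in reversed(layers):            # output-side layer first
        while not converged(layer.alpha):
            train_layer_step(layer)           # update this layer's alpha only
        layer.alpha.requires_grad_(False)     # fix it, then move one layer up
```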
7. The method of claim 5, wherein the first specifying step further comprises:
determining a loss value of the at least one sub-network based on a first loss function, wherein the first loss function comprises an expectation of a preset operating parameter of the at least one sub-network when running on the target device;
determining a gradient of the at least one sub-network based on the loss value; and
updating the structural parameters of the at least one sub-network based on the gradient.
8. The method of claim 7,
wherein the preset operating parameter comprises at least one of running time, memory usage, energy consumption, and FLOPS.
9. The method of claim 7, wherein the first loss function is:

Loss = Loss_CE + λ₁×‖w‖ + λ₂×E[X]

wherein Loss_CE represents a cross-entropy loss, ‖w‖ represents the L2 norm of the non-structural parameters, X represents the preset operating parameter, E[X] represents the expectation of the preset operating parameter, and λ₁ and λ₂ are weighting coefficients.
10. The method of claim 9, wherein the structural parameters comprise selection probabilities of structure candidates, and

E[X] = ω₁×F₁ + ω₂×F₂ + … + ωᵢ×Fᵢ + … + ωₙ×Fₙ

wherein Fᵢ represents the preset operating parameter of the i-th structure candidate corresponding to a layer, ωᵢ represents the selection probability of the i-th structure candidate corresponding to that layer, i = 1, 2, …, n, and n is the number of structure candidates corresponding to the layer.
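Read together, claims 7, 9, and 10 describe a loss that remains differentiable in the structural parameters because the hardware cost enters only through the expectation E[X]. A minimal sketch follows, assuming PyTorch; the λ defaults and the per-candidate cost table `costs` (playing the role of the Fᵢ, e.g. latencies measured on the target device) are invented for the example.

```python
import torch
import torch.nn.functional as F

def search_loss(logits, targets, alpha, costs, weights,
                lam1=1e-4, lam2=1e-2):
    ce = F.cross_entropy(logits, targets)                   # Loss_CE
    l2 = torch.sqrt(sum((w ** 2).sum() for w in weights))   # ||w||
    omega = torch.softmax(alpha, dim=-1)                    # omega_i
    # E[X] = sum_i omega_i * F_i; differentiable in alpha, so the
    # structural parameters can be updated by gradient as in claim 7.
    expected_cost = (omega * costs).sum()
    return ce + lam1 * l2 + lam2 * expected_cost
```

Here `costs` would be a tensor with one entry per structure candidate of the layer, matching the length of `alpha`.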
11. The method of any one of claims 4 to 10, wherein the structural parameters comprise selection probabilities of structure candidates, and selecting a target sub-network from the converged super network as a compressed model of the original deep learning model comprises:
selecting, from each layer of the converged super network, the structure candidate with the highest selection probability, the selected candidates forming a target sub-network; and
taking the target sub-network as the compressed model of the original deep learning model.
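With converged structural parameters, the derivation in claim 11 reduces to one argmax per layer, as in this brief sketch (hypothetical names, reusing the `alpha` convention of the earlier sketches; softmax is monotonic, so the argmax over the logits equals the argmax over the selection probabilities):

```python
def derive_subnetwork(super_layers):
    # Pick, in every layer, the candidate with the highest selection
    # probability; the indices can be fed to forward(x, idx) above.
    return [int(layer.alpha.argmax()) for layer in super_layers]
```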
12. The method of any one of claims 2 to 10,
wherein the training set used to train the super network and the training set used to train the original deep learning model are the same training set.
13. The method of any one of claims 1 to 10,
wherein the target device comprises a physical device or a virtual device.
14. A deep learning model deployment apparatus, comprising:
an original model acquisition module that acquires an original deep learning model to be compressed;
a search space definition module that defines a search space for a neural architecture search (NAS) based on the original deep learning model, wherein, in the search space, the largest structure candidate corresponding to a layer is the structure of that layer in the original deep learning model;
a model compression module that searches the search space for a target sub-network to serve as a compressed model of the original deep learning model; and
a model deployment module that deploys the compressed model to a target device.
15. An electronic device, comprising:
a processor; and
a memory arranged to store computer-executable instructions that, when executed, cause the processor to:
acquire an original deep learning model to be compressed;
define a search space for a neural architecture search (NAS) based on the original deep learning model, wherein, in the search space, the largest structure candidate corresponding to a layer is the structure of that layer in the original deep learning model;
search the search space for a target sub-network to serve as a compressed model of the original deep learning model; and
deploy the compressed model to a target device.
16. A computer-readable storage medium storing one or more programs that, when executed by an electronic device comprising a plurality of application programs, cause the electronic device to:
acquire an original deep learning model to be compressed;
define a search space for a neural architecture search (NAS) based on the original deep learning model, wherein, in the search space, the largest structure candidate corresponding to a layer is the structure of that layer in the original deep learning model;
search the search space for a target sub-network to serve as a compressed model of the original deep learning model; and
deploy the compressed model to a target device.
CN202310449351.XA 2023-04-18 2023-04-18 Deep learning model deployment method, device and equipment Pending CN116468072A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310449351.XA CN116468072A (en) 2023-04-18 2023-04-18 Deep learning model deployment method, device and equipment


Publications (1)

Publication Number Publication Date
CN116468072A (en) 2023-07-21

Family

ID=87175096


Country Status (1)

Country Link
CN (1) CN116468072A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination