CN118036697A - Model data processing method, apparatus and storage medium
- Publication number: CN118036697A (application CN202410411326.7A)
- Authority: CN (China)
- Prior art keywords: model, clipping, proportion, target, original model
- Legal status: Granted
Classifications
- G—PHYSICS; G06—COMPUTING; G06N—Computing arrangements based on specific computational models; G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/092—Reinforcement learning
Abstract
The application provides a model data processing method, device and storage medium. The method comprises: acquiring an original model to be processed; performing structured clipping on the original model according to a preset first clipping ratio to obtain a clipped model; judging, according to the post-clipping inference delay of the clipped model on target hardware, whether the current acceleration of the clipped model relative to the original model is within a target threshold range; and if the current acceleration is not within the target threshold range, adjusting the clipping ratio for the original model according to the current acceleration and the first clipping ratio, so that the acceleration of the clipped model obtained by clipping the original model at the adjusted ratio is within the target threshold range. The application adjusts the structured pruning ratio of the model with the inference delay on the target hardware as feedback, combined with the acceleration measured on that hardware, so that models can be rapidly migrated to different hardware devices and the cost of model clipping is saved.
Description
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a method, an apparatus, and a storage medium for processing model data.
Background
Deep learning model compression and acceleration exploit the redundancy in neural network parameters and structures to obtain a model with fewer parameters and a more streamlined structure without affecting task performance. The compressed model demands less computing power and memory than the original model and can therefore satisfy a wider range of applications.
Structured model clipping is of central importance in the current model acceleration field. Deep learning models, particularly complex models such as Transformers and CNNs, achieve excellent results on many tasks, but their enormous computation and memory footprint limits their deployment on resource-constrained devices (e.g., mobile devices, embedded systems). Model clipping is an effective model compression technique that removes redundant parts of a model and reduces its complexity in order to optimize model size and inference speed. However, traditional model compression generally targets model accuracy, ignores the running speed of the clipped model, generalizes poorly, and is not suitable for deploying small models at scale to mobile terminals and similar devices.
Disclosure of Invention
The main purpose of the embodiments of the application is to provide a model data processing method, device and storage medium that adjust the structured pruning ratio of a model with the inference delay on target hardware as feedback, combined with the current acceleration measured on that hardware, so that the acceleration of the finally clipped model meets the target acceleration value for the target hardware. This not only enables rapid migration of models to different hardware devices and improves the deployment efficiency of small models at scale, but also greatly reduces the cost of model clipping.
In a first aspect, an embodiment of the present application provides a method for processing model data, including: acquiring an original model to be processed; performing structured clipping on the original model according to a preset first clipping ratio to obtain a clipped model; judging, according to the post-clipping inference delay of the clipped model on target hardware, whether the current acceleration of the clipped model relative to the original model is within a target threshold range; and if the current acceleration is not within the target threshold range, adjusting the clipping ratio for the original model according to the current acceleration and the first clipping ratio, so that the acceleration of the clipped model obtained after clipping the original model according to the adjusted clipping ratio is within the target threshold range.
In an embodiment, performing structured clipping on the original model according to the preset first clipping ratio to obtain a clipped model includes: acquiring the delay proportions of the different network layers in the original model when inferring on the target hardware; and performing structured clipping on the original model according to the delay proportions and the first clipping ratio to obtain the clipped model.
In an embodiment, acquiring the delay proportions of the different network layers in the original model when inferring on the target hardware includes: obtaining the average inference delay of the network layers on the target hardware and the inference sub-delay of each network layer of the original model on the target hardware; and calculating the ratio between each network layer's inference sub-delay and the average inference delay to obtain that layer's delay proportion on the target hardware.
In an embodiment, performing structured clipping on the original model according to the delay proportions and the first clipping ratio to obtain the clipped model includes: determining the sub-clipping ratio of each network layer according to the delay proportions and the first clipping ratio; and performing structured clipping on the original model layer by layer according to the sub-clipping ratios of the network layers to obtain the clipped model.
In an embodiment, judging, according to the post-clipping inference delay of the clipped model on the target hardware, whether the current acceleration of the clipped model relative to the original model is within the target threshold range includes: acquiring the initial total inference delay of the original model on the target hardware and the post-clipping inference delay of the clipped model on the target hardware; calculating the current acceleration of the clipped model relative to the original model from the post-clipping inference delay and the initial total inference delay; and judging whether the current acceleration is within the target threshold range.
In an embodiment, if the current acceleration is not within the target threshold range, adjusting the clipping ratio for the original model according to the current acceleration and the first clipping ratio, so that the acceleration of the clipped model obtained after clipping the original model according to the adjusted clipping ratio is within the target threshold range, includes: if the current acceleration is greater than the maximum value of the target threshold range, reducing the clipping ratio for the original model according to the maximum value, the current acceleration and the first clipping ratio, so that the acceleration of the clipped model obtained after clipping the original model according to the reduced clipping ratio is within the target threshold range.
In an embodiment, if the current acceleration is not within the target threshold range, adjusting the clipping ratio for the original model according to the current acceleration and the first clipping ratio, so that the acceleration of the clipped model obtained after clipping the original model according to the adjusted clipping ratio is within the target threshold range, includes: if the current acceleration is smaller than the minimum value of the target threshold range, increasing the clipping ratio for the original model according to the minimum value, the current acceleration and the first clipping ratio, so that the acceleration of the clipped model obtained after clipping the original model according to the increased clipping ratio is within the target threshold range.
In one embodiment, the target threshold range consists of a specified threshold; if the current acceleration is not within the target threshold range, adjusting the clipping ratio for the original model according to the current acceleration and the first clipping ratio, so that the acceleration of the clipped model obtained after clipping the original model according to the adjusted clipping ratio is within the target threshold range, includes: if the current acceleration is not equal to the specified threshold, calculating a second clipping ratio from the specified threshold, the current acceleration, the first clipping ratio and a preset control coefficient, where the preset control coefficient constrains how much the adjusted clipping ratio may change; performing structured clipping on the original model according to the second clipping ratio to obtain a clipped second model; judging, according to a second inference delay of the second model on the target hardware, whether a second acceleration of the second model relative to the original model is equal to the specified threshold; if the second acceleration is not equal to the specified threshold, continuing to execute the step of adjusting the clipping ratio until the acceleration of the clipped model is equal to the specified threshold; and if the acceleration of the clipped model is equal to the specified threshold, determining the clipped model as the target model matched with the target hardware.
In a second aspect, an embodiment of the present application provides a model data processing apparatus, including:
the acquisition module is used for acquiring an original model to be processed;
the clipping module is used for performing structured clipping on the original model according to the preset first clipping ratio to obtain a clipped model;
the judging module is used for judging, according to the post-clipping inference delay of the clipped model on target hardware, whether the current acceleration of the clipped model relative to the original model is within a target threshold range;
and the adjusting module is used for adjusting, if the current acceleration is not within the target threshold range, the clipping ratio for the original model according to the current acceleration and the first clipping ratio, so that the acceleration of the clipped model obtained after clipping the original model according to the adjusted clipping ratio is within the target threshold range.
In an embodiment, the clipping module is configured to acquire the delay proportions of the different network layers in the original model when inferring on the target hardware; and to perform structured clipping on the original model according to the delay proportions and the first clipping ratio to obtain the clipped model.
In an embodiment, the clipping module is configured to obtain the average inference delay of the network layers on the target hardware and the inference sub-delay of each network layer of the original model on the target hardware; and to calculate the ratio between each network layer's inference sub-delay and the average inference delay to obtain that layer's delay proportion on the target hardware.
In an embodiment, the clipping module is configured to determine the sub-clipping ratio of each network layer according to the delay proportions and the first clipping ratio; and to perform structured clipping on the original model layer by layer according to the sub-clipping ratios of the network layers to obtain the clipped model.
In an embodiment, the judging module is configured to obtain an initial inference total delay of the original model on the target hardware and the post-clipping inference delay of the post-clipping model on the target hardware; calculating to obtain the current acceleration of the model after cutting compared with the original model according to the post-cutting reasoning delay and the initial reasoning total delay; and judging whether the current acceleration amount is within the target threshold range.
In an embodiment, the adjusting module is configured to reduce the clipping ratio for the original model according to the maximum value, the current acceleration amount, and the first clipping ratio if the current acceleration amount is greater than the maximum value of the target threshold range, so that the acceleration amount of the clipped model obtained after clipping the original model according to the reduced clipping ratio is within the target threshold range.
In an embodiment, the adjusting module is configured to increase the clipping ratio for the original model according to the minimum value, the current acceleration amount, and the first clipping ratio if the current acceleration amount is smaller than the minimum value of the target threshold range, so that the acceleration amount of the clipped model obtained after clipping the original model according to the increased clipping ratio is within the target threshold range.
In one embodiment, the target threshold range consists of a specified threshold; the adjusting module is configured to calculate, if the current acceleration is not equal to the specified threshold, a second clipping ratio from the specified threshold, the current acceleration, the first clipping ratio and a preset control coefficient, where the preset control coefficient constrains how much the adjusted clipping ratio may change; to perform structured clipping on the original model according to the second clipping ratio to obtain a clipped second model; to judge, according to a second inference delay of the second model on the target hardware, whether a second acceleration of the second model relative to the original model is equal to the specified threshold; if the second acceleration is not equal to the specified threshold, to continue executing the step of adjusting the clipping ratio until the acceleration of the clipped model is equal to the specified threshold; and if the acceleration of the clipped model is equal to the specified threshold, to determine the clipped model as the target model matched with the target hardware.
In a third aspect, an embodiment of the present application provides an electronic device, including:
At least one processor; and
A memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to cause the electronic device to perform the method of any of the above aspects.
In a fourth aspect, an embodiment of the present application provides a cloud device, including:
At least one processor; and
A memory communicatively coupled to the at least one processor;
Wherein the memory stores instructions executable by the at least one processor to cause the cloud device to perform the method of any of the above aspects.
In a fifth aspect, an embodiment of the present application provides a computer readable storage medium, where computer executable instructions are stored, and when executed by a processor, implement the method according to any one of the above aspects.
In a sixth aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements the method of any of the above aspects.
According to the model data processing method, device and storage medium provided by the embodiments of the application, the original model is structurally clipped according to a preset clipping ratio, and the inference delay of the clipped model on the target hardware is used as feedback. Whether the acceleration of the clipped model is within the target threshold range is judged from this inference delay; if not, the first clipping ratio cannot achieve the acceleration target, and the clipping ratio for the original model is adjusted according to the current acceleration and the first clipping ratio, so that the acceleration of the model obtained by clipping the original model at the finally adjusted ratio meets the target threshold range. The structured clipping ratio of the model is thus adjusted directly with the inference delay of the target hardware as feedback, combined with the current acceleration on that hardware, so that the acceleration of the finally clipped model meets the target acceleration value of the target hardware. Models can therefore be quickly migrated to different hardware devices, the deployment efficiency of small models at scale is improved, and the cost of model clipping is greatly reduced.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It will be apparent to those of ordinary skill in the art that the drawings in the following description are of some embodiments of the application and that other drawings may be derived from them without inventive faculty.
FIG. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an application scenario of a model data processing system according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a model data processing method according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of a model data processing method according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of a model data processing method according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a model data processing apparatus according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a cloud device according to an embodiment of the present application.
Specific embodiments of the present application have been shown by way of the above drawings and will be described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but rather to illustrate the inventive concepts to those skilled in the art by reference to the specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application.
The term "and/or" is used herein to describe association of associated objects, and specifically indicates that three relationships may exist, for example, a and/or B may indicate: a exists alone, A and B exist together, and B exists alone.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.
In order to clearly describe the technical solution of the embodiments of the present application, firstly, the terms involved in the present application are explained:
CNN: convolutional Neural Networks, convolutional neural networks.
DNN: deep Neural Networks, deep neural networks.
Transformer: the model is a model which utilizes an attention mechanism to improve the training speed of the model, and is suitable for parallelization calculation and the complexity of the model per se, so that the precision and the performance of the model are higher than those of the circulating neural network which is popular before.
CPU: central Processing Unit, a central processing unit.
GPU: graphics processing unit, a graphics processor.
DSP: DIGITAL SIGNAL Processing, digital signal Processing techniques.
AI: ARTIFICIAL INTELLIGENCE, artificial intelligence.
AutoML: auto MACHINE LEARNING, automatic machine learning.
DDPG: DEEP DETERMINISTIC Policy Gradient algorithm, depth deterministic Policy Gradient algorithm.
The model data processing approach of the embodiments of the application can be applied to any field in which models need to be clipped.
Taking the deep learning model as an example, deep learning model compression and acceleration exploit the redundancy in neural network parameters and structures to obtain a model with fewer parameters and a more streamlined structure without affecting task performance. The compressed model demands less computing power and memory than the original model and can therefore satisfy a wider range of applications.
Structured model clipping is of central importance in the current model acceleration field. Deep learning models, particularly complex models such as Transformers and CNNs, achieve excellent results on many tasks, but their enormous computation and memory footprint limits their deployment on resource-constrained devices (e.g., mobile devices, embedded systems). Model clipping is an effective model compression technique that removes redundant parts of a model and reduces its complexity in order to optimize model size and inference speed.
Among clipping approaches, model structured clipping is particularly advantageous. Compared with unstructured clipping (weight-level clipping), structured clipping focuses on simplification at the level of the model structure, such as removing an entire convolution kernel, filter or neuron. The resulting model corresponds directly to a reduction of operation units at the hardware execution level, which favors hardware-level optimization, makes the model easy to deploy on a specific hardware platform, and adds no extra software complexity.
Manually setting the clipping ratio during model clipping has clear limitations: when facing a large number of models of different scales and structures and different hardware environments, finding the optimal clipping strategy is hugely laborious and time-consuming.
Automatically searching for the optimal clipping ratio, i.e., automated structural optimization of a model, is therefore one of the hot spots of current research. Such techniques generally use iteration or reinforcement learning, take quantified hardware inference performance (such as latency and power consumption) together with model accuracy loss as feedback signals, adjust the model structure algorithmically, and explore the optimal clipping strategy. They can optimize more precisely for a specific hardware platform and avoid the subjectivity and blindness that may occur when the clipping ratio is set manually.
In practical applications, automated model clipping can effectively improve the running efficiency of a model on specific hardware, reduce inference delay, and compress the model size as far as possible while preserving model accuracy. It has important research value and practical significance for promoting the wide application of deep learning models in resource-limited environments such as embedded systems and mobile devices. It also helps reduce the labor cost of model development and optimization, making large-scale model acceleration work more efficient and convenient.
In practical scenarios, automated model clipping can be implemented in the following ways:
mode one: according to the sparse CNN, a sparse structure optimization and hardware acceleration strategy after model clipping is executed, the scheme has the defects that aiming at an unstructured sparse acceleration strategy on hardware, the universality is poor, the sparse CNN cannot be transplanted to different hardware platforms, moreover, the sparse CNN belongs to an unstructured method, an acceleration effect cannot be directly obtained, a matched acceleration operator is required to be realized on specific hardware, the investment in research and development of mobile equipment is high, and the method is difficult to apply on a large scale.
Mode two: an automated structured pruning framework based on deep neural networks (DNNs), which adopts a bottom-up strategy to build an optimal pruning scheme starting from individual neurons or filters. By analyzing the weight distribution and feature correlation inside each network layer, it determines which structures (such as filters in convolutional neural networks or neurons in fully connected layers) can be safely deleted while minimizing the impact on final prediction accuracy. The drawback of this scheme is that the structured pruning search targets model accuracy and considers only the reduction in computation; the actual latency speed-up of the model is never measured. In addition, the framework must be fully retrained in every round of actual use, so the number of iterations is too large and the efficiency too low.
Mode three: automated machine learning (AutoML) techniques are used to search for efficient model architectures and compression strategies suited to mobile devices. The drawback of this scheme is that it requires building a complex sampling network and search mechanism, and the accuracy of the clipped model is guaranteed only by continual model retraining; the time cost is therefore too high, and the scheme is unsuitable for deploying small models at scale to mobile terminals and similar devices.
These model compression approaches generally target model accuracy, ignore the running speed of the clipped model, and generalize poorly; for scenarios that deploy small models at scale to devices such as mobile terminals, their efficiency is very low.
In one approach, the reinforcement learning DDPG method may be used for the search: the inference delay serves as the reward function, the structured pruning ratio serves as the input sample, and a cost model is continuously learned; after training on a large number of samples, the cost model can predict the actual inference delay of the model for a specified clipping ratio. However, this approach relies on fitting a large number of (clipping ratio, inference delay) samples, so errors are inevitable, and it amounts to retraining an additional network and consumes more time.
To solve at least one of the above problems, the embodiments of the application provide a model data processing scheme. The original model is structurally clipped according to a preset clipping ratio, and the inference delay of the clipped model on the target hardware is used as feedback. Whether the acceleration of the clipped model is within the target threshold range is judged from this inference delay; if not, the first clipping ratio cannot achieve the acceleration target, and the clipping ratio for the original model is adjusted according to the current acceleration and the first clipping ratio, so that the acceleration of the model obtained by clipping the original model at the finally adjusted ratio meets the target threshold range. The structured clipping ratio is thus adjusted directly with the inference delay of the target hardware as feedback, combined with the current acceleration on that hardware, so that the acceleration of the finally clipped model meets the target acceleration value of the target hardware. This not only enables rapid migration of models to different hardware devices and improves the deployment efficiency of small models at scale, but also greatly reduces the cost of model clipping.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. In the case where there is no conflict between the embodiments, the following embodiments and features in the embodiments may be combined with each other. In addition, the sequence of steps in the method embodiments described below is only an example and is not strictly limited.
As shown in fig. 1, the present embodiment provides an electronic apparatus 1 including: at least one processor 11 and a memory 12; one processor is taken as an example in fig. 1. The processor 11 and the memory 12 are connected by a bus 10. The memory 12 stores instructions executable by the processor 11; when executed by the processor 11, the instructions enable the electronic device 1 to perform all or part of the methods in the following embodiments, adjusting the structured pruning ratio of a model with the inference delay of the target hardware as feedback, combined with the current acceleration on that hardware, so that the acceleration of the finally clipped model meets the target acceleration value of the target hardware. This enables rapid migration of models to different hardware devices, improves the deployment efficiency of small models at scale, and greatly reduces the cost of model clipping.
In an embodiment, the electronic device 1 may be a mobile phone, a tablet computer, a notebook computer, a desktop computer, or a large computing system composed of a plurality of computers.
Fig. 2 is a schematic diagram of an application scenario 200 of a model data processing system according to an embodiment of the present application. As shown in fig. 2, the system includes: server 210 and terminal 220, wherein:
server 210 may be a data platform that provides model data processing services, such as a cloud platform. In an actual scenario, there may be multiple servers 210 in a cloud platform, and in fig. 2, 1 server 210 is taken as an example.
The terminal 220 may be a computer, a mobile phone, a tablet, or another device used when the user logs in to the cloud platform; there may be a plurality of terminals 220, and in fig. 2, 2 terminals 220 are taken as an example.
Information transmission between the terminal 220 and the server 210 may be performed through the internet, so that the terminal 220 may access data on the server 210. The terminal 220 and/or the server 210 may be implemented by the electronic device 1.
The model data processing scheme of the embodiment of the application can be deployed on the server 210, the terminal 220 or the server 210 and the terminal 220. The actual scene may be selected based on actual requirements, which is not limited in this embodiment.
When the model data processing scheme is deployed in whole or in part on the server 210, an interface may be invoked open to the terminal 220 to provide algorithmic support to the terminal 220.
The method provided by the embodiment of the application can be realized by the electronic equipment 1 executing corresponding software codes and by carrying out data interaction with a server. The electronic device 1 may be a local terminal device. When the method is run on a server, the method can be implemented and executed based on a cloud interaction system, wherein the cloud interaction system comprises the server and the client device.
In a possible implementation manner, the method provided by the embodiment of the present application provides a graphical user interface through a terminal device, where the terminal device may be the aforementioned local terminal device or the aforementioned client device in the cloud interaction system.
Please refer to fig. 3, which shows a model data processing method according to an embodiment of the present application. The method may be executed by the electronic device 1 shown in fig. 1 and applied in the application scenario shown in fig. 2, so as to adjust the structured pruning ratio of a model with the inference delay of the target hardware as feedback, combined with the current acceleration on that hardware, so that the acceleration of the finally pruned model meets the target acceleration value of the target hardware. This not only enables fast migration of models to different hardware devices and improves the deployment efficiency of small models at scale, but also greatly reduces the cost of model pruning. In this embodiment, taking the terminal 220 as the executing terminal as an example, the method includes the following steps:
Step 301: obtain the original model to be processed.
In this step, the original model to be processed may be a deep learning model or an AI model, such as a visual CNN-type model or a language Transformer-type model. The original model file to be processed can be obtained from local or remote storage, or input by a user in real time. For example, when an engineer needs to deploy a certain AI model to a mobile device, that AI model can be designated as the original model to be processed; after it is automatically clipped by the embodiments of the application, a target model matched with the designated mobile device is obtained, realizing fast deployment of the model.
Step 302: perform structured clipping on the original model according to the preset first clipping ratio to obtain a clipped model.
In this step, the first clipping ratio refers to the clipping ratio for the original model as a whole and may be predetermined from historical clipping data of similar models. The original model is structurally clipped based on the first clipping ratio. Structured clipping focuses on simplification at the level of the model structure, for example removing an entire convolution kernel, filter or neuron; the resulting clipped model corresponds directly to a reduction of operation units at the hardware execution level, favors hardware-level optimization, is easy to deploy on a specific hardware platform, and adds no extra software complexity.
In one embodiment, step 302 may specifically include: obtaining the delay proportions of the different network layers in the original model when inferring on the target hardware; and performing structured clipping on the original model according to the delay proportions and the first clipping ratio to obtain the clipped model.
In this embodiment, the target hardware is the hardware device on which the original model needs to be deployed; it may be a mobile device. For example, if original model A needs to be deployed on a mobile phone, the mobile phone is the target hardware. A deep learning model generally comprises multiple neural network layers. The delay proportion is the share of each network layer's inference delay on the target hardware relative to the whole original model's inference delay on that hardware; it characterizes the inference speed of each network layer on the target hardware. When the original model is structurally clipped, the inference performance of each network layer on the target hardware is fully exploited: the delay proportions of the different network layers are obtained, the original model is structurally clipped by combining each layer's delay proportion with the overall first clipping ratio, and the overall clipping ratio is flexibly distributed to each module, so that model clipping achieves a more effective acceleration effect.
In one embodiment, obtaining the delay proportions of the different network layers in the original model when inferring on the target hardware includes: obtaining the average inference delay of the network layers on the target hardware and the inference sub-delay of each network layer of the original model on the target hardware; and calculating the ratio between each network layer's inference sub-delay and the average inference delay to obtain that layer's delay proportion on the target hardware.
In this embodiment, the original model may be divided into layer-by-layer modules according to its network layers, each module is inferred separately on the target hardware, and the inference delay of each module is recorded as:
T0 = t_{L_1} + t_{L_2} + … + t_{L_N}
where N represents the total number of network layers contained in the original model, L_n represents the n-th network layer, t_{L_n} represents the inference sub-delay of the n-th network layer of the original model on the target hardware, and n is a positive integer. T0 represents the total inference delay of the original model on the target hardware, and the average inference delay corresponding to each network layer is t_avg = T0 / N.
The proportion of each network layer's inference sub-delay to the average inference delay t_avg can be calculated with the following formula:
x_n = t_{L_n} / t_avg = N · t_{L_n} / T0
where x_n represents the ratio between the inference sub-delay of network layer L_n in the original model and the average inference delay, i.e., the delay proportion. Using the ratio between the inference sub-delay and the average inference delay as the delay proportion simplifies the data calculation while accurately characterizing each network layer's delay share on the target hardware, thereby providing an accurate reference for structured clipping.
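As a concrete illustration of the measurement above, the following Python sketch times each network layer separately on the target hardware and derives the delay proportions x_n. It is a minimal sketch under the assumption that every layer can be invoked independently on an input of the shape it expects; all helper names are illustrative, not part of the application.

```python
import time
from typing import Callable, List, Sequence

def measure_layer_delays(layers: Sequence[Callable], inputs: Sequence,
                         repeats: int = 50) -> List[float]:
    """Run each layer-module separately on the target hardware and record
    its average inference sub-delay t_{L_n} in seconds."""
    delays = []
    for layer, x in zip(layers, inputs):
        start = time.perf_counter()
        for _ in range(repeats):
            layer(x)
        delays.append((time.perf_counter() - start) / repeats)
    return delays

def delay_proportions(delays: Sequence[float]) -> List[float]:
    """x_n = t_{L_n} / t_avg with t_avg = T0 / N; the proportions average to 1."""
    t0 = sum(delays)           # total inference delay T0 of the original model
    t_avg = t0 / len(delays)   # average per-layer inference delay
    return [t / t_avg for t in delays]
```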
In an embodiment, performing structured clipping on the original model according to the delay proportions and the first clipping ratio to obtain the clipped model includes: determining the sub-clipping ratio of each network layer according to the delay proportions and the first clipping ratio; and performing structured clipping on the original model layer by layer according to the sub-clipping ratios of the network layers to obtain the clipped model.
In this embodiment, the first clipping ratio may be allocated to each network layer according to that layer's delay proportion, thereby determining each layer's sub-clipping ratio; the sub-clipping ratio of a network layer may be positively correlated with its delay proportion.
For example, the actual sub-clipping ratio allocated to each network layer may be determined according to the following formula:
r_{L_n} = r_i · x_n
where r_i represents the overall clipping ratio preset for the i-th iteration on the original model (so r_1 represents the first clipping ratio), and r_{L_n} represents the sub-clipping ratio of the n-th network layer of the original model.
The original model is then structurally clipped layer by layer according to the sub-clipping ratios of the different network layers to obtain the clipped model. Specifically, structured clipping can proceed layer by layer according to each layer's clipping ratio: for example, the output-channel dimension is clipped for convolution layers and the column dimension for fully connected layers, so the parameters of the corresponding network layer modules shrink and the computation and parameter count decrease proportionally. The actual performance of each network layer on the target hardware is fully exploited, improving the fit between the clipped model and the target hardware.
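The per-layer allocation and the layer-by-layer structured clipping can be sketched as follows. The L1-norm ranking used to pick which output channels to drop is a common importance criterion assumed here for illustration, and the cap on a layer's sub-ratio is likewise an assumption that keeps any single layer from being clipped away entirely.

```python
from typing import List, Sequence

def allocate_sub_ratios(r_i: float, proportions: Sequence[float],
                        cap: float = 0.9) -> List[float]:
    """r_{L_n} = r_i * x_n: distribute the overall clipping ratio across
    layers in positive correlation with each layer's delay proportion."""
    return [min(r_i * x, cap) for x in proportions]

def clip_output_channels(weight: List[List[float]], ratio: float) -> List[List[float]]:
    """Structured clipping of one layer: remove round(ratio * n_out) output
    channels (rows of the weight matrix) with the smallest L1 norm."""
    n_out = len(weight)
    n_drop = round(ratio * n_out)
    ranked = sorted(range(n_out), key=lambda r: sum(abs(w) for w in weight[r]))
    keep = sorted(set(range(n_out)) - set(ranked[:n_drop]))
    return [weight[r] for r in keep]
```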
Step 303: judge, according to the post-clipping inference delay of the clipped model on the target hardware, whether the current acceleration of the clipped model relative to the original model is within the target threshold range. If yes, go to step 305; otherwise go to step 304.
In this step, the post-clipping inference delay is the delay measured when the clipped model from step 302 is deployed on the target hardware for inference; it characterizes the inference speed of the clipped model on the target hardware, so it can be used to determine whether the current acceleration of the clipped model relative to the original model meets the target threshold requirement. The target threshold range may be set according to actual requirements; it may span multiple target thresholds or be a single fixed specified threshold. For example, the target threshold may be a value representing an acceleration multiple: if the target threshold S = 2, the model clipping requirement is met when the acceleration multiple of the clipped model reaches 2.
Taking as reference the delay feedback obtained by inference on the specific hardware is a key link in structured model clipping. The computational characteristics, memory access patterns and parallel processing capabilities of different hardware platforms, such as CPUs, GPUs, DSPs or dedicated AI chips, vary, so their sensitivity to the model architecture varies as well. Actually running the model on the target hardware and obtaining performance indicators such as inference delay provides direct guidance for the clipping strategy and helps design an efficient model structure that both meets the accuracy requirement and exploits the hardware characteristics to the greatest extent. This software-hardware co-optimization is of great significance in practice for improving model inference speed, reducing energy consumption and improving user experience.
In an embodiment, the first clipping ratio may be determined from the target acceleration threshold: for example, with the target acceleration multiple set to S = 2, the overall first clipping ratio of the original model may be r_1 = 1/S = 0.5.
In one embodiment, step 303 may specifically include: acquiring the initial total inference delay of the original model on the target hardware and the post-clipping inference delay of the clipped model on the target hardware; calculating the current acceleration of the clipped model relative to the original model from the post-clipping inference delay and the initial total inference delay; and judging whether the current acceleration is within the target threshold range.
In this implementation, the original model may be deployed on the target hardware for inference to obtain the initial total inference delay T0; the clipped model from step 302 is likewise deployed on the target hardware, yielding a new post-clipping inference delay T_i. The acceleration multiple S_i of the clipped model relative to the original model can then be calculated with the following formula:
S_i = T0 / T_i
where S_i represents the acceleration multiple of the i-th clipped model relative to the original model. Using the acceleration multiple S_i to represent the current acceleration of the clipped model relative to the original model, and then directly judging whether S_i lies within the target threshold range, is more intuitive.
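The acceleration check of this step reduces to a few lines; the sketch below assumes the target threshold range is given by its minimum and maximum values (which coincide when the range is a single specified threshold).

```python
def acceleration_multiple(t0: float, t_i: float) -> float:
    """S_i = T0 / T_i: speed-up of the i-th clipped model over the original."""
    return t0 / t_i

def within_target(s_i: float, s_min: float, s_max: float) -> bool:
    """True when the current acceleration lies in the target threshold range."""
    return s_min <= s_i <= s_max
```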
Step 304: if the current acceleration is not within the target threshold range, adjust the clipping ratio for the original model according to the current acceleration and the first clipping ratio, so that the acceleration of the clipped model obtained after clipping the original model according to the adjusted clipping ratio is within the target threshold range.
In this step, if the current acceleration is not within the target threshold range, the first clipping ratio cannot achieve the acceleration target, and the clipping ratio for the original model can be adjusted according to the current acceleration and the first clipping ratio, so that the acceleration of the model obtained by clipping the original model at the finally adjusted ratio meets the target threshold range. The structured clipping ratio is thus adjusted directly with the inference delay of the target hardware as feedback, combined with the current acceleration on that hardware, so that the acceleration of the finally clipped model meets the target acceleration value of the target hardware. This enables fast migration of models to different hardware devices, improves the deployment efficiency of small models at scale, and greatly reduces the cost of model clipping.
In one embodiment, step 304 may specifically include: if the current acceleration is greater than the maximum value of the target threshold range, reducing the clipping ratio for the original model according to the maximum value, the current acceleration and the first clipping ratio, so that the acceleration of the clipped model obtained after clipping the original model according to the reduced clipping ratio is within the target threshold range.
In this embodiment, if the current acceleration is greater than the maximum value of the target threshold range, the current clipping ratio is too large and can be reduced appropriately before entering the next round of clipping, so that after the original model is clipped at the reduced ratio, the acceleration of the resulting clipped model falls within the target threshold range; through multiple adjustments of the clipping ratio, a clipped model meeting the acceleration requirement is finally obtained.
In one embodiment, step 304 may specifically include: if the current acceleration is smaller than the minimum value of the target threshold range, increasing the clipping ratio for the original model according to the minimum value, the current acceleration and the first clipping ratio, so that the acceleration of the clipped model obtained after clipping the original model according to the increased clipping ratio is within the target threshold range.
In this embodiment, if the current acceleration is smaller than the minimum value of the target threshold range, the current clipping ratio is too small and can be increased appropriately before entering the next round of clipping, so that after the original model is clipped at the increased ratio, the acceleration of the resulting clipped model falls within the target threshold range; through multiple adjustments of the clipping ratio, a clipped model meeting the acceleration requirement is finally obtained.
In one embodiment, the target threshold range consists of a specified threshold. Step 304 may then specifically include: if the current acceleration is not equal to the specified threshold, calculating a second clipping ratio from the specified threshold, the current acceleration, the first clipping ratio and a preset control coefficient; performing structured clipping on the original model according to the second clipping ratio to obtain a clipped second model; judging, according to a second inference delay of the second model on the target hardware, whether a second acceleration of the second model relative to the original model is equal to the specified threshold; and if the second acceleration is not equal to the specified threshold, continuing to execute the step of adjusting the clipping ratio until the acceleration of the clipped model equals the specified threshold.
In this embodiment, the preset control coefficient constrains how much the adjusted clipping ratio may change and can be set according to actual requirements. This coefficient prevents the clipping ratio from varying too much in a single adjustment, which would otherwise inflate the number of adjustment rounds, and thus improves adjustment efficiency.
Taking the case where the target threshold range is a fixed threshold S as an example: if the current acceleration multiple S_i > S, the current clipping proportion r_i is too large and the next clipping proportion should be reduced appropriately; if S_i < S, r_i is too small and the next clipping proportion should be increased. Specifically, a feedback formula parameterized by the preset control coefficient P may be used to calculate r_{i+1}, the clipping proportion of the (i+1)-th iteration, i.e. the adjusted clipping proportion. When S_i is greater than S, the formula guarantees r_{i+1} < r_i, so that the next inference delay T_{i+1}, obtained after clipping the original model according to the adjusted proportion r_{i+1}, shifts accordingly and the acceleration multiple S_{i+1} of the next clipped model converges toward the target acceleration multiple S.
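For illustration, a minimal sketch of one such update rule follows. The multiplicative, P-dampened form is an assumption consistent with the behavior just described, not a formula given in the text:

```python
def next_clipping_ratio(r_i: float, s_i: float, s_target: float, p: float) -> float:
    """One assumed form of the feedback update: dampened by the preset
    control coefficient p, it lowers the clipping proportion when the
    measured speedup s_i overshoots s_target and raises it when s_i
    undershoots, with a fixed point at s_i == s_target."""
    return r_i * (1.0 + p * (s_target - s_i) / s_target)

# Overshoot: S_i = 2.7 > S = 2.0, so the proportion shrinks.
assert next_clipping_ratio(0.5, 2.7, 2.0, 0.5) < 0.5
# Undershoot: S_i = 1.5 < S = 2.0, so the proportion grows.
assert next_clipping_ratio(0.5, 1.5, 2.0, 0.5) > 0.5
```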
Step 305: and if the acceleration amount of the model after clipping is equal to a specified threshold value, determining the model after clipping as a target model matched with target hardware.
In this step, if S_i = S, the acceleration target for the original model on the target hardware has been reached, so the model from the i-th clipping may be determined as the target model matching the target hardware, and the target model may then be deployed on that hardware.
As shown in fig. 4, which is a flow chart of a model data processing method provided by an embodiment of the present application, the embodiment automatically determines a model clipping proportion meeting the acceleration requirement by loop iteration. The main flow is: obtain the layer-by-layer delay proportion; perform structured pruning on the original model by distributing the clipping proportion layer by layer and pruning layer by layer; measure the model speed and judge whether it meets the standard, ending if it does; otherwise feed the inference delay back and judge whether the acceleration multiple of the clipped model equals the specified threshold. If the acceleration multiple is smaller than the specified threshold, increase the clipping proportion and return to the original model for the next round of structured pruning; if it is larger than the specified threshold, decrease the clipping proportion and return likewise. Iterate until the acceleration multiple of the clipped model equals the specified threshold, then end.
Alternatively, the specific details may be as follows (a runnable sketch of the full loop is given after these steps):
1. First obtain an AI model to be clipped, such as a visual CNN model or a language Transformer model.
2. Set a target acceleration multiple S, for example S = 2; the initial overall clipping proportion is r_1 = 1/S = 0.5, the maximum iteration number is I_max = 100, and the preset control coefficient is P = 0.5.
3. Divide the original model into layer-by-layer modules according to its network layers, run inference for each module separately on the target hardware, and record the inference delays as

T_0 = \sum_{n=1}^{N} T_0^{L_n}

where N is the total number of network layers contained in the original model, L_n is the n-th network layer, T_0^{L_n} is the inference sub-delay of the n-th network layer of the original model on the target hardware, and n is a positive integer. T_0 is the total inference delay of the original model on the target hardware, and the average inference delay per network layer is \bar{T} = T_0 / N.
4. Calculate the specific gravity of each network layer's inference sub-delay relative to the average inference delay:

w_n = T_0^{L_n} / \bar{T}, \quad n = 1, 2, \ldots, N.
5. For i = 1, 2, …, I_max do (start loop):
6. The overall clipping proportion of the current iteration is r_i;
7. According to the delay specific gravity obtained in step 4, the actual clipping proportion allocated to each network module is

r_i^{L_n} = r_i \cdot w_n

where r_i is the overall clipping proportion preset for the i-th iteration of the original model and r_i^{L_n} is the sub-clipping proportion of the n-th network layer.
8. Perform structured clipping layer by layer according to each network layer's clipping proportion (for example, a convolutional layer is clipped along the output-channel dimension and a fully connected layer along the column dimension); the corresponding module parameters shrink, and the computation amount and parameter amount decrease proportionally.
9. Deploy the clipped model on the target hardware, obtain the new delay T_i after inference, and calculate the acceleration multiple S_i = T_0 / T_i of the clipped model relative to the original model.
10. If S_i = S, the acceleration target is reached; end the loop.
11. If S_i ≠ S, calculate the overall clipping proportion r_{i+1} of the next iteration from the target S, the measured S_i, the current r_i and the control coefficient P (one concrete form is given in the sketch after these steps). The update guarantees r_{i+1} < r_i when S_i > S and r_{i+1} > r_i when S_i < S; in both cases the delay T_{i+1} of the next clipped model shifts so that S_{i+1} converges toward the target acceleration multiple S. Then return to step 5 to start the next iteration.
12. End for (the loop also ends once the iteration count reaches the preset I_max = 100).
13. Suppose the finally obtained overall clipping proportion of the model is r_{i+1}; the clipping proportion of each module is then determined as

r_{i+1}^{L_n} = r_{i+1} \cdot w_n.
14. Perform structured clipping layer by layer according to each module's clipping proportion to obtain the target model and its inference delay T_{i+1} on the target hardware. End.
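As promised above, here is a minimal runnable sketch of the loop in steps 3 to 14. The per-layer delays, the assumption that clipping a fraction of a layer removes the same fraction of its delay, and the P-dampened update rule are all illustrative stand-ins; a real run would measure T_i by deploying each clipped model on the target hardware:

```python
import numpy as np

# Hypothetical per-layer inference sub-delays (ms) measured on the target
# hardware (step 3); in a real run these come from profiling the model.
layer_delays = np.array([4.0, 9.0, 2.0, 5.0])   # T_0^{L_n}
T0 = layer_delays.sum()                          # total initial delay T_0
weights = layer_delays / layer_delays.mean()     # delay specific gravity w_n (step 4)

S_target = 2.0       # target acceleration multiple S (step 2)
P = 0.5              # preset control coefficient
r = 1.0 / S_target   # initial overall clipping proportion r_1
I_max = 100          # maximum iteration number

for i in range(I_max):
    # Step 7: allocate the overall proportion to each layer by delay weight,
    # capped so no layer is clipped entirely (the cap is a safeguard we add).
    layer_ratios = np.clip(r * weights, 0.0, 0.95)
    # Steps 8 and 9 stand-in: assume clipping a fraction q of a layer removes
    # the same fraction of its delay; a real run would redeploy the clipped
    # model on the hardware and measure T_i instead.
    T_i = float(np.sum(layer_delays * (1.0 - layer_ratios)))
    S_i = T0 / T_i                      # acceleration multiple S_i = T_0/T_i
    if abs(S_i - S_target) < 1e-3:      # step 10: S_i equals S up to tolerance
        break
    # Step 11: assumed P-dampened feedback update; shrinks r when S_i > S,
    # grows it when S_i < S, so S_{i+1} converges toward S.
    r *= 1.0 + P * (S_target - S_i) / S_target

print(f"iterations: {i + 1}, overall ratio r = {r:.4f}, speedup S = {S_i:.4f}")
print("per-layer ratios:", np.round(np.clip(r * weights, 0.0, 0.95), 4))
```

With these toy numbers the loop converges in roughly a dozen iterations, since each update halves the remaining gap between S_i and S; the convergence rate in general depends on P and on how the hardware delay actually responds to clipping.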
According to the above model data processing method, the model structured pruning strategy takes the target hardware delay as its optimization guide: instead of referring to computation amount or parameter amount, it targets the hardware delay directly and uses it as feedback while searching for the structured pruning proportion. Through this automatic search mechanism, the pruning proportion can be adjusted in real time during iteration according to the acceleration requirement of the target hardware, achieving fast convergence and effectively saving the cost of manual intervention and tuning. Because it does not rely on computation amount or parameter amount, it can ensure that the actual running delay of the clipped model matches the preset target.
The scheme provided by the embodiment of the present application not only uses the overall clipping proportion of the model but also makes full use of the clipping proportion of each layer's specific module, flexibly distributing the overall proportion across modules so that clipping yields the most effective acceleration. The method needs no fine-tuning or retraining of the pruned model, which markedly reduces time and space cost and improves efficiency; it generalizes well and can easily be applied to different types of model structures. The scheme is not bound to specific hardware, can be quickly ported to different platforms, and is easy to deploy at scale.
Please refer to fig. 5, which shows a model data processing method provided by an embodiment of the present application. The method may be executed by the electronic device 1 shown in fig. 1 and applied to the application scenario shown in fig. 2. It adjusts the model's structured pruning proportion by using the inference delay on the target hardware as feedback, combined with the current acceleration amount, so that the acceleration amount of the finally pruned model meets the target acceleration value of the target hardware. This enables fast migration of the model to different hardware devices, improves the deployment efficiency of small models at scale, and greatly reduces the cost of model pruning. In this embodiment, taking the terminal 220 as the executing terminal as an example, the method includes the following steps:
Step 501: and acquiring an original model to be processed, a first cutting proportion preset by the original model and initial reasoning total delay of the original model on target hardware.
Step 502: and obtaining average reasoning delay of different network layers in the original model on the target hardware and reasoning sub-delay of different network layers in the original model on the target hardware.
Step 503: and respectively calculating the ratio between the inference sub-delays and the average inference delays of different network layers to obtain the delay proportion of the inference of the different network layers on the target hardware.
Step 504: and determining sub-clipping ratios corresponding to different network layers according to the delay specific gravity and the first clipping ratio.
Step 505: and carrying out structural clipping treatment on the original model layer by layer according to the sub clipping proportion of different network layers to obtain a clipped model.
Step 506: and acquiring the post-clipping reasoning delay of the post-clipping model on the target hardware.
Step 507: and calculating to obtain the current acceleration of the model after cutting compared with the original model according to the reasoning delay after cutting and the initial reasoning total delay.
Step 508: it is determined whether the current acceleration amount is equal to a specified threshold.
Step 509: if the current acceleration is greater than the specified threshold, reducing the clipping proportion of the original model according to the maximum value, the current acceleration and the current clipping proportion to obtain the adjusted clipping proportion. Step 511 is then entered.
Step 510: if the current acceleration is smaller than the specified threshold, increasing the clipping proportion of the original model according to the minimum value, the current acceleration and the current clipping proportion to obtain the adjusted clipping proportion. Step 511 is then entered.
Step 511: and respectively determining the sub-clipping ratios corresponding to different network layers according to the delay proportion and the clipping ratio after adjustment. And then returns to step 505.
Step 512: if the current acceleration is equal to the specified threshold, determining the current clipping proportion as a final clipping proportion matched with the target hardware, performing structural clipping on the original model according to the final clipping proportion to obtain a target model, and deploying the target model on the target hardware.
The details of each step of the model data processing method can be referred to the related description of the above embodiment, which is not repeated here.
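To make the layer-by-layer structured clipping of step 505 (and the earlier step 8) concrete, the sketch below clips a convolutional weight along its output-channel dimension and a fully connected weight along its column dimension. The L1-norm ranking used to pick which channels to drop is an assumed criterion, since the text does not specify one, and the random arrays stand in for real layer weights:

```python
import numpy as np

def clip_conv_output_channels(weight: np.ndarray, ratio: float) -> np.ndarray:
    """Structured clipping of a conv layer along the output-channel
    dimension. `weight` has shape (out_ch, in_ch, kh, kw); a fraction
    `ratio` of output channels is removed. Channels with the smallest
    L1 norm are dropped first, an assumed criterion for illustration."""
    out_ch = weight.shape[0]
    keep = out_ch - int(round(out_ch * ratio))
    l1 = np.abs(weight).reshape(out_ch, -1).sum(axis=1)
    kept = np.sort(np.argsort(l1)[-keep:])  # keep strongest channels, preserve order
    return weight[kept]

def clip_fc_columns(weight: np.ndarray, ratio: float) -> np.ndarray:
    """Structured clipping of a fully connected layer along the column
    dimension, mirroring the removal of the preceding layer's outputs."""
    cols = weight.shape[1]
    keep = cols - int(round(cols * ratio))
    l1 = np.abs(weight).sum(axis=0)
    kept = np.sort(np.argsort(l1)[-keep:])
    return weight[:, kept]

conv_w = np.random.randn(64, 32, 3, 3)
print(clip_conv_output_channels(conv_w, 0.5).shape)  # (32, 32, 3, 3)
fc_w = np.random.randn(10, 256)
print(clip_fc_columns(fc_w, 0.25).shape)             # (10, 192)
```

Because whole channels or columns are removed, the clipped tensors stay dense and regular, which is why structured clipping translates directly into lower inference delay on the target hardware, unlike unstructured (element-wise) pruning.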
Please refer to fig. 6, which shows a model data processing apparatus 600 provided by an embodiment of the present application. The apparatus is applicable to the electronic device 1 shown in fig. 1 and to the application scenario shown in fig. 2, and adjusts the model's structured pruning proportion by using the inference delay on the target hardware as feedback, combined with the current acceleration amount, so that the acceleration amount of the finally pruned model meets the target acceleration value of the target hardware. This enables fast migration of the model to different hardware devices, improves the deployment efficiency of small models at scale, and greatly reduces the cost of model pruning. The apparatus comprises an acquisition module 601, a clipping module 602, a judging module 603 and an adjusting module 604, whose functional principles are as follows:
The acquisition module 601 is configured to acquire the original model to be processed.
The clipping module 602 is configured to perform structural clipping processing on the original model according to a first clipping proportion preset by the original model, so as to obtain a clipped model.
The judging module 603 is configured to judge whether a current acceleration of the clipped model compared with the original model is within a target threshold range according to a post-clipping inference delay of the clipped model on the target hardware.
And the adjusting module 604 is configured to adjust the clipping ratio for the original model according to the current acceleration amount and the first clipping ratio if the current acceleration amount is not within the target threshold range, so that the acceleration amount of the clipped model obtained after clipping the original model according to the adjusted clipping ratio is within the target threshold range.
In one embodiment, the clipping module 602 is configured to obtain the delay specific gravity of inference on the target hardware of different network layers in the original model to be processed, and to perform structured clipping on the original model according to the delay specific gravity and the first clipping proportion to obtain the clipped model.
In one embodiment, the clipping module 602 is configured to obtain the average inference delay of different network layers on the target hardware and the inference sub-delay of different network layers in the original model on the target hardware. And respectively calculating the ratio between the inference sub-delays and the average inference delays of different network layers to obtain the delay proportion of the inference of the different network layers on the target hardware.
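For illustration, a minimal sketch of this delay-proportion computation, with hypothetical per-layer timings standing in for real measurements on the target hardware:

```python
import numpy as np

# Hypothetical inference sub-delays (ms) of each network layer, as would
# be measured by running the layers one by one on the target hardware.
sub_delays = np.array([4.0, 9.0, 2.0, 5.0])

avg_delay = sub_delays.mean()               # average inference delay
specific_gravity = sub_delays / avg_delay   # delay proportion per layer
print(specific_gravity)  # [0.8 1.8 0.4 1. ], averages to 1 across layers
```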
In one embodiment, the clipping module 602 is configured to determine sub-clipping ratios corresponding to different network layers according to the delay specific gravity and the first clipping ratio. And carrying out structural clipping treatment on the original model layer by layer according to the sub clipping proportion of different network layers to obtain a clipped model.
In one embodiment, the determining module 603 is configured to obtain an initial inference total delay of the original model on the target hardware and a post-clipping inference delay of the post-clipping model on the target hardware. And calculating to obtain the current acceleration of the model after cutting compared with the original model according to the reasoning delay after cutting and the initial reasoning total delay. And judging whether the current acceleration amount is within a target threshold range.
In an embodiment, the adjusting module 604 is configured to reduce the clipping ratio to the original model according to the maximum value, the current acceleration amount and the first clipping ratio if the current acceleration amount is greater than the maximum value of the target threshold range, so that the acceleration amount of the clipped model obtained after clipping the original model according to the reduced clipping ratio is within the target threshold range.
In an embodiment, the adjusting module 604 is configured to increase the clipping ratio for the original model according to the minimum value, the current acceleration amount and the first clipping ratio if the current acceleration amount is smaller than the minimum value of the target threshold range, so that the acceleration amount of the clipped model obtained after clipping the original model according to the increased clipping ratio is within the target threshold range.
In one embodiment, the target threshold range includes a specified threshold. The adjusting module 604 is configured to calculate, if the current acceleration is not equal to the specified threshold, a second clipping ratio according to the specified threshold, the current acceleration, the first clipping ratio, and a preset control coefficient, where the preset control coefficient is used to constrain a range of variation of the clipping ratio after adjustment. And carrying out structural clipping treatment on the original model according to the second clipping proportion to obtain a clipped second model. And judging whether a second acceleration amount of the second model compared with the original model is equal to a specified threshold according to a second inference delay of the second model on the target hardware. And if the second acceleration amount is not equal to the specified threshold value, continuing to execute the step of adjusting the clipping proportion until the acceleration amount of the clipped model is equal to the specified threshold value. And if the acceleration amount of the model after clipping is equal to a specified threshold value, determining the model after clipping as a target model matched with target hardware.
For a detailed description of the above model data processing apparatus 600, please refer to the description of the related method steps in the above embodiment, the implementation principle and technical effects are similar, and the detailed description of this embodiment is omitted here.
Fig. 7 is a schematic structural diagram of a cloud device 70 according to an exemplary embodiment of the present application. The cloud device 70 may be used to run the methods provided in any of the embodiments described above. As shown in fig. 7, the cloud device 70 may include a memory 704 and at least one processor 705 (one processor is taken as an example in fig. 7).
The memory 704 is configured to store computer programs and may be configured to store various other data to support operations on the cloud device 70. The memory 704 may be an object store (Object Storage Service, OSS).
The memory 704 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The processor 705 is coupled to the memory 704, and is configured to execute a computer program in the memory 704, so as to implement the solutions provided by any of the method embodiments described above, and specific functions and technical effects that can be implemented are not described herein.
Further, as shown in fig. 7, the cloud device further includes: firewall 701, load balancer 702, communication component 706, power component 703, and other components. Only some components are schematically shown in fig. 7, which does not mean that the cloud device only includes the components shown in fig. 7.
In one embodiment, the communication component 706 of fig. 7 is configured to facilitate wired or wireless communication between its host device and other devices. The host device may access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G/LTE (Long Term Evolution), 5G, or a combination thereof. In one exemplary embodiment, the communication component 706 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 706 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wide Band (UWB) technology, Bluetooth (BT) technology, and other technologies.
In one embodiment, the power component 703 of fig. 7 supplies power to the various components of its host device. The power component 703 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the host device.
The embodiment of the application also provides a computer readable storage medium, wherein computer executable instructions are stored in the computer readable storage medium, and when the processor executes the computer executable instructions, the method of any of the previous embodiments is realized.
Embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, implements the method of any of the preceding embodiments.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, e.g., the division of modules is merely a logical function division, and there may be additional divisions of actual implementation, e.g., multiple modules may be combined or integrated into another system, or some features may be omitted or not performed.
The integrated modules, which are implemented in the form of software functional modules, may be stored in a computer readable storage medium. The software functional modules described above are stored in a storage medium and include instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or processor to perform some of the steps of the methods of the various embodiments of the application.
It should be appreciated that the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of a method disclosed in connection with the present application may be embodied directly in a hardware processor for execution, or executed by a combination of hardware and software modules in a processor. The memory may include high-speed RAM (random access memory) and may further include nonvolatile memory (NVM), such as at least one magnetic disk memory; it may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, or an optical disk.
The storage medium may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). Alternatively, the processor and the storage medium may reside as discrete components in an electronic device or a master device.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware, though in many cases the former is preferred. Based on such understanding, the technical solution of the present application, or the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to perform the method of the embodiments of the present application.
In the technical scheme of the application, the related information such as user data and the like is collected, stored, used, processed, transmitted, provided, disclosed and the like, which are all in accordance with the regulations of related laws and regulations and do not violate the popular public order.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the application, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.
Claims (12)
1. A model data processing method, characterized by comprising:
acquiring an original model to be processed;
Carrying out structural cutting processing on the original model according to a first cutting proportion preset by the original model to obtain a cut model;
judging whether the current acceleration of the clipped model compared with the original model is in a target threshold range or not according to the clipping reasoning delay of the clipped model on target hardware;
And if the current acceleration is not in the target threshold range, adjusting the clipping proportion aiming at the original model according to the current acceleration and the first clipping proportion, so that the acceleration of the clipped model obtained after clipping the original model according to the adjusted clipping proportion is in the target threshold range.
2. The method according to claim 1, wherein the performing structural clipping processing on the original model according to a first clipping ratio preset by the original model to be processed to obtain a clipped model includes:
Acquiring the delay specific gravity of inference on target hardware of different network layers in the original model to be processed;
and carrying out structural cutting treatment on the original model according to the delay specific gravity and the first cutting proportion to obtain a cut model.
3. The method according to claim 2, wherein the obtaining the delay specific gravity of reasoning on the target hardware by different network layers in the original model to be processed includes:
Obtaining average reasoning delay of different network layers on the target hardware and reasoning sub-delay of different network layers in the original model on the target hardware;
And respectively calculating the ratio between the inference sub-delays of the different network layers and the average inference delays to obtain the delay proportion of the inference sub-delays of the different network layers on the target hardware.
4. The method according to claim 2, wherein the performing structural clipping processing on the original model according to the delay specific gravity and the first clipping ratio to obtain a clipped model includes:
Determining sub-clipping ratios corresponding to different network layers according to the delay proportion and the first clipping ratio;
And carrying out structural clipping treatment on the original model layer by layer according to the sub clipping proportion of the different network layers to obtain a clipped model.
5. The method of claim 1, wherein said determining whether the current acceleration of the cropped model compared to the original model is within a target threshold based on a post-cropping inference delay of the cropped model on target hardware comprises:
Acquiring initial reasoning total delay of the original model on the target hardware and the post-clipping reasoning delay of the post-clipping model on the target hardware;
Calculating to obtain the current acceleration of the model after cutting compared with the original model according to the post-cutting reasoning delay and the initial reasoning total delay;
And judging whether the current acceleration amount is within the target threshold range.
6. The method according to claim 1 or 5, wherein if the current acceleration amount is not within the target threshold range, adjusting the clipping ratio for the original model according to the current acceleration amount and the first clipping ratio so that the acceleration amount of the clipped model obtained after clipping the original model according to the adjusted clipping ratio is within the target threshold range, comprises:
and if the current acceleration amount is larger than the maximum value of the target threshold range, reducing the clipping proportion of the original model according to the maximum value, the current acceleration amount and the first clipping proportion, so that the acceleration amount of the clipped model obtained after clipping the original model according to the reduced clipping proportion is within the target threshold range.
7. The method according to claim 1 or 5, wherein if the current acceleration amount is not within the target threshold range, adjusting the clipping ratio for the original model according to the current acceleration amount and the first clipping ratio so that the acceleration amount of the clipped model obtained after clipping the original model according to the adjusted clipping ratio is within the target threshold range, comprises:
And if the current acceleration amount is smaller than the minimum value of the target threshold range, increasing the clipping proportion of the original model according to the minimum value, the current acceleration amount and the first clipping proportion, so that the acceleration amount of the clipped model obtained after clipping the original model according to the increased clipping proportion is within the target threshold range.
8. The method of claim 1 or 5, wherein the target threshold range comprises a specified threshold; if the current acceleration is not in the target threshold range, adjusting the clipping ratio of the original model according to the current acceleration and the first clipping ratio, so that the acceleration of the clipped model obtained after clipping the original model according to the adjusted clipping ratio is in the target threshold range, including:
If the current acceleration is not equal to the specified threshold, calculating to obtain a second clipping ratio according to the specified threshold, the current acceleration, the first clipping ratio and a preset control coefficient, wherein the preset control coefficient is used for restricting the change range of the clipping ratio after adjustment;
carrying out structural clipping treatment on the original model according to the second clipping proportion to obtain a clipped second model;
Judging whether a second acceleration amount of the second model compared with the original model is equal to the specified threshold according to a second inference delay of the second model on target hardware;
If the second acceleration amount is not equal to the specified threshold, continuing to execute the step of adjusting the clipping proportion until the acceleration amount of the clipped model is equal to the specified threshold;
and if the acceleration amount of the cut model is equal to the specified threshold value, determining the cut model as a target model matched with the target hardware.
9. A model data processing apparatus, characterized by comprising:
the acquisition module is used for acquiring an original model to be processed;
the clipping module is used for carrying out structural clipping treatment on the original model according to a first clipping proportion preset by the original model to obtain a clipped model;
The judging module is used for judging whether the current acceleration of the clipped model compared with the original model is in a target threshold range or not according to the clipping reasoning delay of the clipped model on target hardware;
And the adjusting module is used for adjusting the clipping proportion aiming at the original model according to the current acceleration and the first clipping proportion if the current acceleration is not in the target threshold range, so that the acceleration of the clipped model obtained after clipping the original model according to the adjusted clipping proportion is in the target threshold range.
10. An electronic device, comprising:
At least one processor; and
A memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to cause the electronic device to perform the method of any one of claims 1-8.
11. A computer readable storage medium having stored therein computer executable instructions which when executed by a processor implement the method of any of claims 1-8.
12. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-8.