WO2021195643A1 - Compression of convolutional neural networks by pruning - Google Patents

Compression of convolutional neural networks by pruning

Info

Publication number
WO2021195643A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
neural network
pruning
filters
target
Prior art date
Application number
PCT/US2021/030480
Other languages
English (en)
Inventor
Bochen GUAN
Qinwen Xu
Weiyi Li
Original Assignee
Innopeak Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Priority to PCT/US2021/030480 priority Critical patent/WO2021195643A1/fr
Publication of WO2021195643A1 publication Critical patent/WO2021195643A1/fr


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions

Definitions

  • This application relates generally to deep learning technology including, but not limited to, methods, systems, and non-transitory computer-readable media for modifying neural networks to reduce computational resource usage and improve efficiency of the neural networks.
  • Deployment of deep convolutional neural networks (CNNs) is often costly because such CNNs use many filters involving a large number of trainable parameters.
  • Pruning techniques have been developed to remove unimportant filters in CNNs according to certain metrics. For example, weight decay is used to increase a sparsity level of connections in the CNNs, and a structured sparsity can also be applied to regularize weights.
  • Most pruning techniques focus on the entire model and popular public datasets, require extended pruning time, and fail to converge when such techniques are applied to prune practical models that contain several networks and have complicated functions.
  • Various implementations of this application are directed to improving efficiency of a neural network by pruning filters, thereby reducing model storage usage and computation resource usage in subsequent data inference.
  • the core of filter pruning is a search problem: identifying a subset of filters to remove so as to improve the compression level of filters while limiting the loss in computational accuracy.
  • a neural network is pruned gradually in a sequence of pruning operations to achieve a target model size, rather than being pruned once via a single pruning operation. Particularly, in some situations, the neural network is dilated prior to being pruned with the sequence of pruning operations.
  • each layer of filters is assigned different importance coefficients, such that each filter is associated with an importance score determined based on the layer-based importance coefficients and is ranked accordingly for filter pruning.
  • a neural network is divided into a plurality of subsets, and each subset is pruned in the context of the entire neural network. The pruned subsets are then combined to form a target neural network.
  • a method is implemented at a computer system for network pruning.
  • the method includes obtaining a neural network model having a plurality of layers, each layer including a respective number of filters, and identifying a target model size to which the neural network model is compressed.
  • the method further includes deriving one or more intermediate model sizes from the target model size of the neural network model.
  • the one or more intermediate model sizes and the target model size form an ordered sequence of model sizes.
  • the method further includes implementing a sequence of pruning operations, and each pruning operation corresponds to a respective model size in the ordered sequence of model sizes.
  • the method further includes, for each pruning operation, identifying a respective subset of filters of the neural network model to be removed based on the respective model size and updating the neural network model by pruning the respective subset of filters, thereby reducing a size of the neural network model to the respective model size.
  • the updated neural network model of each pruning operation is trained according to a predefined loss function.
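  • As a concrete illustration of this multistep pruning, the following Python sketch walks a model through an ordered sequence of intermediate sizes and fine-tunes after each pruning operation; it is a minimal, non-authoritative example, and the linear size schedule as well as the prune_to_size and fine_tune callbacks are hypothetical placeholders rather than the implementation described in this application.

```python
def derive_intermediate_sizes(initial_size, target_size, num_steps):
    """Hypothetical schedule: an ordered sequence of model sizes that
    decreases linearly from the initial size down to the target size."""
    step = (initial_size - target_size) / num_steps
    return [int(initial_size - step * (i + 1)) for i in range(num_steps)]

def multistep_prune(model, initial_size, target_size, num_steps,
                    prune_to_size, fine_tune):
    """Sequence of pruning operations: each operation prunes the model to the
    next size in the ordered sequence, and the updated model is then trained
    (fine-tuned) with the predefined loss function.
    prune_to_size(model, size) and fine_tune(model) are caller-supplied."""
    for size in derive_intermediate_sizes(initial_size, target_size, num_steps):
        model = prune_to_size(model, size)  # remove a subset of filters
        model = fine_tune(model)            # train after each pruning operation
    return model
```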
  • a method is implemented at a computer system for network pruning.
  • the method includes obtaining a neural network model having a plurality of layers. Each layer has a respective number of filters.
  • the method further includes pruning the neural network model into a plurality of pruned neural network models.
  • the method includes, for each pruned neural network model, assigning a respective distinct set of importance coefficients to the plurality of layers, determining an importance score of each filter based on a respective subset of importance coefficients of a respective layer to which the respective filter belongs, ranking the filters based on the importance score of each filter, and in accordance with ranking of the filters, pruning the neural network model to a respective pruned neural network model by removing a respective subset of filters.
  • a method is implemented at a computer system for pruning a neural network.
  • the method includes obtaining a neural network model having a plurality of layers. Each layer has a respective number of filters.
  • the method further includes dividing the neural network model into a plurality of neural network subsets. Each neural network subset includes a subset of distinct and consecutive layers of the neural network model.
  • the method further includes separately pruning each neural network subset while maintaining remaining neural network subsets in the neural network model and combining each pruned neural network subset to generate a target neural network model.
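  • A minimal sketch of this subset-based procedure is shown below (Python; the layer grouping, the prune_subset callback, and the fine_tune_full_model callback are assumptions for illustration, not the claimed implementation): the model's layers are grouped into distinct, consecutive subsets, each subset is pruned in its own pipeline while the other subsets are left unchanged, and the pruned subsets are then combined into the target model.

```python
import copy

def divide_into_subsets(layers, group_size):
    """Group distinct, consecutive layers into NN subsets (e.g., eight layers
    with group_size=2 yield four subsets)."""
    return [layers[i:i + group_size] for i in range(0, len(layers), group_size)]

def subset_based_pruning(layers, group_size, prune_subset, fine_tune_full_model):
    """Prune each NN subset separately while the remaining subsets stay
    unchanged, then combine the pruned subsets into the target model.
    prune_subset and fine_tune_full_model are caller-supplied callbacks;
    layers are represented abstractly here."""
    subsets = divide_into_subsets(layers, group_size)
    pruned_subsets = []
    for idx in range(len(subsets)):
        pipeline_model = copy.deepcopy(subsets)                  # one pruning pipeline
        pipeline_model[idx] = prune_subset(pipeline_model[idx])  # prune only this subset
        fine_tune_full_model(pipeline_model)                     # train in the context of the full model
        pruned_subsets.append(pipeline_model[idx])
    # Combine the pruned subsets end to end to form the target NN model.
    return [layer for subset in pruned_subsets for layer in subset]
```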
  • some implementations include a computer system including one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
  • some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
  • Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
  • Figure 2 is a block diagram illustrating a data processing system, in accordance with some embodiments.
  • Figure 3 is an example data processing environment for training and applying a neural network based (NN-based) data processing model for processing visual and/or audio data, in accordance with some embodiments.
  • Figure 4A is an example neural network (NN) applied to process content data in an NN-based data processing model, in accordance with some embodiments
  • Figure 4B is an example node in the neural network (NN), in accordance with some embodiments.
  • Figure 5 is a flow diagram of a comprehensive process for simplifying a first neural network (NN) model, in accordance with some embodiments.
  • Figure 6 is a flow diagram of a subset-based filter pruning process for simplifying a neural network model, in accordance with some embodiments.
  • Figure 7 is a flow diagram of a pruning pipeline applied to simplify each NN subset of a first NN network shown in Figure 6, in accordance with some embodiments.
  • Figure 8 is a flow diagram of a post-pruning process for improving model performance of an NN model (e.g., the second NN network in Figure 6) based on a precision setting of a client device 104, in accordance with some embodiments.
  • Figure 9A is a flow diagram of an importance-based filter pruning process for simplifying an NN model, in accordance with some embodiments
  • Figure 9B is a table 950 of two pruning settings defining importance coefficients for a plurality of layers of the NN model shown in Figure 9A, in accordance with some embodiments.
  • Figure 10A is a flow diagram of a multistep filter pruning process for simplifying a first NN model to a target NN model, in accordance with some embodiments.
  • Figure 10B is a flow diagram of another multistep filter pruning process involving model dilation, in accordance with some embodiments.
  • Figures 11-13 are three flow diagrams of three filter pruning methods, in accordance with some embodiments.
  • FIG. 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • the one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, or intelligent, multi-sensing, network-connected home devices (e.g., a camera).
  • Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102.
  • the one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • the one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
  • the client devices 104 include a game console that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • the client devices 104 include a networked surveillance camera and a mobile phone 104C.
  • the networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera in real time and remotely.
  • the one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100.
  • the one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
  • the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
  • a connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • the one or more communication networks 108 can represent the Internet or a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
  • Deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video, image, audio, or textual data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data.
  • data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data.
  • both model training and data processing are implemented locally at each individual client device 104 (e.g., the client device 104C).
  • the client device 104C obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Subsequent to model training, the client device 104C obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the trained data processing models locally. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g., the client device 104A). The server 102A obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models.
  • the client device 104A obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results from the server 102A, and presents the results on a user interface (e.g., associated with the application).
  • the client device 104A itself implements no or little data processing on the content data prior to sending it to the server 102A.
  • data processing is implemented locally at a client device 104 (e.g., the client device 104B), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104B.
  • the server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models.
  • the trained data processing models are optionally stored in the server 102B or storage 106.
  • the client device 104B imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface locally.
  • FIG. 2 is a block diagram illustrating a data processing system 200, in accordance with some embodiments.
  • the data processing system 200 includes a server 102, a client device 104, a storage 106, or a combination thereof.
  • the data processing system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • the data processing system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls.
  • the client device 104 of the data processing system 200 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard.
  • the client device 104 includes one or more cameras, scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices.
  • the data processing system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • the client device 104 includes a location detection device, such as a GPS (global positioning satellite) or other geo-location receiver, for determining the location of the client device 104.
  • GPS global positioning satellite
  • Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices.
  • Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202.
  • Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium.
  • memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks
  • Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
  • Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
  • Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
  • One or more user applications 224 for execution by the data processing system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
  • Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
  • Data processing module 228 for processing content data using data processing models 240, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;
  • One or more databases 230 for storing at least data including one or more of:
    o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
    o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
    o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server and host name;
    o Training data 238 for training one or more data processing models 240;
    o Data processing model(s) 240 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques; and
    o Content data and results 242 that are obtained by and outputted to the client device 104 of the data processing system 200, respectively, where the content data is processed by the data processing models 240 locally at the client device 104 or remotely at the server 102.
  • the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the data processing system 200.
  • the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the data processing system 200.
  • more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored at the server 102 and storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • the above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules, or data structures, and thus various subsets of these modules may be combined or otherwise rearranged in various embodiments.
  • memory 206, optionally, stores a subset of the modules and data structures identified above.
  • memory 206, optionally, stores additional modules and data structures not described above.
  • FIG. 3 is another example data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments.
  • the data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240.
  • both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104.
  • the training data source 304 is optionally a server 102 or storage 106.
  • both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300.
  • the training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106.
  • the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.
  • the model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312.
  • the data processing model 240 is trained according to a type of the content data to be processed.
  • the training data 306 is consistent with the type of the content data, and so is the data pre-processing module 308 that is applied to process the training data 306.
  • an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size.
  • an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform.
  • the model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item.
  • the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item.
  • the model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold).
  • the modified data processing model 240 is provided to the data processing module 228 to process the content data.
  • the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
  • the data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318.
  • the data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of: video, image, audio, textual, and other types of data.
  • each image is pre-processed to extract an ROI or cropped to a predefined image size
  • an audio clip is pre-processed to convert to a frequency domain using a Fourier transform.
  • the content data includes two or more types, e.g., video data and textual data.
  • the model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data.
  • the model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240.
  • the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
  • Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments
  • Figure 4B is an example node 420 in the neural network (NN) 400, in accordance with some embodiments.
  • each node 420 of the NN 400 corresponds to a filter.
  • “Channel”, “filter”, “neuron”, and “node” are used interchangeably in the context of pruning of the NN 400.
  • the data processing model 240 is established based on the neural network 400.
  • a corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format.
  • the neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs. As the node output is provided via one or more links 412 to one or more other nodes 420, a weight w associated with each link 412 is applied to the node output. Likewise, the one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function. In an example, the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
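  • The node computation described above can be illustrated with a small numeric example (Python; the sigmoid activation and the specific input and weight values are chosen only for illustration): the node output is the non-linear activation applied to the weighted combination of the node inputs plus a bias term.

```python
import math

def node_output(inputs, weights, bias=0.0):
    """Single-node propagation function: a non-linear activation (sigmoid here,
    purely as an example) applied to the linear weighted combination of the
    node inputs plus a bias term b."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Example: four node inputs combined with weights w1..w4 and a small bias.
print(node_output([0.5, -1.0, 2.0, 0.1], [0.2, 0.4, -0.3, 0.8], bias=0.1))
```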
  • the collection of nodes 420 is organized into one or more layers in the neural network 400.
  • the one or more layers includes a single layer acting as both an input layer and an output layer.
  • the one or more layers includes an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406.
  • a deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer.
  • a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer.
  • one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for downsampling or pooling the nodes 420 between these two layers.
  • max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
  • a convolutional neural network is applied in a data processing model 240 to process content data (particularly, video and image data).
  • the CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406.
  • the one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product.
  • Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network.
  • Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN.
  • the pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map.
  • a recurrent neural network is applied in the data processing model 240 to process content data (particularly, textual and audio data).
  • Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior.
  • each node 420 of the RNN has a time-varying real-valued activation.
  • the RNN examples include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM network), an echo state network, an independently RNN (IndRNN), a recursive neural network, and a neural history compressor.
  • the RNN can be used for handwriting or speech recognition.
  • the training process is a process for calibrating all of the weights wi for each layer of the learning model using a training data set that is provided in the input layer 402.
  • the training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied.
  • In forward propagation, the set of weights for different layers is applied to the input data and intermediate results from the previous layers.
  • In backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error.
  • the activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types.
  • a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied.
  • the network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data.
  • the result of the training includes the network bias parameter b for each layer.
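  • The forward/backward propagation cycle described above corresponds to a standard training loop; a minimal PyTorch-style sketch follows (the loss function, optimizer, and learning rate are generic placeholders rather than choices made in this application).

```python
import torch
import torch.nn as nn

def train(model, data_loader, num_epochs=10, lr=1e-3):
    """Minimal forward/backward propagation loop: apply the weights to the
    input data (forward), measure a margin of error via a loss function,
    and adjust the weights to decrease the error (backward)."""
    criterion = nn.CrossEntropyLoss()                 # example loss function
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(num_epochs):
        for inputs, targets in data_loader:
            outputs = model(inputs)                   # forward propagation
            loss = criterion(outputs, targets)        # margin of error
            optimizer.zero_grad()
            loss.backward()                           # backward propagation
            optimizer.step()                          # adjust weights
    return model
```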
  • FIG. 5 is a flow diagram of a comprehensive process 500 for simplifying a first neural network (NN) model 502, in accordance with some embodiments.
  • the first NN model 502 includes a plurality of layers 504, and each layer has a plurality of filters 506.
  • a model compression module 508 is configured to simplify the first NN model 502 to a second NN model 510.
  • the model compression module 508 applies one or more of: pruning, distillation, or quantization. For example, weights deemed as unnecessary to the first NN model 502 are removed via pruning.
  • the first NN model 502 has a first number of filters 506 associated with a second number of weights
  • the second NN model 510 has a third number of filters 506 associated with a fourth number of weights.
  • the second number is less than the first number.
  • the first and third numbers are equal; however, the fourth number is less than the second number.
  • floating-point numbers of the first NN model 502 are approximated with lower bit width numbers by quantization.
  • distillation is applied to transfer knowledge from the first NN model 502 to the second NN model 510 by narrowing a difference between outputs of the first and second NN models 502 and 510. It is noted that in some embodiments, two or three of pruning, quantization, and distillation are implemented jointly to simplify the first NN model 502.
  • the model compression module 508 is part of a model training module 226
  • the model compression module 508 is implemented on a server system 102, e.g., using training data provided by the server system 102 or by a storage 106.
  • the server system 102 generates the first NN model 502 by itself or obtains the first NN model 502 from a distinct server 102, storage 106, or client device 104.
  • the second NN model 510 is provided to a client device 104 to be applied for data inference.
  • the client device 104 provided with the second NN model 510 includes a mobile device having a limited computational and/or storage capability.
  • the first NN model 502 cannot operate efficiently on the client device 104.
  • the server system 102 is configured to simplify the first NN model 502 to the second NN model 510 in response to a model simplification request received from the client device 104, and the model simplification request includes information associated with the limited computational and/or storage capability of the client device 104.
  • the server system 102 is configured to pre-simplify the first NN model 502 to one or more NN models including the second NN model 510 and select the second NN model 510 in response to receiving a model simplification request from the client device 104.
  • FIG. 6 is a flow diagram of a subset-based filter pruning process 600 for simplifying a neural network model 502, in accordance with some embodiments.
  • the first NN model 502 is divided into a plurality of NN subsets 602 (e.g., 602-1, 602-2, ... 602-N), and each NN subset 602 includes a subset of distinct and consecutive layers of the first NN model 502.
  • Each layer 504 only belongs to a single NN subset 602, and no layer or filter belongs to two NN subsets 602.
  • the plurality of NN subsets 602 cover less than all layers of the first NN model 502 (e.g., 7 of all 8 layers).
  • the plurality of NN subsets 602 cover all layers of the first NN model 502. For example, if the first NN model 502 includes eight layers 504 and every two layers are grouped to a respective NN subset 602, the eight layers 504 are grouped to four NN subsets in total.
  • each of the NN subsets 602 is separately pruned while remaining NN subsets 602 in the first NN model 502 remain unchanged. During the pruning operation, each NN subset 602 is trained to minimize a loss function of the first NN model 502 that combines the pruned respective NN subset 602 and unchanged remaining NN subsets 602.
  • the pruned NN subsets 602 are extracted from the first NN model 502 by the plurality of pruning pipelines 606, and combined to generate the second NN network 510 as a target NN model 604. Stated another way, the pruned NN subsets 602 are connected into an end-to-end one-stage network that is optionally trained jointly and again based on the loss function.
  • a target NN model 604 is provided to and applied by a client device 104 for data inference.
  • the first NN model 502 is divided into a first NN subset 602-1, a second NN subset 602-2, ..., and an N-th NN subset 602-N.
  • These NN subsets 602 are pruned and trained in the context of the first NN model 502 by a plurality of pruning pipelines 606 that are executed separately and independently of each other.
  • the plurality of pruning pipelines 606 are executed concurrently and in parallel to one another, e.g., by a plurality of distinct processing units.
  • In a first pruning pipeline 606A, the first NN model 502A is trained based on a loss function, while the first NN subset 602-1 is being modified (e.g., to remove a first subset of filters 506 in the first NN subset 602-1 and/or set to zero a subset of weights of each of a second subset of filters 506 in the first NN subset 602-1).
  • the second NN subset 602-2 and any other remaining NN subset 602 are not modified.
  • In a second pruning pipeline 606B, the first NN model 502B is trained based on the same loss function, while the second NN subset 602-2 in the first NN model 502B is being modified (e.g., to remove a third subset of filters 506 in the second NN subset 602-2 and/or set to zero a subset of weights of each of a fourth subset of filters 506 in the second NN subset 602-2).
  • In the second pruning pipeline 606B, the first NN subset 602-1 and any other remaining NN subset 602 (e.g., the N-th NN subset 602-N) are not modified.
  • each of the remaining NN subsets 602 is similarly pruned like the first and second NN subsets 602-1 and 602-2.
  • In each pruning pipeline 606, the first NN model 502 has only one NN subset 602 pruned and the remaining NN subsets unchanged. Pruning only one NN subset 602 at a time retains the overall accuracy of the first NN model 502, because all other unpruned NN subsets 602 are already well-trained. A size of the pruned NN subset 602 is reduced, and the computational resources needed for data inference also drop. A corresponding data inference accuracy may be slightly compromised for the pruned NN subset 602, and therefore, the pruned NN subset 602 is not used in a different pipeline 606 to prune any other NN subset 602.
  • Because the plurality of pruning pipelines 606 prune each of the NN subsets 602 in the context of the first NN model 502 separately, the first NN model 502, which has a relatively large model size, does not need to be pruned as a whole; the pruning task is instead divided among the plurality of pruning pipelines 606 in a manageable and efficient manner.
  • the pruning method of each pruning pipeline 606 (i.e., the pruning method 706 in Figure 7) can be flexibly selected, and does not have to be optimized in many situations. As a result, a potentially heavy pruning process is simplified to multi-model pruning, which is progressive and easy to retrain and fine-tune.
  • the plurality of NN subsets are not shown.
  • the plurality of NN subsets 602 cover all layers of the first NN model 502, and no unpruned NN subset exists. After each NN subset 602 is pruned in the respective pruning pipeline 606, the respective NN subsets 602 are combined to one another without any unpruned NN subset to form the target NN model 604.
  • Figure 7 is a flow diagram of a pruning pipeline 606 applied to simplify each NN subset of the first NN model 502 shown in Figure 6, in accordance with some embodiments.
  • the pruning pipeline 606 starts with a first NN model 502 including an NN subset 702 to be pruned and one or more unchanged NN subsets 704.
  • the NN subset 702 is optionally the first NN subset 602-1, second NN subset 602-2, ..., or N-th NN subset 602-N.
  • a pruning method 706 is applied to remove a subset of filters 506 in the NN subset 702 without changing the unchanged NN subset(s) 704. For example, in accordance with the pruning method 706, an importance score is determined for each of the filters in the NN subset 702 by combining weights associated with the respective filter, and the subset of filters 506 having the lowest importance scores are selected to suppress the corresponding floating point operations per second (FLOPs) of the first NN model 502 below a target FLOPS number that measures computational resource usage of the first NN model 502.
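  • The selection step of the pruning method 706 can be sketched as follows (Python, assuming the layers are torch.nn.Conv2d modules; the L2-norm score and the estimate_flops cost model are illustrative assumptions): filters are ranked by an importance score computed from their weights, and the lowest-scoring filters are marked for removal until the estimated FLOPs fall below the target.

```python
def select_filters_to_prune(conv_layers, target_flops, estimate_flops):
    """Rank filters of the NN subset by an importance score (here the L2 norm
    of each filter's weights, one possible way to combine the weights) and
    mark the lowest-scoring filters for removal until the estimated FLOPs of
    the model drop below the target FLOPs number.
    conv_layers: list of torch.nn.Conv2d modules (assumed).
    estimate_flops(pruned): caller-supplied cost model (assumed)."""
    scores = []  # (score, layer_index, filter_index)
    for li, layer in enumerate(conv_layers):
        # layer.weight has shape (out_channels, in_channels, kH, kW)
        for fi in range(layer.weight.shape[0]):
            scores.append((layer.weight[fi].norm(p=2).item(), li, fi))
    scores.sort()  # lowest importance first

    pruned = set()
    for _, li, fi in scores:
        if estimate_flops(pruned) <= target_flops:
            break
        pruned.add((li, fi))  # mark this filter for removal
    return pruned
```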
  • a pruned first NN model 708 is outputted without being re-trained or fine-tuned.
  • the pruned first NN model 708 is further tuned (710).
  • the target NN model 604 (i.e., the second NN model 510) is provided to a client device 104 having a filter setting, e.g., which fits a register length of a single instruction multiple data (SIMD) computer structure.
  • the client device 104 uses a CPU to run the target NN model 604, and the filter setting of the client device 104 requires the number of filters in each layer of the pruned NN subset 702 to be a multiple of 4.
  • the respective number of filters of each layer of the pruned NN subset 702 is expanded to at least the nearest multiple of 4, e.g., from 27 filters to 28 filters.
  • the client device 104 uses a graphics processing unit (GPU) to run the target NN model, and the filter setting of the client device 104 requires the number of filters in each layer of the pruned NN subset 702 to be a multiple of 16.
  • the respective number of filters of each layer of the pruned NN subset 702 has to be expanded to the nearest multiple of 16, e.g., from 27 filters to 32 filters.
  • the client device 104 uses a digital signal processor (DSP) to run the target NN model 604, and the filter setting of the client device 104 requires the number of filters in each layer of the pruned NN subset 702 to be a multiple of 32.
  • the respective number of filters of each layer of the pruned NN subset 702 has to be expanded to the nearest multiple of 32, e.g., from 27 filters to 32 filters.
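  • Expanding a pruned layer to the nearest hardware-friendly filter count, as in the CPU, GPU, and DSP examples above, amounts to rounding up to a multiple; a small helper is sketched below (Python, for illustration only).

```python
def align_filter_count(num_filters, multiple):
    """Expand a layer's filter count to the nearest multiple required by the
    client device's filter setting (e.g., 4 for a SIMD CPU, 16 for a GPU,
    32 for a DSP)."""
    remainder = num_filters % multiple
    return num_filters if remainder == 0 else num_filters + (multiple - remainder)

# Examples matching the text: 27 filters -> 28 (multiple of 4), 32 (multiple of 16 or 32).
assert align_filter_count(27, 4) == 28
assert align_filter_count(27, 16) == 32
assert align_filter_count(27, 32) == 32
```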
  • the server system 102 is configured to simplify the first NN model 502 to the target NN model in response to a model simplification request received from the client device 104, and the model simplification request includes information of the filter setting of the client device 104.
  • the server system 102 is configured to pre-simplify the first NN model 502 to a plurality of NN model options based on a plurality of known filter settings that are often used by different client devices 104, and select the target NN model from the NN model options in response to receiving a model simplification request from the client device 104.
  • an L2 norm regularization is added to the pruning pipeline 606 and applied (710) to a predefined loss function associated with the pruned first NN model 708.
  • the L2 norm regularization corresponds to a term dedicated to weights of filters 506 of the respective pruned NN subset 602.
  • the term is associated with a square of the weights of filters 506 of the respective pruned NN subset 602.
  • the predefined loss function includes a first loss function.
  • Prior to dividing the first NN model 502, the first NN network model is trained according to a second loss function, and the first loss function is a combination of the second loss function and the term.
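  • A minimal sketch of this regularized loss is given below (Python, assuming the weights are torch tensors; the regularization strength is an assumed hyperparameter, not a value specified in this application): the first loss function is the original (second) loss plus an L2 term over the weights of the pruned NN subset's filters.

```python
def first_loss(second_loss, pruned_subset_weights, reg_strength=1e-4):
    """First loss function = the original (second) loss function plus an L2
    term dedicated to the weights of the filters of the pruned NN subset.
    pruned_subset_weights: iterable of torch tensors (assumed).
    reg_strength: assumed regularization strength, not a value from the patent."""
    l2_term = sum((w ** 2).sum() for w in pruned_subset_weights)
    return second_loss + reg_strength * l2_term

# Usage sketch: loss = first_loss(criterion(outputs, targets),
#                                 list(pruned_subset.parameters()))
```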
  • the pruned first NN model 708 is trained and fine-tuned (712) to provide an intermediate first NN network 714 having a newly pruned NN subset 702 in each pruning pipeline 606.
  • the L2 norm regularization controls a dynamic range of weights, thereby reducing an accuracy drop after quantization.
  • the newly pruned NN subsets 702 are obtained for the plurality of NN subsets of the first NN model 502 from the plurality of pruning pipelines 606, respectively.
  • Each of these pruned NN subsets 702 is extracted from the intermediate first NN network 714 in each pruning pipeline 606, and ready to be combined into the target NN model 604 (i.e., the second NN model 510) that is used by the client device 104.
  • FIG. 8 is a flow diagram of a post-pruning process 800 for improving model performance of an NN model 802 (e.g., the second NN model 510 in Figure 6) based on a precision setting of a client device 104, in accordance with some embodiments.
  • weights associated with filters of the first NN model 502 maintain a float32 format while the plurality of NN subsets 602 are separately pruned in the plurality of pruning pipelines 606.
  • the weights of the un-pruned filters 506 in the NN model 802 are quantized to provide the target NN model 804.
  • the weights are quantized from the float32 format to an int8, uint8, int16, or uint16 format based on the precision setting of the client device 104.
  • the client device 104 uses a CPU to run the target NN model 804, and the CPU of the client device 104 processes 32 bit data.
  • the weights of the NN model 802 are not quantized, and the NN model 802 is provided to the client device 104 directly.
  • the client device 104 uses one or more GPUs to run the target NN model, and the GPU(s) process 16 bit data.
  • the weights of the NN model 802 are quantized to an int16 format, thereby converting the NN model 802 to the target NN model 804.
  • the client device 104 uses a DSP to run the target NN model, and the DSP processes 8 bit data.
  • the weights of the target NN model are quantized to an int8 format, thereby converting the NN model 802 to the target NN model 804.
  • After quantization of the weights, e.g., to a fixed 8-bit format, the target NN model 804 has fewer MACs and a smaller size, and is hardware-friendly during deployment on the client device 104.
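  • A generic sketch of such post-training weight quantization is shown below (Python with NumPy; symmetric per-tensor int8 quantization is used only as an example and is not necessarily the exact procedure of this application).

```python
import numpy as np

def quantize_weights_int8(weights_fp32):
    """Symmetric post-training quantization of float32 weights to int8:
    the largest weight magnitude is mapped to 127, other weights are scaled
    and rounded. Returns the quantized weights and the dequantization scale."""
    scale = max(np.abs(weights_fp32).max(), 1e-8) / 127.0
    q = np.clip(np.round(weights_fp32 / scale), -128, 127).astype(np.int8)
    return q, scale

# Example: quantize a small float32 weight vector and reconstruct it.
w = np.array([0.31, -1.2, 0.05, 0.77], dtype=np.float32)
q, scale = quantize_weights_int8(w)
w_reconstructed = q.astype(np.float32) * scale
```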
  • the server system 102 is configured to simplify the first NN model 502 to the target NN model 804 in response to a model simplification request received from the client device 104, where the model simplification request includes information of the precision setting of the client device 104.
  • the server system 102 is configured to quantize the second NN model 510 pruned from the first NN model 502 to a plurality of NN model options based on a plurality of known precision settings that are often used by different client devices 104, and select the target NN model 804 from the NN model options in response to receiving a model simplification request from the client device 104.
  • a progressive multi-model compression pipeline (Figure 6) is established for a deep neural network using multi-model parallel pruning (Figure 7).
  • fixed 8-bit end-to-end post-training quantization (Figure 8) is applied to further compress the deep neural network.
  • the L2 norm regularization (710) is optionally applied during pruning to facilitate subsequent quantization, while filter alignment is also used during pruning to improve hardware friendliness of the resulting target NN model.
  • When the deep neural network is pruned to be friendly to the SIMD structure, it fits a corresponding hardware accelerator and increases deployment efficiency.
  • Various implementations of this application are directed to improving efficiency of a neural network by pruning filters, thereby reducing model storage usage and computational resource usage during a data inference stage.
  • the core of filter pruning is a search problem to identify a subset of filters to be removed for the purpose of achieving a certain compression level for filters with an acceptable level of loss in computational accuracy.
  • Filter pruning is classified into predefined pruning and adaptive pruning. In predefined pruning, different metrics are applied to evaluate the importance of filters within each layer locally without changing a training loss. Model performance can be enhanced by fine-tuning after filter pruning. For example, an L2-norm of filter weights can be used as an importance measure. Alternatively, a difference between unpruned and pruned neural networks is measured and applied as an importance score. In another example, a rank of feature maps is used as an importance measure. The rank of the feature maps can provide more information than L1/L2 norms and achieve better compression results.
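  • The rank-of-feature-maps metric mentioned above can be sketched as follows (PyTorch; a hedged illustration, not the implementation of this application): a batch of predefined images is passed through the network, the matrix rank of each filter's output feature map is computed, and the ranks are averaged over the batch to score the filter.

```python
import torch

def average_feature_map_rank(feature_maps):
    """Average matrix rank of one filter's output feature maps over a batch.
    feature_maps: tensor of shape (batch, H, W) produced by that filter
    (assumed); a higher average rank is treated as higher filter importance."""
    ranks = [torch.linalg.matrix_rank(fm.float()).item() for fm in feature_maps]
    return sum(ranks) / len(ranks)

# Usage sketch (hypothetical): hook a convolutional layer, collect its output
# feature maps for a batch of predefined images, and score each output channel.
```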
  • a pruned structure is learned automatically when a hyperparameter (e.g., importance coefficients ai and bi) is given to determine a computation complexity.
  • An adaptive pruning method can embed a pruning demand into the training loss and employ joint-retraining optimization to find an adaptive decision. For example, Lasso regularization is used with a filter norm to force filter weights to zeros. Lasso regularization is added on a batch normalization layer to achieve pruning during training.
  • a scaling factor parameter is used to learn sparse structure pruning where filters corresponding to a scaling factor of zero are removed.
  • AutoML is applied for automatic network compression. The rationale is to explore the total space of network configurations for a final best candidate.
  • Figure 9A is a flow diagram of an importance-based filter pruning process 900 for simplifying an NN model 902, in accordance with some embodiments
  • Figure 9B is a table 950 of two pruning settings 952 and 954 defining importance coefficients for a plurality of layers 906 of the NN model 902 shown in Figure 9A, in accordance with some embodiments.
  • the filter pruning process 900 is implemented at a server system 102 to prune the NN model 902 to a target NN model 904, and the target NN model 904 is provided to a client device 104.
  • the filter pruning process 900 is implemented directly at the client device 104 to prune the NN model 902 to the target NN model 904.
  • the NN model 902 has a plurality of layers 906, and each layer 906 has a respective number of filters 908.
  • the NN model 902 is pruned to a plurality of pruned NN models 910.
  • each of the plurality of pruned NN models 910 has a pruned number of filters 908, and the NN model 902 has a first number of filters.
  • a difference of the pruned number and the first number is equal to a predefined difference value or a predefined percentage of the first number.
  • each of the plurality of pruned NN models 910 can be operated with a respective FLOPS number that is equal to or less than a predefined FLOPS number, and a respective subset of filters 908 are removed from the NN model 902 to obtain the respective pruned NN model 910 corresponding to the respective FLOPS number.
  • the target NN model 904 is selected from the plurality of pruned neural network models 910 based on a model selection criterion, e.g., by Auto Machine Learning (AutoML).
  • a respective distinct set of importance coefficients are assigned to each of the plurality of layers 906 in the NN model 902.
  • a first layer 906A is assigned a first set of importance coefficients (e.g., a1 and b1),
  • a second layer 906B is assigned a second set of importance coefficients (e.g., a2 and b2),
  • a third layer 906C is assigned a third set of importance coefficients (e.g., a3 and b3), and
  • a fourth layer 906D is assigned a fourth set of importance coefficients (e.g., a4 and b4).
  • An importance score I is determined for each filter 908 based on the respective subset of importance coefficients of a respective layer 906 to which the respective filter 908 belongs.
  • the filter 908A is included in the third layer 906C, and an importance score I is determined based on the third subset of importance coefficients a3 and b3 assigned to the third layer 906C.
  • the filters 908 of the entire NN model 902 are ranked based on the importance score I of each filter 908. In accordance with the ranking of the filters 908, a respective subset of filters 908 is removed based on their importance scores I, thereby allowing the NN model 902 to be pruned to the respective pruned NN model 910.
  • each of the plurality of pruned NN models 910 has a pruned number of filters 908 satisfying a predefined difference value, percentage, or FLOPS number, and the pruned number of top-ranked filters 908 are selected based on the importance score / of each filter 908 to generate the respective pruned NN model 910.
  • the distinct set of importance coefficients includes a first importance coefficient a_l and a second importance coefficient b_l for each layer 906.
  • the first importance coefficient a_l is selected from a first set of importance coefficient values in a first range
  • the second importance coefficient b_l is selected from a second set of importance coefficient values in a second range.
  • the first range is equal to the second range, e.g., [0, 2] or {0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2}.
  • the first range is not equal to the second range.
  • both the first importance coefficient a_l and the second importance coefficient b_l of each layer 906 are selected from a range of [0, 1.5]
  • the importance coefficients ICS-1 of the pruning setting 952 correspond to a first pruned NN model 910A
  • the importance coefficients ICS-2 of the pruning setting 954 correspond to a second pruned NN model 910B.
  • At least one of the importance coefficients corresponding to the first pruned NN model 910A is not equal to the corresponding importance coefficient of the second pruned NN model 910B.
  • three importance coefficients differ between the first and second pruned NN models 910A and 910B, namely both importance coefficients a1 and b1 of the first layer 906A and the second importance coefficient b4 of the fourth layer 906D.
  • an average rank value is determined for each filter 908 in the NN model 902 to determine the respective importance score / of the respective filter 908.
  • the average rank value of each filter 908 is determined using a batch of predefined images. When the batch of predefined images are inputted into the NN model 902, each filter 908 outputs a feature map 912.
  • the average rank value of each filter 908 is determined based on characteristics of the feature map 912, and indicates an importance rank of the respective filter 908 for processing the batch of predefined images.
  • the average rank value of each filter 908 of the NN model 902 is an L2 norm of all weights of the respective filter 908.
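  • The following sketch illustrates one way such an average rank value could be estimated from a batch of predefined images; it assumes a PyTorch convolutional layer and uses torch.linalg.matrix_rank as an illustrative rank estimator, which is an assumption rather than the claimed implementation.

```python
# Illustrative sketch (assumes PyTorch): estimate one average rank value per
# filter by running a batch of predefined images through a convolutional layer
# and averaging the matrix rank of each output feature map over the batch.
import torch

def average_feature_map_ranks(conv_layer: torch.nn.Conv2d,
                              batch: torch.Tensor) -> torch.Tensor:
    """Return one average rank value per filter (output channel) of conv_layer."""
    with torch.no_grad():
        feature_maps = conv_layer(batch)          # shape: (N, C_out, H, W)
    _, c_out, _, _ = feature_maps.shape
    ranks = torch.zeros(c_out)
    for c in range(c_out):
        # Rank of each H x W feature map, averaged over the N images in the batch.
        per_image = torch.linalg.matrix_rank(feature_maps[:, c, :, :])
        ranks[c] = per_image.float().mean()
    return ranks
```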
  • the importance score I of each filter 908 is generated by combining the average rank value of the respective filter 908 and the respective subset of importance coefficients of the respective layer 906 to which the respective filter 908 belongs. For each pruned NN model 910, if the distinct set of importance coefficients includes a first importance coefficient a_l and a second importance coefficient b_l for each layer 906, the average rank value R_l for each filter 908 in the respective layer 906 is modified to one of a_l * R_l + b_l or a_l * (R_l + b_l).
  • Each of the plurality of pruned NN models 910 has a pruned number of filters 908, and the pruned number of top-ranked filters 908 is selected based on the importance score I of each filter 908 to generate the respective pruned NN model, while the low-ranked filters 908 are removed.
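  • A minimal sketch of the global ranking step described above, assuming the scale-and-shift form a_l * R + b_l for the importance score; the per-layer dictionaries and the helper name are hypothetical.

```python
# Illustrative sketch: combine per-layer importance coefficients (a_l, b_l) with
# per-filter average rank values into global importance scores I, then keep the
# top-ranked filters across all layers.
def global_filter_selection(rank_values, coeffs, keep_count):
    """rank_values: {layer_id: [R per filter]}; coeffs: {layer_id: (a_l, b_l)}.
    Returns the (layer_id, filter_index) pairs of the keep_count highest scores."""
    scored = []
    for layer_id, ranks in rank_values.items():
        a_l, b_l = coeffs[layer_id]
        for idx, r in enumerate(ranks):
            scored.append((a_l * r + b_l, layer_id, idx))   # importance score I
    scored.sort(key=lambda t: t[0], reverse=True)           # global ranking
    return {(layer_id, idx) for _, layer_id, idx in scored[:keep_count]}
```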
  • the target NN model 904 is selected from the plurality of pruned neural network models 910.
  • the target NN model 904 is selected from the plurality of pruned NN models 910 by training each of the plurality of pruned NN models for a predefined number of cycles and selecting the target NN model 904 whose loss function result is better than that of any other pruned NN model 910.
  • each of the plurality of pruned NN models 910 is completely trained, e.g., to minimize a corresponding loss function, and the target NN model 904 is selected if the target NN model 904 has used the least number of training cycles among the plurality of pruned NN models 910.
  • the target NN model 904 corresponds to the least number of FLOPS among the plurality of pruned NN models, e.g., when the NN model 902 is pruned to the plurality of pruned NN models 910 having the same pruned number of filters 908.
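  • One of the selection criteria above (partial training followed by loss comparison) could be sketched as follows; train_for_cycles and evaluate_loss are hypothetical helpers standing in for the actual training and evaluation steps.

```python
# Illustrative sketch: train each candidate pruned model for a predefined number
# of cycles and keep the candidate with the best (lowest) loss as the target model.
def select_target_model(pruned_models, train_for_cycles, evaluate_loss, cycles=5):
    best_model, best_loss = None, float("inf")
    for model in pruned_models:
        train_for_cycles(model, cycles)
        loss = evaluate_loss(model)
        if loss < best_loss:
            best_model, best_loss = model, loss
    return best_model
```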
  • the target NN model 904 includes a plurality of weights associated with the respective number of filters 908 of each layer 906.
  • the NN model 902 and pruned NN models 910 maintain a float32 format to obtain the target NN model 904.
  • the plurality of weights of the target NN model 904 are quantized (912), e.g., to an int8, uint8, int16, or uint16 format based on the precision setting of the client device 104.
  • the client device 104 uses a CPU to run the target NN model 904, and the CPU of the client device 104 processes 32 bit data.
  • the weights of the target NN model 904 are not quantized, and the target NN model 904 is provided to the client device 104 directly.
  • the client device 104 uses one or more GPUs to run the target NN model 904, and the GPU(s) process 16 bit data.
  • the weights of the target NN model 904 are quantized to an int16 format.
  • the client device 104 uses a DSP to run the target NN model, and the DSP processes 8 bit data.
  • the weights of the target NN model 904 are quantized to an int8 format.
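  • A simplified sketch of precision-based weight quantization consistent with the CPU/GPU/DSP examples above; the dtype mapping and the symmetric per-tensor scale scheme are assumptions made for illustration, not the claimed quantization method.

```python
# Illustrative sketch: pick a weight format from the client device's precision
# setting and apply a simple symmetric quantization with a per-tensor scale.
import numpy as np

PRECISION_TO_DTYPE = {32: None, 16: np.int16, 8: np.int8}   # None = keep float32

def quantize_weights(weights: np.ndarray, device_bits: int):
    dtype = PRECISION_TO_DTYPE[device_bits]
    if dtype is None:                        # e.g., a CPU processing 32-bit data
        return weights.astype(np.float32), 1.0
    qmax = np.iinfo(dtype).max
    scale = float(np.abs(weights).max()) / qmax
    scale = scale if scale > 0 else 1.0
    q = np.clip(np.round(weights / scale), np.iinfo(dtype).min, qmax)
    return q.astype(dtype), scale            # dequantize later as q * scale
```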
  • the filter pruning process 900 is applied based on automated learned, global, and high rank (LGHR) feature maps.
  • This filter pruning process 900 optionally takes into account both a rank of feature maps 912 and Auto Machine Learning (AutoML) filter pruning.
  • the LGHR feature maps provide a global ranking of the filters 908 across different layers 906 in the NN model 902. Hyperparameters of the LGHR feature maps (e.g., importance coefficients a_l and b_l) are automatically searched, thereby reducing the human labor and time required for parameter settings.
  • each filter 908 corresponds to a feature map generated at an output of the respective filter 908 when the respective filter 908 is applied to outputs of a previous layer 906 in the NN model 902.
  • the filter pruning process 900 is directed to global filter pruning that uses high rank feature maps, and includes three stages: a rank generation stage, a search space generation stage, and an evaluation stage.
  • Feature maps 912 are generated from the filters 908 of the NN model 902, and are used to generate the average rank values of the feature maps 912 and determine importance scores I of the filters in the NN model 902.
  • a regularized evolutionary algorithm is optionally applied to generate a search space based on the importance scores of the filters 908 and fine tune a candidate pruned architecture (i.e., a selected pruned NN model 910).
  • During the rank generation stage, a batch of images from a dataset is run through the layers 906 of the NN model 902 to obtain the feature maps 912 and estimate the average rank value of each feature map 912 associated with a respective filter 908. During the search space generation stage, the low-rank feature maps and their corresponding filters are identified for removal.
  • the importance score I of a filter 908 in each layer 906 is determined by one of the following two equations: I = a_l * R_l + b_l or I = a_l * (R_l + b_l), where l is an index of a layer 906, and a_l and b_l are two learnable parameters (also called importance coefficients) for global modification, which can scale and shift the importance scores I of filters 908 in each layer 906.
  • R_l is the average rank value of the feature maps of the l-th CNN layer.
  • LGHR ranks the filters by the importance score I of each filter 908 and removes low-score filters 908.
  • a regularized evolutionary algorithm EA is applied to generate a network architecture search space as explained above.
  • the network architecture search space includes the plurality of pruned NN models 910, and each pruned NN model 910 is generated based on a distinct set of importance coefficients for the plurality of layers 906.
  • the target NN model 904 is selected from the pruned network search space and fine-tuned via several gradient steps. A loss between the NN models 902 and 904 is used to select the importance coefficients of the layers 906. In some situations, the importance coefficients are reset iteratively until an optimized target NN model 904 is identified. In some embodiments, first importance coefficients of layers 906 are selected in a first range to generate a first batch of pruned NN models 910.
  • One or two first target NN models 904 are selected from the plurality of pruned NN models 910, and the first importance coefficients corresponding to the one or two first target NN models 904 are selected to narrow down the first range of importance coefficients to a second range.
  • Second importance coefficients of layers 906 are selected in the second range to generate a second batch of pruned NN models 910.
  • One or two second target NN models 904 are selected from the second batch of pruned NN models 910.
  • the target NN model 904 is outputted to the client device 104, or continues to narrow down the second range for importance coefficients iteratively.
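  • The iterative range-narrowing search described above might be sketched as follows; build_pruned_model and score_model are hypothetical helpers, and the rule for narrowing the range around the best candidate is an assumption for illustration.

```python
# Illustrative sketch: sample importance coefficients in a range, build and score
# a batch of pruned candidates, then narrow the range around the best candidate
# and repeat for a few rounds.
import random

def search_coefficients(layers, build_pruned_model, score_model,
                        lo=0.0, hi=1.5, rounds=3, candidates=8):
    best = None
    for _ in range(rounds):
        scored = []
        for _ in range(candidates):
            coeffs = {l: (random.uniform(lo, hi), random.uniform(lo, hi))
                      for l in layers}
            model = build_pruned_model(coeffs)
            scored.append((score_model(model), coeffs, model))
        scored.sort(key=lambda s: s[0])              # lower loss is better
        best = scored[0]
        # Narrow the coefficient range around the best candidate found so far.
        center = sum(a for a, _ in best[1].values()) / len(layers)
        half = (hi - lo) / 4
        lo, hi = max(0.0, center - half), center + half
    return best[2]
```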
  • In LGHR, an importance score measure is used based on the rank of the feature maps associated with different filters 908.
  • An AutoML pipeline is optionally used to search a target NN model 904.
  • LGHR takes into account both feature map ranking and AutoML filter pruning.
  • LGHR provides a global ranking solution to combine filters 908 across different layers 906.
  • Global ranking analysis makes it easy to set a pruning target and find an optimal target NN model 904.
  • LGHR modifies adaptive pruning, e.g., using low rank feature maps.
  • LGHR uses AutoML to learn hyper-parameters that can greatly reduce the workload and time for hyper-parameter settings.
  • FIG 10A is a flow diagram of a multistep filter pruning process 1000 for simplifying a first NN model 1002 to a target NN model 1004, in accordance with some embodiments.
  • the first NN model 1002 has a plurality of layers 1006, and each layer 1006 has a plurality of filters 1008.
  • the first NN model 1002 has a first model size, and is operated with at least a first computational resource usage that is measured in FLOPS.
  • the first NN model 1002 is required to be compressed to a target model size corresponding to a target computational resource usage measured in FLOPS.
  • the first NN model 1002 is operated with at least 4G floating point operations per second (i.e., 4G FLOPS), and needs to be compressed by 75% to the target NN model 1004 having a target model size corresponding to 1G FLOPS or less.
  • 4G FLOPS 4G floating point operations per second
  • a single pruning operation is implemented to prune the first NN model 1002 down to the target model size.
  • a sequence of pruning operations 1010 are implemented to reach the target model size.
  • the intermediate model sizes are determined to approach the target model size gradually from the first NN model 1002, e.g., with an equal or varying step size.
  • the first model size and the target model size are used to derive one or more intermediate model sizes.
  • the one or more intermediate model sizes and the target model size form a sequence of decreasing model sizes ordered according to magnitudes of these model sizes.
  • each pruning operation 1010 corresponds to a respective model size in the ordered sequence of model sizes.
  • During each pruning operation 1010, a respective subset of filters 1008 of the first NN model 1002 is identified to be removed based on the respective model size, and the first NN model 1002 is updated to remove the respective subset of filters 1008, thereby reducing a size of the first NN model 1002 to the respective model size.
  • each updated first NN model 1012 is optionally trained according to a predefined loss function. That said, after the respective subset of filters 1008 is removed, weights of remaining filters 1008 of the updated first NN model 1012 are adjusted based on the predefined loss function during training.
  • the first NN model 1002 is operated with the first model size of 4G FLOPS, and the target model size is 1G FLOPS.
  • Two intermediate model sizes of 2G FLOPS and 3G FLOPS are derived based on the first and target model sizes.
  • the ordered sequence of model sizes are 3G, 2G, and 1G FLOPS.
  • a sequence of 3 pruning operations 1010 are implemented to reduce the computational resource usage of the first NN model 1002 from 4G FLOPS to 3G, 2G, and 1G FLOPS, successively and respectively.
  • the updated first NN model 1012 is trained such that the weights of the remaining filters 1008 are adjusted based on the predefined loss function.
  • the first NN model 1002 is operated with the first model size of 1000 filters, and the target model size is 400 filters.
  • Two intermediate model sizes of 800 and 600 filters are derived.
  • a sequence of 3 pruning operations 1010 are implemented to reduce the first NN model 1002 from 1000 filters to 800, 600, and 400 filters, successively and respectively.
  • one or more intermediate model sizes are determined to form a sequence of decreasing model sizes having varying step sizes.
  • An example of the sequence is 800, 500, and 400 filters.
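  • A minimal sketch of how the ordered sequence of decreasing model sizes could be derived, either with equal steps between the first and target sizes or with a caller-supplied schedule of varying steps; the function name is hypothetical.

```python
# Illustrative sketch: derive the ordered sequence of model sizes (filter counts
# or FLOPS budgets) used by the sequence of pruning operations.
def pruning_schedule(first_size, target_size, steps=3, custom=None):
    if custom is not None:                    # e.g., [800, 500, 400] filters
        return sorted(custom, reverse=True)
    step = (first_size - target_size) / steps
    return [round(first_size - step * i) for i in range(1, steps + 1)]

# Example: pruning_schedule(1000, 400, steps=3) -> [800, 600, 400]
```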
  • the updated NN model 1012 having the target model size (i.e., the target NN model 1004) is provided to a client device 104, and the target model size satisfies a target computation criterion associated with the client device 104.
  • the client device 104 is a mobile phone, and the target model size is 100 filters.
  • the client device 104 is a tablet computer, and the target model size is 400 filters.
  • each of the pruning operations 1010 receives an input NN model.
  • a first pruning operation receives the first NN model 1002, and any following pruning operation receives the updated first NN model 1012 of an immediately preceding pruning operation 1010.
  • an importance factor is determined for each filter 1008 of the input NN model, and a respective subset of filters 1008 having the smallest importance scores is selected from among the filters of the input NN model. Further, in some embodiments, the importance factor of each filter 1008 is determined based on a sum of weights of the respective filter 1008 applied to convert inputs of the respective filter 1008. This sum is optionally a weighted sum of the weights of the respective filter 1008.
  • the importance factor of each filter 1008 is based on an L1 norm, e.g., an unweighted or weighted sum of the absolute values of the weights of the respective filter 1008. In some situations, the importance factor of each filter 1008 is based on an L2 norm, e.g., a square root of an unweighted or weighted sum of squares of the weights of the respective filter 1008. For each pruning operation 1010, the importance factors of the filters 1008 of the input NN model (which is a subset of the first NN model 1002) are ranked and applied to determine the subset of filters 1008 to be removed, as illustrated in the sketch below.
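  • The sketch below illustrates L1- and L2-norm importance factors and the selection of the lowest-scoring filters for one pruning operation; the (out_channels, in_channels, kH, kW) weight layout is an assumption following common convolution conventions.

```python
# Illustrative sketch: per-filter importance factors from the L1 or L2 norm of the
# filter weights, and the indices of the least important filters to remove.
import numpy as np

def filter_importance(weights: np.ndarray, norm: str = "l1") -> np.ndarray:
    """weights: (out_channels, in_channels, kH, kW); one score per filter."""
    flat = weights.reshape(weights.shape[0], -1)
    if norm == "l1":
        return np.abs(flat).sum(axis=1)
    return np.sqrt((flat ** 2).sum(axis=1))   # "l2"

def filters_to_remove(weights: np.ndarray, target_count: int) -> np.ndarray:
    scores = filter_importance(weights)
    n_remove = weights.shape[0] - target_count
    return np.argsort(scores)[:n_remove]      # indices of least important filters
```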
  • FIG. 10B is a flow diagram of another multistep filter pruning process 1050 involving model dilation, in accordance with some embodiments.
  • the filter pruning process 1050 includes two stages: dilating a first NN model 1002 to a dilated NN model 1014 and pruning the dilated NN model 1014 by a sequence of pruning operations to a target NN model 1004.
  • the first NN model 1002 is dilated, such that a number of filters is increased by a factor of 1.5 or 2.
  • the number of filters is increased in the dilated NN model 1014 compared with the first NN model 1002 from which the dilated NN model 1014 is dilated.
  • one or more supplemental layers of filters 1008 are added to the plurality of layers 1006 of the first NN model 1002. In some embodiments, one or more supplemental filters 1008 are selectively added to each of a subset of the plurality of layers 1006 of the first NN model 1002.
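  • A per-layer sketch of such dilation, assuming a PyTorch convolutional layer: a wider layer is created, the original filters are copied in, and the supplemental filters are freshly initialized. Wiring the widened output to the next layer's input channels is omitted for brevity, and the helper name is hypothetical.

```python
# Illustrative sketch: dilate one convolutional layer by a ratio (e.g., 1.5x or 2x
# more filters) while preserving the original filter weights.
import torch

def dilate_conv(layer: torch.nn.Conv2d, ratio: float = 2.0) -> torch.nn.Conv2d:
    new_out = int(layer.out_channels * ratio)
    wider = torch.nn.Conv2d(layer.in_channels, new_out, layer.kernel_size,
                            stride=layer.stride, padding=layer.padding,
                            bias=layer.bias is not None)
    with torch.no_grad():
        wider.weight[: layer.out_channels] = layer.weight   # keep original filters
        if layer.bias is not None:
            wider.bias[: layer.out_channels] = layer.bias
    return wider
```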
  • the dilated NN model 1014 has a dilated model size, and the one or more intermediate sizes are derived based on the dilated model size and the target model size that is determined based on the client device 104 receiving the target NN model 1004.
  • the sequence of pruning operations 1010 are initiated from the dilated NN model 1014.
  • the first NN model 1002 is operated with the first model size of 4G FLOPS, and the target model size is 1G FLOPS.
  • the first model size is dilated to 6G FLOPS.
  • One intermediate model size of 3.5G FLOPS is derived based on the dilated and target model sizes.
  • the ordered sequence of model sizes are 3.5G, and 1G FLOPS.
  • a sequence of 2 pruning operations 1010 are implemented to reduce the computational resource usage of the dilated NN model 1014 from 6G FLOPS to 3.5G and 1G FLOPS, successively and respectively.
  • the sequence of model sizes are not equally spaced.
  • Three intermediate model sizes of 4.5G, 3G, and 2G FLOPS are derived based on the dilated and target model sizes.
  • a sequence of four pruning operations 1010 are implemented to reduce the computational resource usage of the dilated NN model 1014 from 6G FLOPS to 4.5G, 3G, 2G, and 1G FLOPS, successively and respectively.
  • the multistep filter pruning process 1050 optionally includes more pruning operations than the multistep filter pruning process 1000 to reach the same target model size, while each pruning operation is similarly implemented, e.g., based on importance scores of the filters 1008 calculated based on the L1 or L2 norm.
  • each updated first NN model 1012 is optionally trained according to a predefined loss function. After the respective subset of filters 1008 is removed, weights of remaining filters 1008 of the updated first NN model 1012 are adjusted based on the predefined loss function during training. Alternatively, in some embodiments, the updated first NN model 1012 is not trained after each pruning operation 1010. Rather, the updated first NN model 1012 obtained after the entire sequence of pruning operations 1010 is trained based on the predefined loss function, and fine-tuned to the target NN model 1004.
  • the target NN model 1004 includes a plurality of weights associated with the respective number of filters 1008 of each layer 1006.
  • the first NN model 1002 and updated first NN model 1012 generated by each pruning operation 1010 maintain a float32 format.
  • the plurality of weights of the target NN model 1004 are quantized (1016), e.g., to an int8, uint8, int16, or uint16 format based on the precision setting of the client device 104 that is configured to receive and use the target NN model 1004.
  • the client device 104 receiving and using the target NN model 1004 has a filter setting that defines numbers of filters fitting a register length of a SIMD computing architecture.
  • the updated NN model 1012 is tuned based on the filter setting.
  • the number of filters in each layer of the updated NN model 1012 is expanded based on the filter setting of the client device 104. For example, the number of filters in each layer of the updated NN model 1012 is expanded to a multiple of 8, 16, or 32 based on the filter setting of the client device 104.
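  • A one-line sketch of the filter-count expansion described above, rounding a layer's filter count up to the next multiple of the client device's preferred SIMD width (e.g., 8, 16, or 32); the per-device widths are assumptions.

```python
# Illustrative sketch: round a filter count up to a multiple of the SIMD width.
def round_up_filters(filter_count: int, simd_width: int = 8) -> int:
    return ((filter_count + simd_width - 1) // simd_width) * simd_width

# Example: round_up_filters(37, 16) -> 48
```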
  • each pruning operation 1010 is configured to reach a distinct model size or a distinct computational resource usage that is measured in FLOPS.
  • the distinct model size or resource usage decreases gradually with the respective pruning operation in the sequence of pruning operations.
  • the sequence of pruning operations are thereby implemented to prune the first NN model 1002 ( Figure 10A) or dilated NN model 1014 ( Figure 10B) to reach each distinct model size or computational resource usage successively.
  • Different pruning methods can be applied in each pruning operation 1010 of the multistep filter pruning processes 1000 and 1050.
  • Different pruning operations 1010 in the same sequence can use different pruning methods, such as filter-wise pruning and AutoML pruning.
  • the multistep filter pruning process 1050 can provide the target NN model 1004 with a high compression rate (e.g., greater than a threshold compression rate) that is hard to reach with a single pruning operation.
  • Figures 11-13 are three flow diagrams of three filter pruning methods 1100, 1200, and 1300, respectively, in accordance with some embodiments.
  • each of the methods 1100, 1200, and 1300 is described as being implemented by a computer system (e.g., a client device 104, a server 102, or a combination thereof).
  • An example of the client device 104 is a mobile phone.
  • each of the methods 1100, 1200, and 1300 is applied to prune filters of a corresponding neural network model.
  • Each of the methods 1100, 1200, and 1300 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system.
  • Each of the operations shown in Figures 11-13 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the system 200 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in each of the methods 1100, 1200, and 1300 may be combined and/or the order of some operations may be changed.
  • a computer system obtains (1102) a neural network model 502 having a plurality of layers 504, and each layer 504 has a respective number of filters 506.
  • the neural network model 502 is divided (1104) into a plurality of neural network subsets 602, and each neural network subset 602 includes a subset of distinct and consecutive layers 504 of the neural network model 502.
  • the computer system separately prunes (1106) each neural network subset 602 while maintaining the remaining neural network subsets 602 in the neural network model 502. In some embodiments, two or more neural network subsets 602 are pruned concurrently and in parallel.
  • Each pruned neural network subset is combined (1108) with the others to generate a target neural network model 510, i.e., the plurality of neural network subsets 602 that are pruned are combined into the target neural network model 510.
  • the target neural network model 510 is trained (1110) according to a predefined loss function.
  • the target neural network model 510 is provided (1110) to an electronic device (e.g., a client device 104). While pruning each neural network subset 602, the computer system controls (1112) the respective number of filters in each layer 504 of the respective neural network subset 602 according to a filter setting of the electronic device. Further, in some embodiments, each layer of the target neural network 510 has a respective updated number of filters, and the respective updated number is a multiple of 4.
  • the neural network model 502 includes a plurality of weights associated with the respective number of filters 506 of each layer 504.
  • the computer system maintains a float32 format for the plurality of weights while separately pruning each neural network subset 602.
  • the computer system quantizes the plurality of weights, e.g., from a float32 format to an int8, uint8, int16, or uint16 format.
  • the plurality of weights are quantized based on a precision setting of an electronic device.
  • the target neural network model 510 having quantized weights is provided to the electronic device.
  • each neural network subset 602 is separately pruned.
  • the computer system updates the neural network model 502 by replacing the respective neural network subset with a respective pruned neural network subset 610 (e.g., 610-1), training the updated neural network model (e.g., in the pruning pipeline 606A) according to a predefined loss function, and extracting the respective neural network subset 602 from the updated neural network model after the updated neural network model is trained.
  • the predefined loss function includes a term dedicated to weights of filters of the respective pruned neural network subset 610.
  • the predefined loss function applied to train the updated neural network model in the pruning pipeline 606 includes a first loss function. Prior to dividing the neural network model, the computer system trains the neural network model 502 according to a second loss function. The first loss function is a combination of the second loss function and the term.
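  • A minimal sketch of such a combined loss, assuming PyTorch: the original (second) loss plus a term dedicated to the weights of the pruned subset, shown here as an L1 penalty with a hypothetical coefficient.

```python
# Illustrative sketch: first loss = second (original) loss + a term on the weights
# of the respective pruned neural network subset.
import torch

def combined_loss(second_loss, pruned_subset_weights, penalty=1e-4):
    term = sum(w.abs().sum() for w in pruned_subset_weights)   # L1 term (assumed)
    return second_loss + penalty * term                        # "first loss function"
```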
  • the computer system selects a respective set of filters 506 to be removed from respective neural network subset 602 based on a pruning method. For example, an importance score is determined for each filter 506 in the respective neural network subset 602, and the respective set of filters 506 having the smallest importance scores are selected among the filters 506 in the respective neural network subset 602. The respective set of filters 506 are removed from the respective neural network subset to obtain the respective pruned neural network subset 610.
  • a computer system obtains (1202) a neural network model 902 having a plurality of layers 906, and each layer 906 has a respective number of filters 908.
  • the computer system prunes (1204) the neural network model 902 to a plurality of pruned neural network models 910.
  • the computer system assigns (1206) a respective distinct set of importance coefficients for the plurality of layers 906, and determines (1208) an importance score I of each filter 908 based on the respective distinct set of importance coefficients.
  • the respective distinct set of importance coefficients includes a subset of importance coefficients for a respective layer to which the respective filter belongs.
  • For each pruned neural network model 910, the computer system ranks (1210) the filters 908 based on the importance score of each filter 908, and in accordance with the ranking of the filters, prunes (1212) the neural network model 902 to the respective pruned neural network model 910 by removing a respective subset of filters.
  • the target neural network model 904 is selected (1214) from the plurality of pruned neural network models 910 based on a model selection criterion.
  • the computer system determines an average rank value for each filter 908 of the neural network model 902.
  • the importance score of each filter 908 is determined by combining the average rank value of the respective filter 908 and the subset of importance coefficients of the respective layer to which the respective filter belongs.
  • the distinct set of importance coefficients include a first importance coefficient a_l and a second importance coefficient b_l for each layer 906.
  • the first importance coefficient a_l is selected from a first set of importance coefficients in a first range
  • the second importance coefficient b_l is selected from a second set of importance coefficients in a second range.
  • the first range is optionally equal to or distinct from the second range.
  • In some embodiments, the average rank value R_l for each filter 908 in the respective layer 906 is modified to a_l * R_l + b_l.
  • Alternatively, in some embodiments, the average rank value R_l for each filter 908 in the respective layer 906 is modified to a_l * (R_l + b_l).
  • At least one of the first and second importance coefficients for at least one layer is distinct for every two pruning settings (e.g., ICS-1 and ICS-2 in Figure 9B) of two distinct pruned neural network models 910.
  • Each pruning setting corresponds to the respective distinct set of importance coefficients for the plurality of layers 906 of a respective pruned neural network model 910.
  • the average rank value for each filter 908 of the neural network model 902 is determined using a batch of predefined images.
  • Each filter 908 outputs a feature map that is applied to generate the respective average rank value R_l, which is used to generate the respective importance factor I of the respective filter 908.
  • In some embodiments, to select the target neural network model 904, each of the plurality of pruned neural network models 910 is trained for a predefined number of cycles.
  • the target neural network model 904 that has a loss function result better than that of any other pruned neural network model is selected from the plurality of pruned neural network models 910.
  • each of the plurality of pruned neural network models 910 is trained completely (e.g., until a loss function has been minimized).
  • the target neural network model 904 that uses the least number of training cycles is selected from the plurality of pruned neural network models 910.
  • the target neural network model 904 corresponds to computational source usage having the least number of floating point operations per second (FLOPS) among the plurality of pruned neural network models 910.
  • FLOPS floating point operations per second
  • the neural network model 902 includes a plurality of weights associated with the respective number of filters 908 of each layer 906.
  • the computer system maintains a float32 format for the plurality of weights while pruning the neural network model 902.
  • the computer system quantizes the plurality of weights, e.g., from a float32 format to an int8, uint8, int16, or uint16 format.
  • the plurality of weights are quantized based on a precision setting of an electronic device.
  • the target neural network model 904 having quantized weights is provided to the electronic device.
  • a computer system obtains (1302) a neural network model 1002 having a plurality of layers 1006, and each layer 1006 has a respective number of filters 1008.
  • the computer system identifies (1304) a target model size to which the neural network model is compressed.
  • One or more intermediate model sizes are derived (1306) from the target model size of the neural network model 1002.
  • the one or more intermediate model sizes and the target model size form (1308) an ordered sequence of model sizes.
  • the computer system implements (1310) a sequence of pruning operations 1010, e.g., to compress the neural network model 1002 gradually. Each pruning operation corresponds to a respective model size in the ordered sequence of model sizes.
  • For each pruning operation, the computer system identifies (1312) a respective subset of filters 1008 of the neural network model 1002 to be removed based on the respective model size, and updates (1314) the neural network model to remove the respective subset of filters 1008, thereby reducing a size of the neural network model 1002 to the respective model size.
  • An order of each pruning operation in the sequence of pruning operations 1010 is consistent with an order of the respective model size in the sequence of model sizes.
  • the updated neural network model 1012 is trained according to a predefined loss function. Weights of unpruned filters of the updated neural network model 1012 are adjusted, e.g., to minimize the predefined loss function.
  • the target model size of the neural network model 1002 satisfies a target computation criterion associated with an electronic device, and a target neural network model pruned by the sequence of pruning operations 1010 is provided to the electronic device.
  • the computer system determines the one or more intermediate model sizes so that they approach the target computation criterion gradually, e.g., with equal step sizes or varying step sizes.
  • the neural network model 1002 is obtained with a first model size, and the one or more intermediate model sizes are equally distributed between the first model size and the target model size.
  • the respective subset of filters of the neural network model 1002 is identified for each pruning operation 1010 by determining an importance score for each filter 1008 of the neural network model 1002 or 1014 (if dilated) and selecting the respective subset of filters 1008 that have the smallest importance scores I among the filters 1008 of the neural network model. Further, in some embodiments, to determine the importance score for each filter 1008, the computer system determines a sum of weights of the respective filter 1008 that are applied to convert inputs of the respective filter 1008, and associates the importance score I of the respective filter with the sum of weights of the respective filter.
  • the computer system dilates the neural network model.
  • the dilated neural network model 1014 has a dilated model size.
  • the one or more intermediate model sizes are derived based on the dilated model size and the target model size of the neural network model, and the sequence of pruning operations is initiated on the dilated neural network model 1014.
  • the one or more intermediate model sizes are equally distributed between the dilated model size and the target model size.
  • the one or more intermediate model sizes are not equally distributed between the dilated model size and the target model size.
  • the neural network model is dilated to increase a size of the neural network model by a predefined ratio. Additionally, in some embodiments, the neural network model 1002 is dilated by at least one of: adding one or more supplemental layers to the plurality of layers 1006 of the neural network model 1002 and adding one or more supplemental filters to each of a subset of the plurality of layers 1006 of the neural network model 1002. Further, in some embodiments, the dilated neural network model 1014 is re-trained according to the predefined loss function. In some embodiments, the neural network model 1002 includes a plurality of weights associated with the respective number of filters 1008 of each layer 1006.
  • the computer system maintains a float32 format for the plurality of weights during the sequence of pruning operations.
  • the computer system quantizes the plurality of weights, e.g., from a float32 format to an int8, uint8, int16, or uint16 format. Further, in some embodiments, the plurality of weights are quantized based on a precision setting of an electronic device.
  • the target neural network model 1004 having quantized weights is provided to the electronic device.
  • the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context.
  • the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
  • stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Abstract

A computer system obtains a neural network model having a plurality of layers, and each layer has a respective number of filters. The computer system identifies a target model size to which the neural network model is to be compressed and derives one or more intermediate model sizes from the target model size of the neural network model. The one or more intermediate model sizes and the target model size form an ordered sequence of model sizes. The computer system implements a sequence of pruning operations, each of which corresponds to a respective model size in the ordered sequence of model sizes. For each pruning operation, the computer system identifies a respective subset of filters to be removed based on the respective model size and updates the neural network model to prune the respective subset of filters, thereby reducing the size of the neural network model to the respective model size.
PCT/US2021/030480 2021-05-03 2021-05-03 Compression de réseaux neuronaux convolutifs par élagage WO2021195643A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2021/030480 WO2021195643A1 (fr) 2021-05-03 2021-05-03 Compression de réseaux neuronaux convolutifs par élagage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2021/030480 WO2021195643A1 (fr) 2021-05-03 2021-05-03 Compression de réseaux neuronaux convolutifs par élagage

Publications (1)

Publication Number Publication Date
WO2021195643A1 true WO2021195643A1 (fr) 2021-09-30

Family

ID=77892699

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/030480 WO2021195643A1 (fr) 2021-05-03 2021-05-03 Compression de réseaux neuronaux convolutifs par élagage

Country Status (1)

Country Link
WO (1) WO2021195643A1 (fr)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114037844A (zh) * 2021-11-18 2022-02-11 西安电子科技大学 基于滤波器特征图的全局秩感知神经网络模型压缩方法
KR102461997B1 (ko) * 2021-11-15 2022-11-04 주식회사 에너자이(ENERZAi) 신경망 모델의 경량화 방법, 신경망 모델의 경량화 장치, 및 신경망 모델의 경량화 시스템
KR102461998B1 (ko) * 2021-11-15 2022-11-04 주식회사 에너자이(ENERZAi) 신경망 모델의 경량화 방법, 신경망 모델의 경량화 장치, 및 신경망 모델의 경량화 시스템
WO2023102844A1 (fr) * 2021-12-09 2023-06-15 北京大学深圳研究生院 Procédé et appareil pour déterminer un module d'élagage, et support de stockage lisible par ordinateur
WO2023172293A1 (fr) * 2022-03-11 2023-09-14 Tencent America LLC Procédé de quantification pour accélérer l'inférence de réseaux neuronaux

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170286830A1 (en) * 2016-04-04 2017-10-05 Technion Research & Development Foundation Limited Quantized neural network training and inference

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170286830A1 (en) * 2016-04-04 2017-10-05 Technion Research & Development Foundation Limited Quantized neural network training and inference

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
LI GUAN; WANG JUNPENG; SHEN HAN-WEI; CHEN KAIXIN; SHAN GUIHUA; LU ZHONGHUA: "CNNPruner: Pruning Convolutional Neural Networks with Visual Analytics", IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, IEEE, USA, vol. 27, no. 2, 13 October 2020 (2020-10-13), USA, pages 1364 - 1373, XP011834031, ISSN: 1077-2626, DOI: 10.1109/TVCG.2020.3030461 *
MICHAEL ZHU; SUYOG GUPTA: "To prune, or not to prune: exploring the efficacy of pruning for model compression", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 5 October 2017 (2017-10-05), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081283371 *
REZA ABBASI-ASL; BIN YU: "Structural Compression of Convolutional Neural Networks", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 20 May 2017 (2017-05-20), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081627444 *
SARA ELKERDAWY; MOSTAFA ELHOUSHI; ABHINEET SINGH; HONG ZHANG; NILANJAN RAY: "To filter prune, or to layer prune, that is the question", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 11 July 2020 (2020-07-11), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081719668 *
SONG HAN, POOL JEFF, TRAN JOHN, DALLY WILLIAM J: "Learning both Weights and Connections for Efficient Neural Networks", 30 October 2015 (2015-10-30), XP055396330, Retrieved from the Internet <URL:https://arxiv.org/pdf/1506.02626.pdf> [retrieved on 20170804] *


Similar Documents

Publication Publication Date Title
WO2021195643A1 (fr) Compression de réseaux neuronaux convolutifs par élagage
US11605019B2 (en) Visually guided machine-learning language model
US11604822B2 (en) Multi-modal differential search with real-time focus adaptation
WO2021178981A9 (fr) Compression de réseaux de neurones artificiels multi-modèles compatible matériel
US10303978B1 (en) Systems and methods for intelligently curating machine learning training data and improving machine learning model performance
US10296848B1 (en) Systems and method for automatically configuring machine learning models
WO2021184026A1 (fr) Fusion audiovisuelle avec attention intermodale pour la reconnaissance d'actions vidéo
CN109359725B (zh) 卷积神经网络模型的训练方法、装置、设备及计算机可读存储介质
WO2021081562A2 (fr) Modèle de reconnaissance de texte multi-tête pour la reconnaissance optique de caractères multilingue
JP2010504593A (ja) 分類手法を用いて画像からドミナントカラーを抽出する方法
US20240037948A1 (en) Method for video moment retrieval, computer system, non-transitory computer-readable medium
JPWO2009035108A1 (ja) 対応関係学習装置および方法ならびに対応関係学習用プログラム、アノテーション装置および方法ならびにアノテーション用プログラム、および、リトリーバル装置および方法ならびにリトリーバル用プログラム
US20230162477A1 (en) Method for training model based on knowledge distillation, and electronic device
JP2008542911A (ja) メトリック埋め込みによる画像比較
WO2021034941A1 (fr) Procédé de récupération et de groupement multimodaux à l'aide d'une cca profonde et d'interrogations par paires actives
US11941376B2 (en) AI differentiation based HW-optimized intelligent software development tools for developing intelligent devices
WO2021092631A2 (fr) Récupération de moment vidéo à base de texte faiblement supervisé
CN115878832B (zh) 基于精细对齐判别哈希的海洋遥感图像音频检索方法
WO2021195644A1 (fr) Élagage de filtre global de réseaux neuronaux à l'aide de cartes de caractéristiques de rang élevé
WO2020100738A1 (fr) Dispositif, procédé et programme de traitement
Lin et al. Domestic activities clustering from audio recordings using convolutional capsule autoencoder network
WO2023001940A1 (fr) Procédés et systèmes de génération de modèles pour prédiction de pipeline d'analyse d'image
WO2023018423A1 (fr) Incorporation binaire sémantique d'apprentissage pour des représentations vidéo
CN111354372B (zh) 一种基于前后端联合训练的音频场景分类方法及系统
WO2020151318A1 (fr) Procédé et appareil de construction de corpus fondés sur un modèle de collecteur, et dispositif informatique

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21776519

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21776519

Country of ref document: EP

Kind code of ref document: A1