WO2021253440A1 - Depth-wise over-parameterization - Google Patents


Info

Publication number
WO2021253440A1
Authority
WO
WIPO (PCT)
Prior art keywords
parameterization
over
convolutional layer
depth
neural network
Application number
PCT/CN2020/097221
Other languages
French (fr)
Inventor
Yangyan LI
Ying Chen
Jinming CAO
Original Assignee
Alibaba Group Holding Limited
Priority date
Application filed by Alibaba Group Holding Limited filed Critical Alibaba Group Holding Limited
Priority to CN202080100158.XA priority Critical patent/CN115461754A/en
Priority to PCT/CN2020/097221 priority patent/WO2021253440A1/en
Publication of WO2021253440A1 publication Critical patent/WO2021253440A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Definitions

  • Convolutional neural networks are a type of deep learning method, which are capable of expressing highly complicated functions.
  • a convolutional neural network may receive an image as an input, assign importance (i.e., learnable weights and biases) to various aspects/objects in the image, and differentiate one aspect/object from other aspects/objects. Accordingly, convolutional neural networks have been widely used in a variety of computer vision applications, such as image classification, object detection, and semantic segmentation.
  • a convolutional neural network is made up of a plurality of distinct layers.
  • the accuracy of a convolutional neural network relies heavily on the depth of the network, i.e., the larger the number of layers, the higher the accuracy of the network will be.
  • increasing the number of layers in a convolutional neural network may increase the complexity of the network, and lead to a higher amount of computations, an increased cost (in terms of processing and storage resources) , and a longer delay in computations, etc.
  • the increased complexity of the convolutional neural network may increase the time taken for convergence of different variables or parameters of the convolutional neural network (i.e., the time taken for training the convolutional neural network) , and the time taken for computing a result when a prediction or inference is subsequently performed using the convolutional neural network.
  • an over-parameterization may be performed on spatial dimensions of an original parameter matrix of a convolutional layer of a neural network, to convert the original parameter matrix of the convolutional layer into two parameter matrices: a first parameter matrix and a second parameter matrix.
  • training may then be performed on the neural network to determine values for learnable parameters of the first parameter matrix and the second parameter matrix.
  • the first parameter matrix and the second parameter matrix may be combined into a single parameter matrix and treated as a single convolutional layer, which may then be used in subsequent inference applications of the neural network.
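The fold described above relies only on linearity, so it can be sketched in a few lines of NumPy. This is an illustrative sketch, not the patent's exact formulation: the names `D` (a per-input-channel spatial over-parameterization matrix, the "first parameter matrix") and `W` (the original kernel, the "second parameter matrix") are assumptions, and the patch is kept as a flattened spatial vector per channel.

```python
import numpy as np

# Hedged sketch of depth-wise over-parameterization on the spatial dimensions
# of a convolution kernel. Names (D, W, fold) are illustrative.
# W: original kernel, shape (C_out, C_in, M*N)  -- spatial dims flattened
# D: depth-wise over-parameterization, shape (C_in, M*N, M*N)
#    (one spatial mixing matrix per input channel)

def fold(D, W):
    """Combine the two parameter matrices into a single kernel with the
    same shape as W: W_folded[:, c, :] = W[:, c, :] @ D[c]."""
    W_folded = np.empty_like(W)
    for c in range(W.shape[1]):
        W_folded[:, c, :] = W[:, c, :] @ D[c]
    return W_folded

rng = np.random.default_rng(0)
C_out, C_in, MN = 4, 3, 9
W = rng.normal(size=(C_out, C_in, MN))
D = rng.normal(size=(C_in, MN, MN))

# Applying D to an input patch first and then W gives the same output as the
# folded kernel applied directly -- linearity lets the two layers collapse.
patch = rng.normal(size=(C_in, MN))        # one M*N patch per input channel
out_two_layer = sum(W[:, c, :] @ (D[c] @ patch[c]) for c in range(C_in))
out_folded = sum(fold(D, W)[:, c, :] @ patch[c] for c in range(C_in))
assert np.allclose(out_two_layer, out_folded)
```

Because the fold happens once after training, inference runs on a single kernel of the original size.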
  • FIG. 1 illustrates an example environment in which an example over-parameterization system may be used.
  • FIG. 2A illustrates the example over-parameterization system in more detail.
  • FIG. 2B illustrates an example neural network processing architecture that can be used for implementing the example over-parameterization system.
  • FIG. 2C illustrates an example cloud system that incorporates the example neural network processing architecture to implement the example over-parameterization system.
  • FIG. 3 illustrates an example neural network in which an over-parameterization may be performed.
  • FIG. 4 illustrates an example normal convolution.
  • FIG. 5 illustrates an example depth-wise convolution.
  • FIGS. 6A and 6B illustrate example ways of applying a depth-wise over-parameterization operator.
  • FIGS. 7A and 7B illustrate example ways of obtaining a depth-wise over-parameterized depth-wise convolutional layer.
  • FIG. 8 illustrates an example method of over-parameterization.
  • the over-parameterization system may virtually convert a convolutional layer of a neural network into at least two linear layers, which are combined back into a single layer after training.
  • the over-parameterization system may perform an over-parameterization on spatial dimensions of a parameter matrix that represents the convolutional layer to obtain a first parameter matrix and a second parameter matrix, which virtually correspond to the two linear layers.
  • the first parameter matrix may constitute an additional layer that is related to depth-wise over-parameterization.
  • the second parameter matrix may have a size that is the same as that of the original parameter matrix of the convolutional layer.
  • the over-parameterization system may initialize trainable parameters of the neural network randomly or based on a priori knowledge.
  • the trainable parameters may include, but are not limited to, additional parameters included in the first parameter matrix, parameters of the second parameter matrix, and model parameters of other layers of the neural network.
  • the over-parameterization system may then perform training of the entire neural network based on a training algorithm (e.g., a gradient descent based optimization) using training data to obtain resulting values of the trainable parameters, i.e., a trained neural network.
  • the over-parameterization system may combine the at least two linear layers back into the convolutional layer. For example, the over-parameterization system may linearly combine the first parameter matrix and the second parameter matrix into a single parameter matrix, which represents the convolutional layer. The over-parameterization system or another system may then employ the trained neural network to perform inferences or predictions according to an intended application (such as an image classification, etc.).
  • the over-parameterization system involves increasing the number of trainable parameters associated with a convolutional layer of a neural network by performing an over-parameterization on a parameter matrix of the convolutional layer to form at least two parameter matrices.
  • the over-parameterization (i.e., adding a linear layer) actually helps accelerate the convergence of trainable parameters of the neural network, and thus increases the speed of training.
  • because the at least two parameter matrices are combined into a single parameter matrix (i.e., the at least two linear layers represented by the at least two parameter matrices are combined back into a single convolutional layer) after training, an amount of computations associated with such a single convolutional layer is equivalent or similar to that of a convolutional layer without the over-parameterization, when the trained neural network is employed for performing inferences or predictions according to an intended application (such as an image classification, etc.).
  • the functions described herein to be performed by the over-parameterization system may be performed by multiple separate units or services.
  • while in some examples the over-parameterization system may be implemented as a combination of software and hardware implemented and distributed in multiple devices, in other examples the over-parameterization system may be implemented and distributed as services provided in one or more computing devices over a network and/or in a cloud computing architecture.
  • the application describes multiple and varied embodiments and implementations.
  • the following section describes an example framework that is suitable for practicing various implementations.
  • the application describes example systems, devices, and processes for implementing an over-parameterization system.
  • FIG. 1 illustrates an example environment 100 usable to implement an over-parameterization system.
  • the environment 100 may include an over-parameterization system 102.
  • the over-parameterization system 102 may include a plurality of servers 104-1, 104-2, ..., 104-N (which are collectively called the servers 104).
  • the servers 104 may communicate data with one another via a network 106.
  • each of the servers 104 may be implemented as any of a variety of computing devices, including, but not limited to, a desktop computer, a notebook or portable computer, a handheld device, a netbook, an Internet appliance, a tablet or slate computer, a mobile device (e.g., a mobile phone, a personal digital assistant, a smart phone, etc.), a server computer, etc., or a combination thereof.
  • the network 106 may be a wireless or a wired network, or a combination thereof.
  • the network 106 may be a collection of individual networks interconnected with each other and functioning as a single large network (e.g., the Internet or an intranet) . Examples of such individual networks include, but are not limited to, telephone networks, cable networks, Local Area Networks (LANs) , Wide Area Networks (WANs) , and Metropolitan Area Networks (MANs) . Further, the individual networks may be wireless or wired networks, or a combination thereof.
  • Wired networks may include an electrical carrier connection (such as a communication cable, etc.) and/or an optical carrier or connection (such as an optical fiber connection, etc.).
  • Wireless networks may include, for example, a WiFi network, other radio frequency networks (e.g., Zigbee, etc. ) , etc.
  • the environment 100 may further include a client device 108.
  • the client device 108 may be implemented as any of a variety of computing devices, including, but not limited to, a desktop computer, a notebook or portable computer, a handheld device, a netbook, an Internet appliance, a tablet or slate computer, a mobile device (e.g., a mobile phone, a personal digital assistant, a smart phone, etc.), a server computer, etc., or a combination thereof.
  • the over-parameterization system 102 may receive a request for training a neural network (such as a convolutional neural network) from the client device 108. The over-parameterization system 102 may then perform training of the neural network according to the request from the client device 108.
  • FIG. 2A illustrates the over-parameterization system 102 in more detail.
  • the over-parameterization system 102 may include, but is not limited to, one or more processors 202, an input/output (I/O) interface 204, and/or a network interface 206, and memory 208.
  • some of the functions of the over-parameterization system 102 may be implemented using hardware, for example, an ASIC (i.e., Application-Specific Integrated Circuit), an FPGA (i.e., Field-Programmable Gate Array), and/or other hardware.
  • the processors 202 may be configured to execute instructions that are stored in the memory 208, and/or received from the I/O interface 204 and/or the network interface 206.
  • the processors 202 may be implemented as one or more hardware processors including, for example, a microprocessor, an application-specific instruction-set processor, a physics processing unit (PPU), a central processing unit (CPU), a graphics processing unit, a digital signal processor, a tensor processing unit, a neural processing unit, etc. Additionally or alternatively, the functionality described herein can be performed, at least in part, by one or more hardware logic components, such as field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip systems (SOCs), and complex programmable logic devices (CPLDs).
  • the memory 208 may include computer readable media in a form of volatile memory, such as Random Access Memory (RAM), and/or non-volatile memory, such as read only memory (ROM) or flash RAM.
  • the computer readable media may include volatile or non-volatile, removable or non-removable media, which may achieve storage of information using any method or technology.
  • the information may include a computer readable instruction, a data structure, a program module or other data.
  • Examples of computer readable media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electronically erasable programmable read-only memory (EEPROM), quick flash memory or other internal storage technology, compact disk read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassette tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission media, which may be used to store information that may be accessed by a computing device.
  • the computer readable media does not include any transitory media, such as modulated data signals and carrier waves.
  • the over-parameterization system 102 may further include other hardware components and/or other software components, such as program units to execute instructions stored in the memory 208 for performing various operations.
  • the over-parameterization system 102 may further include a parameter database 210 for storing parameter data of one or more machine learning models (such as neural network models, etc.), a training database 212 for storing training data, and other program data 214.
  • FIG. 2B illustrates an example neural network processing architecture 216 that can be used for implementing the over-parameterization system 102.
  • the neural network processing architecture 216 (such as the architecture used for a neural processing unit) may include a heterogeneous computation unit (HCU) 218, a host unit 220, and a host memory 222.
  • the heterogeneous computation unit 218 may include a special-purpose computing device or hardware used for facilitating and performing neural network computing tasks.
  • the heterogeneous computation unit 218 may perform algorithmic operations, including operations associated with machine learning algorithms.
  • the heterogeneous computation unit 218 may be an accelerator, which may include, but is not limited to, a neural network processing unit (NPU), a graphics processing unit (GPU), a tensor processing unit (TPU), a microprocessor, an application-specific instruction-set processor, a physics processing unit (PPU), a digital signal processor, etc.
  • the heterogeneous computation unit 218 may include one or more computing units 224, a memory hierarchy 226, a controller 228, and an interconnect unit 230.
  • the computing unit 224 may access the memory hierarchy 226 to read and write data in the memory hierarchy 226, and may further perform operations, such as arithmetic operations (e.g., multiplication, addition, multiply-accumulate, etc.), on the data.
  • the computing unit 224 may further include a plurality of engines that are configured to perform various types of operations.
  • the computing unit 224 may include a scalar engine 232 and a vector engine 234.
  • the scalar engine 232 may perform scalar operations such as scalar product, convolution, etc.
  • the vector engine 234 may perform vector operations such as vector addition, vector product, etc.
  • the memory hierarchy 226 may include an on-chip memory (such as four blocks of 8 GB second-generation high bandwidth memory (HBM2)) to serve as a main memory.
  • the memory hierarchy 226 may be configured to store data and executable instructions, and allow other components of the neural network processing architecture 216 (e.g., the host unit 220), components of the heterogeneous computation unit 218 (e.g., the computing units 224 and the interconnect unit 230), and/or devices external to the neural network processing architecture 216 to access the stored data and/or the stored instructions with high speed.
  • the interconnect unit 230 may provide or facilitate communications of data and/or instructions between the heterogeneous computation unit 218 and other devices or units (e.g., the host unit 220, one or more other HCU(s)) that are external to the heterogeneous computation unit 218.
  • the interconnect unit 230 may include a peripheral component interconnect express (PCIe) interface 236 and an inter-chip connection 238.
  • the PCIe interface 236 may provide or facilitate communications of data and/or instructions between the heterogeneous computation unit 218 and the host unit 220.
  • the inter-chip connection 238 may serve as an inter-chip bus to connect the heterogeneous computation unit 218 with other devices, such as other HCUs, an off-chip memory, and/or peripheral devices.
  • the controller 228 may be configured to control and coordinate operations of other components included in the heterogeneous computation unit 218.
  • the controller 228 may control and coordinate different components in the heterogeneous computation unit 218 (such as the scalar engine 232, the vector engine 234, and/or the interconnect unit 230) to facilitate parallelism or synchronization among these components.
  • the host memory 222 may be an off-chip memory, such as a memory of one or more processing units of a host system or device that includes the neural network processing architecture 216.
  • the host memory 222 may include a DDR memory (e.g., DDR SDRAM) or the like, and may be configured to store a large amount of data with slower access speed, as compared to an on-chip memory that is integrated within the one or more processing units, to act as a higher-level cache.
  • the host unit 220 may include one or more processing units (e.g., an X86 central processing unit (CPU) ) .
  • the host system or device having the host unit 220 and the host memory 222 may further include a compiler (not shown) .
  • the compiler may be a program or computer software configured to convert computer codes written in a certain programming language into instructions that are readable and executable by the heterogeneous computation unit 218. In machine learning applications, the compiler may perform a variety of operations, which may include, but are not limited to, pre-processing, lexical analysis, parsing, semantic analysis, conversion of an input program to an intermediate representation, code optimization, and code generation, or any combination thereof.
  • FIG. 2C illustrates an example cloud system 240 that incorporates the neural network processing architecture 216 to implement the over-parameterization system 102.
  • the cloud system 240 may provide cloud services with machine learning and artificial intelligence (AI) capabilities, and may include a plurality of servers, e.g., servers 242-1, 242-2, ..., 242-K (which are collectively called the servers 242), where K is a positive integer.
  • one or more of the servers 242 may include the neural network processing architecture 216.
  • the cloud system 240 may provide part or all of the functionalities of the over-parameterization system 102, and other machine learning and artificial intelligence capabilities, such as image recognition, facial recognition, translation, 3D modeling, etc.
  • the neural network processing architecture 216 that provides some or all of the functionalities of the over-parameterization system 102 may be deployed in other types of computing devices, which may include, but are not limited to, a mobile device, a tablet computer, a wearable device, a desktop computer, etc.
  • FIG. 3 illustrates an example neural network model 300 (or simply a neural network 300) in which an over-parameterization may be performed.
  • although a convolutional neural network model is described in this example, the present disclosure may also be applicable to other types of neural network models that involve convolution, and/or neural network models having more or fewer types of layers than those described hereinafter.
  • the neural network 300 may include a plurality of building blocks or distinct types of layers.
  • the plurality of building blocks or distinct types of layers may include, but are not limited to, one or more convolutional layers 302 (only one is shown for the sake of simplicity), one or more pooling layers 304 (only one is shown for the sake of simplicity), and a plurality of fully connected layers 306-1, ..., 306-S (collectively called the fully connected layers 306), where S is an integer greater than one.
  • an input and an output of a layer may be depicted as a feature map, which may be a tensor in R^(H×W×C), where H, W, and C represent a height, a width, and a number of channels of the feature map. Dimensions of the height and the width may define a resolution of the feature map, and may be referred to as spatial dimensions.
  • an input feature map may be denoted as a tensor x ∈ R^(H_in×W_in×C_in), where H_in, W_in, and C_in represent a height, a width, and a number of channels of the input feature map.
  • an output feature map may be denoted as a tensor y ∈ R^(H_out×W_out×C_out), where H_out, W_out, and C_out represent a height, a width, and a number of channels of the output feature map.
  • a layer of the neural network may be defined as a matrix which converts the input tensor x into the output tensor y, and such a matrix may be denoted as a parameter matrix W.
  • the fully connected layer 306 may connect each element in an input tensor x to each element in an output tensor y. This may be achieved through a parameter matrix W ∈ R^((H_out·W_out·C_out)×(H_in·W_in·C_in)). More specifically, ŷ = W x̂, where x̂ and ŷ are reshaped x and y, respectively.
  • a tensor can be reshaped without the content thereof being changed.
  • a computation of a fully connected layer may be a matrix ⁇ vector product.
  • a hyper-parameter involved in a fully connected layer is the number of output elements, i.e., H_out × W_out × C_out.
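The fully connected layer above reduces to a reshape, a matrix-vector product, and a reshape back. A minimal NumPy sketch (sizes are illustrative):

```python
import numpy as np

# Hedged sketch of a fully connected layer: the input feature map x
# (H_in x W_in x C_in) is flattened into a vector, multiplied by the
# parameter matrix W, and the result reshaped into H_out x W_out x C_out.
rng = np.random.default_rng(0)
H_in, W_in, C_in = 4, 4, 3
H_out, W_out, C_out = 2, 2, 5

x = rng.normal(size=(H_in, W_in, C_in))
W = rng.normal(size=(H_out * W_out * C_out, H_in * W_in * C_in))

y = (W @ x.reshape(-1)).reshape(H_out, W_out, C_out)  # matrix-vector product
assert y.shape == (H_out, W_out, C_out)
```

Note the parameter count, H_out·W_out·C_out × H_in·W_in·C_in, grows with the spatial resolution of both feature maps, which motivates the convolutional alternative below.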
  • the plurality of fully connected layers 306 may be applied at the end of the neural network 300, after a size (e.g., a spatial resolution) of an input feature map to a first fully connected layer of the plurality of fully connected layers 306 is significantly reduced from an initial value (e.g., an original size or resolution of an image in an image recognition application, etc. ) at the beginning of the neural network 300.
  • the one or more convolutional layers 302 may be applied at the beginning of the neural network 300.
  • for a convolutional layer 302, instead of connecting each element in an input tensor x to each element in an output tensor y, patches of x may be connected to elements in y by a convolution operator. For instance, if a patch of a spatial size of M×N that is sampled from a spatial location (h_in, w_in) of x is denoted as P_(h_in, w_in) ∈ R^(M×N×C_in), and a parameter matrix of the convolutional layer is denoted as W ∈ R^(C_out×(M·N·C_in)), an output of the convolutional layer may be depicted as: y_(h_out, w_out) = W P̂_(h_in, w_in), where P̂_(h_in, w_in) is the patch reshaped into a vector.
  • each element in y may be connected to one patch in x, and thus a size of the parameter matrix does not depend on a spatial resolution of an input feature map of the convolutional layer; it depends only on the size of the patch and the numbers of input and output channels.
  • compared with a fully connected layer, a parameter matrix of a convolutional layer is much smaller, thus reducing the number of parameters to be trained or optimized.
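The patch-based view above is exactly the "im2col" formulation: gather every M×N patch, flatten it, and multiply by the shared parameter matrix. A hedged NumPy sketch (stride 1, no padding; sizes are illustrative):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hedged sketch of the convolution above: every M x N x C_in patch of the
# input is connected to one element of each output channel through a shared
# parameter matrix W of shape (C_out, M*N*C_in).
def conv2d(x, W, M, N):
    patches = sliding_window_view(x, (M, N), axis=(0, 1))  # (H_out, W_out, C_in, M, N)
    H_out, W_out = patches.shape[0], patches.shape[1]
    # reorder so each flattened patch is laid out as M*N*C_in
    cols = patches.transpose(0, 1, 3, 4, 2).reshape(H_out, W_out, -1)
    return cols @ W.T                                      # (H_out, W_out, C_out)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6, 3))
W = rng.normal(size=(4, 3 * 3 * 3))   # C_out = 4, kernel 3x3, C_in = 3
y = conv2d(x, W, 3, 3)
assert y.shape == (4, 4, 4)           # (6-3+1, 6-3+1, C_out)
```

Here W has 4·27 = 108 parameters regardless of the input resolution, illustrating why the convolutional parameter matrix is much smaller than a fully connected one.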
  • hyper-parameters involved in a convolutional layer may include a patch size or kernel size M×N, a dilation rate a×b, a stride s×t, and the number of output channels C_out.
  • the patch size or kernel size may define a shape and a size of a patch, whereas the dilation rate may refer to a gap between elements that are extracted from an input feature map to form a patch.
  • a combination of the kernel size and the dilation rate may define a receptive field of the convolutional layer, i.e., (M + (M−1)×(a−1)) × (N + (N−1)×(b−1)), which shows that a larger receptive field can be obtained by increasing the kernel size and/or the dilation rate.
  • the stride may define a gap between spatial locations from where patches are sampled.
  • the stride may define a mapping between (h_out, w_out) and (h_in, w_in), i.e., to which patch each element in an output feature map is connected.
  • a convolutional layer having a stride that is larger than 1 may result in a lower spatial resolution of an output feature map of the convolutional layer as compared to that of an input feature map of the convolutional layer.
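The hyper-parameter relations above can be captured in two small helpers. This is an illustrative sketch (padding is assumed to be zero; the function names are not from the patent):

```python
# Hedged helpers for the hyper-parameter relations above: the receptive field
# from the kernel size M x N and dilation rate a x b, and the output
# resolution from the stride s x t (no padding assumed).
def receptive_field(M, N, a, b):
    return (M + (M - 1) * (a - 1), N + (N - 1) * (b - 1))

def output_size(H_in, W_in, M, N, a, b, s, t):
    rf_h, rf_w = receptive_field(M, N, a, b)
    return ((H_in - rf_h) // s + 1, (W_in - rf_w) // t + 1)

# A 3x3 kernel with dilation rate 2x2 has a 5x5 receptive field, and a
# stride of 2x2 reduces the output resolution relative to stride 1x1.
assert receptive_field(3, 3, 2, 2) == (5, 5)
assert output_size(11, 11, 3, 3, 1, 1, 2, 2) == (5, 5)
assert output_size(11, 11, 3, 3, 1, 1, 1, 1) == (9, 9)
```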
  • the pooling layer 304 may be configured to gradually reduce a size of a representation space and thus to reduce the number of parameters and an amount of computations in the neural network 300, thereby controlling or avoiding over-fitting.
  • the pooling layer 304 may be a parameter-free layer, and perform a calculation of a predefined or fixed function on its input according to a type of the pooling layer 304. For example, a max-pooling layer may perform a maximum operation on its input, and an average-pooling layer may perform an average (AVG) operation on its input.
  • the pooling layer 304 may operate on each feature map independently.
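Both pooling variants above apply the same fixed, parameter-free reduction over windows; only the reduction function differs. A hedged sketch for non-overlapping 2×2 windows (the window size is an assumption for illustration):

```python
import numpy as np

# Hedged sketch of parameter-free pooling: a fixed function (max or mean)
# is applied to non-overlapping 2x2 windows of each feature map independently.
def pool2x2(x, op):
    H, W, C = x.shape
    windows = x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2, C)
    return op(windows, axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4, 1)
# first window covers {0, 1, 4, 5}
assert pool2x2(x, np.max)[0, 0, 0] == 5.0
assert pool2x2(x, np.mean)[0, 0, 0] == 2.5
assert pool2x2(x, np.max).shape == (2, 2, 1)   # spatial resolution halved
```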
  • the neural network 300 may further include one or more activation layers 308 (also called rectified linear unit layers 308) to enable the neural network 300 to model non-linear problems (only one activation layer is shown in FIG. 3 for the sake of simplicity).
  • each activation layer 308 may apply a non-linear activation function on its input to achieve a non-linear input-to-output conversion.
  • the neural network 300 may further include one or more batch normalization layers 310 (only one batch normalization layer is shown in FIG. 3 for the sake of simplicity).
  • Each batch normalization layer 310 may be configured to normalize an output of a previous layer of the neural network 300 by subtracting a batch mean and dividing the output by a batch standard deviation, thus increasing the stability of the neural network 300.
  • the batch normalization layer 310 may add two trainable parameters (a standard deviation parameter and a mean parameter) to each layer, so that a normalized output may be multiplied by the standard deviation parameter and added with the mean parameter.
  • the batch normalization layer 310 may be configured to normalize a distribution of each element of an input (e.g., an input vector) across batch data, and to reduce over-fitting through regularization effects, which can alleviate internal covariate shift problems.
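The normalize-then-rescale step described above can be sketched in a few lines. This is a hedged sketch of the forward pass only (the trainable parameters are conventionally called gamma and beta, names not used in the text above):

```python
import numpy as np

# Hedged sketch of batch normalization: subtract the batch mean, divide by
# the batch standard deviation, then scale by the trainable standard-deviation
# parameter (gamma) and shift by the trainable mean parameter (beta).
def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance per element
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 8))   # batch of 64 vectors
y = batch_norm(x, gamma=1.0, beta=0.0)
assert np.allclose(y.mean(axis=0), 0.0, atol=1e-6)
assert np.allclose(y.std(axis=0), 1.0, atol=1e-2)
```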
  • the over-parameterization system 102 may perform training of the neural network 300 to optimize model parameters of the neural network 300 in order to achieve a desired performance (e.g., a high accuracy of inference or prediction) of the neural network 300.
  • the training of the neural network 300 may include a process of finding desired or optimal parameters of the neural network 300 in a predefined hypothesis space to obtain a desired or optimal performance.
  • the over-parameterization system 102 may select an initialization method, and initialize the parameters of the neural network 300 according to the selected initialization method.
  • the initialization method may include, but is not limited to, initializing the parameters of the neural network 300 with constants (e.g., zeros, ones, or a specified constant), initializing the parameters of the neural network 300 with random values from a predefined distribution (e.g., a normal distribution, a uniform distribution, etc.), initializing the parameters of the neural network 300 based on a specified initialization scheme (such as Xavier initialization, He initialization, etc.), etc.
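Two of the initialization options listed above can be sketched as follows; the Xavier variant shown is the common uniform form scaled by fan-in and fan-out (an assumption, since the text does not specify which form is used):

```python
import numpy as np

# Hedged sketch of two initialization methods: constant initialization and
# Xavier (Glorot) uniform initialization for a weight matrix of shape
# (fan_out, fan_in).
def constant_init(shape, value=0.0):
    return np.full(shape, value)

def xavier_uniform(shape, rng):
    fan_out, fan_in = shape
    limit = np.sqrt(6.0 / (fan_in + fan_out))      # keeps activation variance stable
    return rng.uniform(-limit, limit, size=shape)

rng = np.random.default_rng(0)
W = xavier_uniform((256, 128), rng)
assert W.shape == (256, 128)
assert np.abs(W).max() <= np.sqrt(6.0 / (128 + 256))
assert np.all(constant_init((4, 4)) == 0.0)
```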
  • the over-parameterization system 102 may perform forward propagation or feed-forward propagation to pass training inputs (such as a plurality of training images with known objects in image or object recognition, for example) to the neural network 300 and obtain respective estimated outputs from the neural network 300.
  • the over-parameterization system 102 may then compute the performance of the neural network 300 based on a loss function or an error function (e.g., an accuracy of the neural network 300 based on a comparison between the estimated outputs and the corresponding known objects of the plurality of training images in the image or object recognition in this example).
  • the over-parameterization system 102 may compute a derivative of the loss function, for example, to determine error information of the training inputs that is obtained under current values of the parameters of the neural network 300.
  • the over-parameterization system 102 may perform backward propagation or back-propagation to propagate the error information backward through the neural network 300, and adjust or update the values of the parameters of the neural network 300 according to a gradient descent algorithm, for example.
  • the over-parameterization system 102 may continue to iterate the foregoing operations (i.e., from performing the forward propagation to adjusting or updating the values of the parameters of the neural network 300) until the values of the parameters of the neural network 300 converge, or until a predefined number of iterations is reached.
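The iteration described above (forward propagation, loss, derivative, gradient-descent update, repeat until convergence or an iteration cap) can be sketched for a single linear layer with a squared-error loss; the layer and loss are illustrative stand-ins for the full network:

```python
import numpy as np

# Hedged sketch of the training loop described above, for one linear layer.
rng = np.random.default_rng(0)
X = rng.normal(size=(128, 4))
W_true = rng.normal(size=(4, 1))
y = X @ W_true                             # synthetic training targets

W = np.zeros((4, 1))                       # initialization
lr, max_iters = 0.1, 500
for _ in range(max_iters):
    y_hat = X @ W                          # forward propagation
    loss = ((y_hat - y) ** 2).mean()       # loss / error function
    grad = 2 * X.T @ (y_hat - y) / len(X)  # derivative of the loss
    W -= lr * grad                         # gradient-descent update
    if loss < 1e-10:                       # convergence check
        break

assert np.allclose(W, W_true, atol=1e-3)   # parameters have converged
```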
  • the over-parameterization system 102 may allow the neural network 300 to perform inferences or predictions for new inputs, e.g., new images with objects to be classified.
  • the over-parameterization system 102 may represent a single parameter matrix W of a certain layer (for example, a convolutional layer, or a fully connected layer, etc.) of a neural network as a multiplication of two parameter matrices W1 and W2, i.e., W = W1W2. One of W1 and W2 denotes a vanilla matrix of the layer, and has a same shape (i.e., a same number of rows and a same number of columns) as that of W. The other parameter matrix is an over-parameterization matrix.
  • an over-parameterization matrix may be a left-multiplication matrix or a right-multiplication matrix. Both W and W1W2 represent the same underlying linear transformation. Transforming W to W1W2 is considered as over-parameterization because the total number of parameters included in W1 and W2 is apparently more than that of W.
  • a parameter matrix may be reshaped from a tensor, e.g., a convolution kernel tensor of a convolutional layer.
  • different channels of a parameter matrix of a convolutional layer may be over-parameterized, leading to different types of channel-wise over-parameterization, such as an in-channel-wise over-parameterization, an out-channel-wise over-parameterization, or an all-channel-wise over-parameterization, for example.
  • matrices (e.g., W1 and W2) involved in over-parameterization may be optimized simultaneously in a training phase of the neural network, and may be combined together into a single parameter matrix (e.g., W) after the training.
  • over-parameterization does not increase the amount of computation in the inference phase.
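The collapse of the two factors after training can be illustrated with plain matrices; the shapes and the left-multiplication form below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
rows, cols = 8, 16
W1 = rng.standard_normal((rows, rows))   # over-parameterization matrix (left-multiplication)
W2 = rng.standard_normal((rows, cols))   # vanilla matrix, same shape as the collapsed matrix
x = rng.standard_normal(cols)

# During training both factors are trainable; together they represent
# one and the same linear transformation.
y_train = W1 @ (W2 @ x)

# After training the factors are combined into a single matrix, so
# inference costs no more than the un-over-parameterized layer.
W = W1 @ W2
y_infer = W @ x

assert np.allclose(y_train, y_infer)
assert W1.size + W2.size > W.size        # more parameters only during training
```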
  • the over-parameterization system 102 may perform a variety of different types of over-parameterization for a layer (a fully connected layer or a convolutional layer) of the neural network.
  • the variety of different types of over-parameterization may include, but is not limited to, a full-row over-parameterization, a full-column over-parameterization, a channel-wise over-parameterization (which may include an in-channel-wise over-parameterization, an out-channel-wise over-parameterization, an all-channel-wise over-parameterization, for example), a depth-wise over-parameterization, etc.
  • the full-row over-parameterization may include an over-parameterization operating on an entire row of a parameter matrix of a layer (e.g., a fully connected layer or a convolutional layer) of a neural network.
  • the full-column over-parameterization may include an over-parameterization operating on an entire column of a parameter matrix of a layer (e.g., a fully connected layer or a convolutional layer) of a neural network.
  • the in-channel-wise over-parameterization may include an over-parameterization that operates only on a C_in channel part of a parameter matrix (with a dimension of C_out × (M×N×C_in)) of a convolutional layer.
  • the out-channel-wise over-parameterization may include an over-parameterization that operates only on a C_out channel part of a parameter matrix (with a dimension of C_out × (M×N×C_in)) of a convolutional layer.
  • a parameter matrix of a convolutional layer may be over-parameterized by more than one over-parameterization matrix.
  • the all-channel-wise over-parameterization may include an over-parameterization that operates on both C_in and C_out channel parts of a parameter matrix (with a dimension of C_out × (M×N×C_in)) of a convolutional layer.
  • the depth-wise over-parameterization may include an over-parameterization that operates on spatial dimensions (i.e., within a channel) of a parameter matrix of a convolutional layer.
  • the depth-wise over-parameterization may employ a same over-parameterization matrix for each input channel, i.e., the over-parameterization matrix is the same for each input channel.
  • a parameter matrix for a convolutional layer may be over-parameterized on a part that corresponds to spatial dimensions of input patches, i.e., the M×N dimensions, with a shared over-parameterization matrix applied within each channel.
  • the depth-wise over-parameterization may employ different over-parameterization matrices for different input channels, i.e., an over-parameterization matrix of one channel may be different from an over-parameterization matrix of another channel. For example, a parameter matrix for a convolutional layer may be over-parameterized with a separate over-parameterization matrix for each channel.
  • the independent depth-wise over-parameterization may employ different over-parameterization matrices for at least two different input channels.
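The shared and independent variants can be contrasted with a small NumPy sketch; the sizes and the einsum formulation (used here as a stand-in for the depth-wise operator) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
M, N, C_in, D_mul = 3, 3, 4, 9
P = rng.standard_normal((M * N, C_in))    # one input patch

# Shared variant: a single (M*N) x D_mul matrix reused for every channel.
D_shared = rng.standard_normal((M * N, D_mul))
out_shared = np.einsum('md,mc->dc', D_shared, P)

# Independent variant: a separate matrix per input channel.
D_indep = rng.standard_normal((M * N, D_mul, C_in))
out_indep = np.einsum('mdc,mc->dc', D_indep, P)

# Same output shape either way; the independent variant has C_in times
# as many trainable over-parameterization parameters.
assert out_shared.shape == out_indep.shape == (D_mul, C_in)
assert D_indep.size == C_in * D_shared.size
```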
  • a convolutional layer of a neural network may process the input feature map in a sliding window manner, which may include applying a set of convolution kernels to a patch having a same size as that of the convolution kernels at each window position.
  • a patch is denoted as a 2-dimensional tensor P with a dimension of (M×N) × C_in, and trainable kernels of a convolutional layer may be represented as a 3-dimensional tensor W with a dimension of C_out × (M×N) × C_in, where M and N are spatial dimensions of the patch, C_in is the number of channels in the input feature map of the convolutional layer, and C_out is the number of channels in an output feature map of the convolutional layer.
  • FIG. 4 illustrates an example normal convolution.
  • An output of a normal convolution operator (which is represented as *) may be a C_out-dimensional feature O = W * P.
  • each of the C_in channels of the input patch tensor may be involved in D_mul separate or individual dot products, such that each input patch channel, i.e., an (M×N)-dimensional feature, is converted into a D_mul-dimensional feature. D_mul is called a depth multiplier herein.
  • FIG. 5 illustrates an example depth-wise convolution.
  • a trainable depth-wise convolution kernel may be represented as a 3-dimensional tensor D with a dimension of (M×N) × D_mul × C_in. Since each input channel may be converted into a D_mul-dimensional feature, an output of a depth-wise convolution operator (which is represented as ∘) may be a (D_mul × C_in)-dimensional feature O = D ∘ P.
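A single-patch depth-wise convolution can be sketched with these shapes; the einsum formulation below is an illustrative stand-in for the ∘ operator:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, C_in, D_mul = 3, 3, 2, 4
P = rng.standard_normal((M * N, C_in))         # input patch
D = rng.standard_normal((M * N, D_mul, C_in))  # depth-wise kernel

# Each input channel, an (M*N)-dimensional feature, takes part in
# D_mul separate dot products, producing a D_mul-dimensional feature.
out = np.einsum('mdc,mc->dc', D, P)
assert out.shape == (D_mul, C_in)

# Channel c of the output depends only on channel c of the patch.
assert np.allclose(out[:, 0], D[:, :, 0].T @ P[:, 0])
```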
  • a convolutional layer that is over-parameterized with a depth-wise over-parameterization may be a composition of a depth-wise convolution with a trainable kernel D and a normal convolution with a trainable kernel W with a dimension of C_out × D_mul × C_in, where D_mul ≥ M×N.
  • an output of a depth-wise over-parameterization operator may be the same as that of a convolutional layer, i.e., a C_out-dimensional feature O.
  • FIGS. 6A and 6B illustrate two example ways of applying a depth-wise over-parameterization operator. As shown in FIGS. 6A and 6B, the depth-wise over-parameterization operator may be applied in two mathematically equivalent ways, i.e., O = W * (D ∘ P) = (D^T ∘ W) * P.
  • the first manner, i.e., O = W * (D ∘ P), is called a feature composition as shown in FIG. 6A, and involves first applying the trainable kernel D to the input patch P by a depth-wise convolution operator ∘ to obtain a transformed feature P' = D ∘ P, and then applying the trainable kernel W to the transformed feature P' by a normal convolution operator * to obtain O.
  • the second manner, i.e., O = (D^T ∘ W) * P, is called a kernel composition as shown in FIG. 6B, and involves first applying the trainable kernel D to transform W by a depth-wise convolution operator ∘ to obtain W' = D^T ∘ W, and then applying a normal convolution operator * between W' and P to obtain O.
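The mathematical equivalence of the feature composition and the kernel composition can be verified numerically on a single window position. The NumPy einsum calls below are illustrative stand-ins for the ∘ and * operators, with shapes assumed from the definitions above (patch (M×N) × C_in, depth-wise kernel (M×N) × D_mul × C_in, convolution kernel C_out × D_mul × C_in):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, C_in, C_out, D_mul = 3, 3, 4, 8, 9        # D_mul >= M*N
P = rng.standard_normal((M * N, C_in))          # input patch
W = rng.standard_normal((C_out, D_mul, C_in))   # normal-convolution kernel
D = rng.standard_normal((M * N, D_mul, C_in))   # depth-wise kernel

# Feature composition: transform the patch first, then convolve.
P_t = np.einsum('mdc,mc->dc', D, P)             # D o P
o_feature = np.einsum('odc,dc->o', W, P_t)      # W * (D o P)

# Kernel composition: fold D into W first, then convolve the raw patch.
W_t = np.einsum('mdc,odc->omc', D, W)           # D^T o W
o_kernel = np.einsum('omc,mc->o', W_t, P)       # (D^T o W) * P

assert o_feature.shape == (C_out,)
assert np.allclose(o_feature, o_kernel)
```

Both paths produce the same C_out-dimensional output; they differ only in where the extra depth-wise multiplication is spent.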
  • a receptive field of a depth-wise over-parameterized convolutional layer is M×N, and an interface of a depth-wise over-parameterized convolutional layer is the same as an interface of a normal convolutional layer. Therefore, a depth-wise over-parameterized convolutional layer may easily replace a normal convolutional layer in a neural network. Since a depth-wise over-parameterization operator is differentiable, both D and W of a depth-wise over-parameterized convolutional layer may be optimized using, for example, a gradient descent based optimizer that is commonly used for training neural networks (such as CNNs) with convolutional layers.
  • the feature composition and the kernel composition used for performing the depth-wise over-parameterization operator may lead to different training efficiencies of a depth-wise over-parameterized convolutional layer, and hence of the neural network that is involved.
  • the feature composition and the kernel composition may be compared in terms of the numbers of multiply and accumulate (MACC) operations that they incur. The MACC costs for the feature composition and the kernel composition depend on values of hyper-parameters that are involved. Since H×W >> C_out and D_mul >> M×N (where H and W are spatial dimensions of the input feature map), the kernel composition may generally incur fewer MACC operations as compared to the feature composition, and an amount of memory consumed by the transformed kernel W' in the kernel composition may normally be smaller than that consumed by the transformed feature D ∘ P in the feature composition. Therefore, the kernel composition may be selected for performing the depth-wise over-parameterization operator when training the neural network.
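The MACC comparison can be made concrete with back-of-the-envelope counts for one forward pass over an H×W feature map. The layer sizes below are hypothetical, and the cost formulas are simple operation counts of the two compositions: the feature composition pays for the depth-wise transform at every window position, while the kernel composition folds the kernels once per forward pass:

```python
# Hypothetical but typical layer sizes, for illustration only.
H, W_sp, M, N, C_in, C_out, D_mul = 56, 56, 3, 3, 64, 64, 9

# Feature composition: depth-wise transform of every patch, then the
# normal convolution on the transformed D_mul-sized features.
macc_feature = H * W_sp * C_in * D_mul * (M * N + C_out)

# Kernel composition: fold the kernels once, then a plain convolution
# reading the raw (M*N)-sized patches.
macc_kernel = C_out * M * N * D_mul * C_in + H * W_sp * C_out * M * N * C_in

assert macc_kernel < macc_feature
```

With these sizes the kernel composition needs roughly 116M MACCs versus roughly 132M for the feature composition, matching the preference stated above.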
  • the depth-wise over-parameterization may further be applied over a depth-wise convolution, which leads to a depth-wise over-parameterized depth-wise convolutional layer.
  • FIGS. 7A and 7B illustrate example ways of obtaining a depth-wise over-parameterized depth-wise convolutional layer.
  • an operator of depth-wise over-parameterization over depth-wise convolution may be obtained or computed in two mathematically equivalent ways, analogous to the feature composition and the kernel composition described above.
  • training of a neural network including a depth-wise over-parameterized depth-wise convolutional layer may be similar to training of a neural network including a depth-wise over-parameterized convolutional layer, and both D and W of the depth-wise over-parameterized depth-wise convolutional layer may be optimized using, for example, a gradient descent based optimizer that is commonly used for training neural networks (such as CNNs) with convolutional layers. After training of the neural network is completed, D and W may be combined to obtain a single matrix W', which may then be used for making inferences.
  • FIG. 8 shows a schematic diagram depicting an example method of over-parameterization.
  • the method of FIG. 8 may, but need not, be implemented in the environment of FIG. 1 and using the system of FIG. 2, with reference to the neural network of FIG. 3, and the convolutions of FIGS. 4-7.
  • the method 800 is described with reference to FIGS. 1-7. However, the method 800 may alternatively be implemented in other environments and/or using other systems.
  • the method 800 is described in the general context of computer-executable instructions.
  • computer-executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, and the like that perform particular functions or implement particular abstract data types.
  • each of the example methods is illustrated as a collection of blocks in a logical flow graph representing a sequence of operations that can be implemented in hardware, software, firmware, or a combination thereof.
  • the order in which the method is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method, or alternate methods. Additionally, individual blocks may be omitted from the method without departing from the spirit and scope of the subject matter described herein.
  • the blocks represent computer instructions that, when executed by one or more processors, perform the recited operations.
  • some or all of the blocks may represent application specific integrated circuits (ASICs) or other physical components that perform the recited operations.
  • the over-parameterization system 102 may obtain information of a neural network that includes one or more convolutional layers and one or more other layers.
  • the over-parameterization system 102 may receive or obtain information of a neural network to be trained from a database (such as the parameter database 210), or a client device (such as the client device 108).
  • the information of the neural network may include, but is not limited to, a type of the neural network, hyper-parameters of the neural network, initial values of the hyper-parameters of the neural network, trainable parameters of the neural network, a structure (such as the number of layers, types of layers, etc.), etc.
  • the over-parameterization system 102 may initialize the trainable parameters of the neural network randomly or based on a priori knowledge.
  • the over-parameterization system 102 or the database 210 may store information of one or more trained neural networks that are similar to the neural network to be trained.
  • the over-parameterization system 102 may initialize the trainable parameters of the neural network based at least in part on the information of the one or more trained neural networks that are similar to the neural network to be trained.
  • the neural network may include, but is not limited to, one or more convolutional layers, a plurality of fully connected layers, one or more pooling layers, one or more activation layers, one or more batch normalization layers, etc.
  • Examples of the neural network may include, but are not limited to, a convolutional neural network, or any neural networks having one or more convolutional layers, etc.
  • the over-parameterization system 102 may perform depth-wise over-parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth-wise over-parameterization convolutional layer.
  • a number of parameters that are associated with the depth-wise over-parameterization convolutional layer and are trainable is higher than a number of parameters of the at least one convolutional layer.
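The increase in trainable parameters during training (and only during training, since the matrices are later folded back) can be illustrated with a hypothetical layer size:

```python
# Hypothetical layer sizes, for illustration only.
M, N, C_in, C_out = 3, 3, 64, 64
D_mul = M * N                    # an assumed (minimum) depth multiplier

params_normal = C_out * M * N * C_in                          # plain convolutional layer
params_dw_op = (M * N) * D_mul * C_in + C_out * D_mul * C_in  # depth-wise kernel + conv kernel

assert params_dw_op > params_normal   # more trainable parameters while training
```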
  • the over-parameterization system 102 may transform a parameter matrix associated with the at least one convolutional layer into at least two separate matrices, and associate the at least two separate matrices with the depth-wise over-parameterization convolutional layer.
  • the parameter matrix associated with the at least one convolutional layer may include a plurality of channels representing a color space.
  • the over-parameterization system 102 may perform over-parameterization on spatial dimensions of a parameter matrix associated with the at least one convolutional layer to obtain at least two separate matrices according to a depth-wise over-parameterization as described in the foregoing description.
  • the over-parameterization system 102 may perform the over-parameterization independently on different channels of the plurality of channels to obtain different over-parameterization matrices for the different channels. In implementations, the over-parameterization system 102 may perform an identical over-parameterization on the different channels to obtain a same over-parameterization matrix for the different channels.
  • the over-parameterization system 102 may further perform channel-wise over-parameterization (such as in-channel-wise over-parameterization, out-channel-wise over-parameterization, or all-channel-wise over-parameterization, etc.) on the at least one convolutional layer and/or another convolutional layer of the one or more convolutional layers.
  • the over-parameterization system 102 may obtain training data, and train the neural network using the training data according to a training method.
  • the over-parameterization system 102 may obtain the training data from the training database 212, or from a designated storage location indicated or provided by the client device.
  • depending on an intended application, different training data may be used. Examples of the application may include, but are not limited to, an image classification, an object detection, or a semantic segmentation, etc.
  • the training data may include a plurality of images (which may be color images and/or grayscale images) with known results (e.g., known information of the images, such as respective classes of objects in the images, etc. ) .
  • the training method may include a variety of training or learning algorithms that may be used for training neural networks.
  • Examples of the training method may include, but are not limited to, a backward propagation algorithm, a gradient descent algorithm, or a combination thereof, etc.
  • with the depth-wise over-parameterization, the speed of convergence for obtaining optimal or desired trainable parameters of the neural network is higher, thus increasing the speed of training the neural network.
  • when initial values of hyper-parameters of a neural network are the same, the accuracy of a neural network that is trained using a depth-wise over-parameterization is found to be higher as compared to the accuracy of a trained neural network without using the depth-wise over-parameterization.
  • the over-parameterization system 102 may selectively combine the parameters associated with the depth-wise over-parameterization layer to obtain a new convolutional layer, and replace the depth-wise over-parameterization layer with the new convolutional layer in the trained neural network, the trained neural network being configured to perform inferences for a specific application based on input data.
  • the over-parameterization system 102 may combine the at least two separate matrices associated with the depth-wise over-parameterization convolutional layer into a single parameter matrix, and associate the single parameter matrix with the new convolutional layer. Since the at least two matrices are combined into a single matrix, the trained neural network that is obtained after such combination has lower computation and memory costs, and avoids extra computation and memory costs when the trained neural network is used for performing inferences or predictions in the intended application.
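The folding step can be sketched as follows, reusing the kernel-composition identity from the earlier description; the einsum calls and shapes are illustrative stand-ins, not the disclosure's notation:

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, C_in, C_out, D_mul = 3, 3, 4, 8, 9
W = rng.standard_normal((C_out, D_mul, C_in))   # trained conv kernel
D = rng.standard_normal((M * N, D_mul, C_in))   # trained depth-wise kernel

# Combine the two matrices into a single normal-convolution kernel.
W_fused = np.einsum('mdc,odc->omc', D, W)
assert W_fused.shape == (C_out, M * N, C_in)

# On any input patch, the fused kernel reproduces the over-parameterized
# layer's output at the cost of a plain convolution.
P = rng.standard_normal((M * N, C_in))
o_over = np.einsum('odc,dc->o', W, np.einsum('mdc,mc->dc', D, P))
o_fused = np.einsum('omc,mc->o', W_fused, P)
assert np.allclose(o_fused, o_over)
```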
  • some or all of the above method blocks may be implemented or performed by one or more specific processing units of the over-parameterization system 102.
  • the over-parameterization system 102 may employ a tensor processing unit, a graphics processing unit, and/or a neural processing unit to perform tensor and matrix computations, thus further improving the performance of the over-parameterization system 102, and improving the speed of training the neural network in a training phase.
  • the over-parameterization system 102 may further employ such specific processing units to perform tensor and matrix computations involved in making inferences or predictions by the trained neural network in an inference phase.
  • Clause 1 A method implemented by one or more computing devices, the method comprising: obtaining information of a neural network that includes one or more convolutional layers and one or more other layers; performing depth-wise over-parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth-wise over-parameterization convolutional layer; training the neural network using training data according to a training method; and selectively combining the parameters associated with the depth-wise over-parameterization layer to obtain a new convolutional layer, and replacing the depth-wise over-parameterization layer by the new convolutional layer in the trained neural network, the trained neural network being configured to perform inferences for a specific application based on input data.
  • Clause 2 The method of Clause 1, wherein performing the depth-wise over-parameterization comprises: transforming a parameter matrix associated with the at least one convolutional layer into at least two separate matrices; and associating the at least two separate matrices with the depth-wise over-parameterization convolutional layer.
  • Clause 3 The method of Clause 2, wherein selectively combining the parameters associated with the depth-wise over-parameterization layer to obtain the new convolutional layer comprises: combining the at least two separate matrices into a single parameter matrix; and associating the single parameter matrix with the new convolutional layer.
  • Clause 4 The method of Clause 1, wherein performing the depth-wise over-parameterization comprises performing over-parameterization on spatial dimensions of a parameter matrix associated with the at least one convolutional layer to obtain at least two separate matrices.
  • Clause 5 The method of Clause 1, wherein the parameter matrix associated with the at least one convolutional layer comprises a plurality of channels representing a color space, and performing the over-parameterization on the spatial dimensions of the parameter matrix comprises: performing the over-parameterization independently on different channels of the plurality of channels to obtain different over-parameterization matrices for the different channels; or performing an identical over-parameterization on the different channels to obtain a same over-parameterization matrix for the different channels.
  • Clause 6 The method of Clause 1, further comprising performing channel-wise over-parameterization on the at least one convolutional layer and/or another convolutional layer of the one or more convolutional layers.
  • Clause 7 The method of Clause 1, wherein the one or more other layers comprise at least one of: one or more fully connected layers, one or more pooling layers, one or more activation layers, or one or more batch normalization layers.
  • Clause 8 The method of Clause 1, wherein a number of parameters that are associated with the depth-wise over-parameterization convolutional layer and are trainable is higher than a number of parameters of the at least one convolutional layer.
  • Clause 9 The method of Clause 1, wherein the specific application comprises an image classification, an object detection, or a semantic segmentation.
  • Clause 10 One or more computer readable media storing executable instructions that, when executed by one or more processors, cause the one or more processors to perform acts comprising: obtaining information of a neural network that includes one or more convolutional layers and one or more other layers; performing depth-wise over-parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth-wise over-parameterization convolutional layer; training the neural network using training data according to a training method; and selectively combining the parameters associated with the depth-wise over-parameterization layer to obtain a new convolutional layer, and replacing the depth-wise over-parameterization layer by the new convolutional layer in the trained neural network, the trained neural network being configured to perform inferences for a specific application based on input data.
  • Clause 11 The one or more computer readable media of Clause 10, wherein performing the depth-wise over-parameterization comprises: transforming a parameter matrix associated with the at least one convolutional layer into at least two separate matrices; and associating the at least two separate matrices with the depth-wise over-parameterization convolutional layer.
  • Clause 12 The one or more computer readable media of Clause 11, wherein selectively combining the parameters associated with the depth-wise over-parameterization layer to obtain the new convolutional layer comprises: combining the at least two separate matrices into a single parameter matrix; and associating the single parameter matrix with the new convolutional layer.
  • Clause 13 The one or more computer readable media of Clause 10, wherein performing the depth-wise over-parameterization comprises performing over-parameterization on spatial dimensions of a parameter matrix associated with the at least one convolutional layer to obtain at least two separate matrices.
  • Clause 14 The one or more computer readable media of Clause 10, wherein the parameter matrix associated with the at least one convolutional layer comprises a plurality of channels representing a color space, and performing the over-parameterization on the spatial dimensions of the parameter matrix comprises: performing the over-parameterization independently on different channels of the plurality of channels to obtain different over-parameterization matrices for the different channels; or performing an identical over-parameterization on the different channels to obtain a same over-parameterization matrix for the different channels.
  • Clause 15 The one or more computer readable media of Clause 10, wherein the acts further comprise performing channel-wise over-parameterization on the at least one convolutional layer and/or another convolutional layer of the one or more convolutional layers.
  • Clause 16 The one or more computer readable media of Clause 10, wherein the one or more other layers comprise at least one of: one or more fully connected layers, one or more pooling layers, one or more activation layers, or one or more batch normalization layers.
  • Clause 17 A system comprising: one or more processors; and memory storing executable instructions that, when executed by the one or more processors, cause the one or more processors to perform acts comprising: obtaining information of a neural network that includes one or more convolutional layers and one or more other layers; performing depth-wise over-parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth-wise over-parameterization convolutional layer; training the neural network using training data according to a training method; and selectively combining the parameters associated with the depth-wise over-parameterization layer to obtain a new convolutional layer, and replacing the depth-wise over-parameterization layer by the new convolutional layer in the trained neural network, the trained neural network being configured to perform inferences for a specific application based on input data.
  • Clause 18 The system of Clause 17, wherein performing the depth-wise over-parameterization comprises: transforming a parameter matrix associated with the at least one convolutional layer into at least two separate matrices; and associating the at least two separate matrices with the depth-wise over-parameterization convolutional layer.
  • Clause 19 The system of Clause 17, wherein selectively combining the parameters associated with the depth-wise over-parameterization layer to obtain the new convolutional layer comprises: combining the at least two separate matrices into a single parameter matrix; and associating the single parameter matrix with the new convolutional layer.

Abstract

An over-parameterization may be performed on spatial dimensions of an original parameter matrix of a convolutional layer of a neural network, to convert the original parameter matrix of the convolutional layer into two parameter matrices: a first parameter matrix and a second parameter matrix. Training of the neural network may then be performed to determine values for learnable parameters of the first parameter matrix and the second parameter matrix. After training is completed, the first parameter matrix and the second parameter matrix may be combined as a single parameter matrix and treated as a single convolutional layer, which may then be used in subsequent inference applications of the neural network.

Description

DEPTH-WISE OVER-PARAMETERIZATION

BACKGROUND
Convolutional neural networks (CNNs) are a type of deep learning method capable of expressing highly complicated functions. A convolutional neural network may receive an image as an input, assign importance (i.e., learnable weights and biases) to various aspects/objects in the image, and differentiate one aspect/object from other aspects/objects. Accordingly, convolutional neural networks have been widely used in a variety of computer vision applications, such as image classification, object detection, and semantic segmentation.
A convolutional neural network is made up of a plurality of distinct layers. In general, an accuracy of a convolutional neural network relies heavily on the depth of the network, i.e., the larger the number of layers, the higher the accuracy of the network will be. However, increasing the number of layers in a convolutional neural network may increase the complexity of the network, and lead to a higher amount of computations, an increased cost (in terms of processing and storage resources) , and a longer delay in computations, etc. Furthermore, the increased complexity of the convolutional neural network may increase the time taken for convergence of different variables or parameters of the convolutional neural network (i.e., the time taken for training the convolutional neural network) , and the time taken for computing a result when a prediction or inference is subsequently performed using the convolutional neural network.
SUMMARY
This summary introduces simplified concepts of depth‐wise over‐parameterization, which will be further described below in the Detailed Description. This summary is not intended to identify essential features of the claimed subject matter, nor is it intended for use in limiting the scope of the claimed subject matter.
This disclosure describes example implementations of depth‐wise over‐parameterization. In implementations, an over‐parameterization may be performed on spatial dimensions of an original parameter matrix of a convolutional layer of a neural network, to convert the original parameter matrix of the convolutional layer into two parameter matrices: a first parameter matrix and a second parameter matrix. In implementations, training may then be performed on the neural network to determine values for learnable parameters of the first parameter matrix and the second parameter matrix. After training is completed, the first parameter matrix and the second parameter matrix may be combined as a single parameter matrix and treated as a single convolutional layer, which may then be used in subsequent inference applications of the neural network.
BRIEF DESCRIPTION OF THE DRAWINGS
The detailed description is set forth with reference to the accompanying figures. In the figures, the left‐most digit (s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items.
FIG. 1 illustrates an example environment in which an example over‐parameterization system may be used.
FIG. 2A illustrates the example over‐parameterization system in more detail.
FIG. 2B illustrates an example neural network processing architecture that can be used for implementing the example over‐parameterization system.
FIG. 2C illustrates an example cloud system that incorporates the example neural network processing architecture to implement the example over‐parameterization system.
FIG. 3 illustrates an example neural network in which an over‐parameterization may be performed.
FIG. 4 illustrates an example normal convolution.
FIG. 5 illustrates an example depth‐wise convolution.
FIGS. 6A and 6B illustrate example ways of obtaining a depth‐wise over‐parameterization operator.
FIGS. 7A and 7B illustrate example ways of obtaining a depth‐wise over‐parameterized depth‐wise convolutional layer.
FIG. 8 illustrates an example method of over‐parameterization.
DETAILED DESCRIPTION
Overview
As noted above, although existing convolutional neural networks can model complicated functions, and are useful in a variety of applications such as computer vision applications, the accuracies of the existing convolutional neural networks rely heavily on corresponding depths of the neural networks. An existing convolutional neural network therefore suffers severely from limitations due to tradeoffs between the accuracy and the speed of making an inference or prediction by the convolutional neural network.
This disclosure describes an example over‐parameterization system. In implementations, the over‐parameterization system may virtually convert a convolutional layer of a neural network into at least two linear layers, which are combined back into a single layer after training. In implementations, the over‐parameterization system may perform an over‐parameterization on spatial dimensions of a parameter matrix that represents the convolutional layer to obtain a first parameter matrix and a second parameter matrix, which virtually correspond to the two linear layers. In implementations, the first parameter matrix may constitute an additional layer that is related to depth‐wise over‐parameterization, whereas the second parameter matrix may have a size that is the same as that of the original parameter matrix of the convolutional layer.
In implementations, after obtaining the at least two linear layers, the over‐parameterization system may initialize trainable parameters of the neural network randomly or based on a priori knowledge. Examples of the trainable parameters may include, but are not limited to, additional parameters included in the first parameter matrix, parameters of the second parameter matrix, and model parameters of other layers of the neural network. In implementations, after initializing the trainable parameters of the neural network, the over‐parameterization system may then perform training of the entire neural network based on a training algorithm (e.g., a gradient descent based optimization) using training data to obtain resulting values of the trainable parameters, i.e., a trained neural network.
In implementations, after obtaining the trained neural network, the over‐parameterization system may combine the at least two linear layers back into the convolutional layer. For example, the over‐parameterization system may linearly combine the first parameter matrix and the second parameter matrix into a single parameter matrix, which represents the convolutional layer. The over‐parameterization system or another system may then employ the trained neural network to perform inferences or predictions according to an intended application (such as an image classification, etc. ) .
As described above, the over‐parameterization system involves increasing the number of trainable parameters associated with a convolutional layer of a neural network by performing an over‐parameterization on a parameter matrix of the convolutional layer to form at least two parameter matrices. Although additional trainable parameters are added, the over‐parameterization (i.e., adding a linear layer) actually helps accelerate the convergence of trainable parameters of the neural network, and thus increases the speed of training. Furthermore, since the at least two parameter matrices are combined into a single parameter matrix (i.e., at least two linear layers represented by the at least two parameter matrices are combined back into a single convolutional layer) after training, an amount of computations associated with such single convolutional layer is equivalent or similar to that of a convolutional layer without the over‐parameterization being performed, when the trained neural network is employed for performing inferences or predictions according to an intended application (such as an image classification, etc.).
In implementations, functions described herein to be performed by the over‐parameterization system may be performed by multiple separate units or services. Moreover, although in the examples described herein, the over‐parameterization system may be implemented as a combination of software and hardware implemented and distributed in multiple devices, in other examples, the over‐parameterization system may be implemented and distributed as services provided in one or more computing devices over a network and/or in a cloud computing architecture.
The application describes multiple and varied embodiments and implementations. The following section describes an example framework that is suitable for practicing various implementations. Next, the application describes example systems, devices, and processes for implementing an over‐parameterization system.
Example Environment
FIG. 1 illustrates an example environment 100 usable to implement an over‐parameterization system. The environment 100 may include an over‐parameterization system 102. In implementations, the over‐parameterization system 102 may include a plurality of servers 104‐1, 104‐2, …, 104‐N (which are collectively referred to as the servers 104) . The servers 104 may communicate data with one another via a network 106.
In implementations, each of the servers 104 may be implemented as any of a variety of computing devices, including, but not limited to, a desktop computer, a notebook or portable computer, a handheld device, a netbook, an Internet appliance, a tablet or slate computer, a mobile device (e.g., a mobile phone, a personal digital assistant, a smart phone, etc.) , a server computer, etc., or a combination thereof.
The network 106 may be a wireless or a wired network, or a combination thereof. The network 106 may be a collection of individual networks interconnected with each other and functioning as a single large network (e.g., the Internet or an intranet) . Examples of such individual networks include, but are not limited to, telephone networks, cable networks, Local Area Networks (LANs) , Wide Area Networks (WANs) , and Metropolitan Area Networks (MANs) . Further, the individual networks may be wireless or wired networks, or a combination thereof. Wired networks may include an electrical carrier connection (such as a communication cable, etc.) and/or an optical carrier or connection (such as an optical fiber connection, etc.) . Wireless networks may include, for example, a WiFi network, other radio frequency networks (e.g., Zigbee, etc.) , etc.
In implementations, the environment 100 may further include a client device 108. The client device 108 may be implemented as any of a variety of computing devices, including, but not limited to, a desktop computer, a notebook or portable computer, a handheld device, a netbook, an Internet appliance, a tablet or slate computer, a mobile device (e.g., a mobile phone, a personal digital assistant, a smart phone, etc.) , a server computer, etc., or a combination thereof.
In implementations, the over‐parameterization system 102 may receive a request for training a neural network (such as a convolutional neural network) from the client device 108. The over‐parameterization system 102 may then perform training of the neural network according to the request from the client device 108.
Example Over‐Parameterization System
FIG. 2A illustrates the over‐parameterization system 102 in more detail. In implementations, the over‐parameterization system 102 may include, but is not limited to, one or more processors 202, an input/output (I/O) interface 204, and/or a network interface 206, and memory 208. In implementations, some of the functions of the over‐parameterization system 102 may be implemented using hardware, for example, an ASIC (i.e., Application‐Specific Integrated Circuit) , an FPGA (i.e., Field‐Programmable Gate Array) , and/or other hardware.
In implementations, the processors 202 may be configured to execute instructions that are stored in the memory 208, and/or received from the I/O interface 204, and/or the network interface 206. In implementations, the processors 202 may be implemented as one or more hardware processors including, for example, a microprocessor, an application‐specific instruction‐set processor, a physics processing unit (PPU) , a central processing unit (CPU) , a graphics processing unit, a digital signal processor, a tensor processing unit, a neural processing unit, etc. Additionally or alternatively, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without  limitation, illustrative types of hardware logic components that can be used include field‐programmable gate arrays (FPGAs) , application‐specific integrated circuits (ASICs) , application‐specific standard products (ASSPs) , system‐on‐a‐chip systems (SOCs) , complex programmable logic devices (CPLDs) , etc.
The memory 208 may include computer readable media in a form of volatile memory, such as Random Access Memory (RAM) and/or non‐volatile memory, such as read only memory (ROM) or flash RAM. The memory 208 is an example of computer readable media.
The computer readable media may include a volatile or non‐volatile type, a removable or non‐removable media, which may achieve storage of information using any method or technology. The information may include a computer readable instruction, a data structure, a program module or other data. Examples of computer readable media include, but are not limited to, phase‐change memory (PRAM) , static random access memory (SRAM) , dynamic random access memory (DRAM) , other types of random‐access memory (RAM) , read‐only memory (ROM) , electronically erasable programmable read‐only memory (EEPROM) , quick flash memory or other internal storage technology, compact disk read‐only memory (CD‐ROM) , digital versatile disc (DVD) or other optical storage, magnetic cassette tape, magnetic disk storage or other magnetic storage devices, or any other non‐transmission media, which may be used to store information that may be accessed by a computing device. As defined herein, the computer readable media does not include any transitory media, such as modulated data signals and carrier waves.
Although in this example, only hardware components are described in  the over‐parameterization system 102, in other instances, the over‐parameterization system 102 may further include other hardware components and/or other software components such as program units to execute instructions stored in the memory 208 for performing various operations. For example, the over‐parameterization system 102 may further include a parameter database 210 for storing parameter data of one or more machine learning models (such as neural network models, etc. ) , a training database 212 for storing training data, and other program data 214.
FIG. 2B illustrates an example neural network processing architecture 216 that can be used for implementing the over‐parameterization system 102. In implementations, the neural network processing architecture 216 (such as the architecture used for a neural processing unit) may include a heterogeneous computation unit (HCU) 218, a host unit 220, and a host memory 222. The heterogeneous computation unit 218 may include a special‐purpose computing device or hardware used for facilitating and performing neural network computing tasks. By way of example and not limitation, the heterogeneous computation unit 218 may perform algorithmic operations including operations associated with machine learning algorithms. In implementations, the heterogeneous computation unit 218 may be an accelerator, which may include, but is not limited to, a neural network processing unit (NPU) , a graphic processing unit (GPU) , a tensor processing unit (TPU) , a microprocessor, an application‐specific instruction‐set processor, a physics processing unit (PPU) , a digital signal processor, etc.
In implementations, the heterogeneous computation unit 218 may include one or more computing units 224, a memory hierarchy 226, a controller 228, and an interconnect unit 230. The computing unit 224 may access the memory hierarchy 226 to read and write data in the memory hierarchy 226, and may further perform operations, such as arithmetic operations (e.g., multiplication, addition, multiply‐accumulate, etc. ) on the data. In implementations, the computing unit 224 may further include a plurality of engines that are configured to perform various types of operations. By way of example and not limitation, the computing unit 224 may include a scalar engine 232 and a vector engine 234. The scalar engine 232 may perform scalar operations such as scalar product, convolution, etc. The vector engine 234 may perform vector operations such as vector addition, vector product, etc.
In implementations, the memory hierarchy 226 may include an on‐chip memory (such as 4 blocks of 8GB second generation of high bandwidth memory (HBM2) ) to serve as a main memory. The memory hierarchy 226 may be configured to store data and executable instructions, and allow other components of the neural network processing architecture 216 (e.g., the heterogeneous computation unit (HCU) 218, and the host unit 220) , the heterogeneous computation unit 218 (e.g., the computing units 224 and the interconnect unit 230) , and/or a device external to the neural network processing architecture 216 to access the stored data and/or the stored instructions with high speed, for example.
In implementations, the interconnect unit 230 may provide or facilitate communications of data and/or instructions between the heterogeneous  computation unit 218 and other devices or units (e.g., the host unit 220, one or more other HCU (s) ) that are external to the heterogeneous computation unit 218. In implementations, the interconnect unit 230 may include a peripheral component interconnect express (PCIe) interface 236 and an inter‐chip connection 238. The PCIe interface 236 may provide or facilitate communications of data and/or instructions between the heterogeneous computation unit 218 and the host unit 220. The inter‐chip connection 238 may serve as an inter‐chip bus to connect the heterogeneous computation unit 218 with other devices, such as other HCUs, an off‐chip memory, and/or peripheral devices.
In implementations, the controller 228 may be configured to control and coordinate operations of other components included in the heterogeneous computation unit 218. For example, the controller 228 may control and coordinate different components in the heterogeneous computation unit 218 (such as the scalar engine 232, the vector engine 234, and/or the interconnect unit 230) to facilitate parallelism or synchronization among these components.
In implementations, the host memory 222 may be an off‐chip memory such as a memory of one or more processing units of a host system or device that includes the neural network processing architecture 216. In implementations, the host memory 222 may include a DDR memory (e.g., DDR SDRAM) or the like, and may be configured to store a large amount of data with slower access speed, as compared to an on‐chip memory that is integrated within the one or more processing units, to act as a higher‐level cache.
In implementations, the host unit 220 may include one or more processing units (e.g., an X86 central processing unit (CPU) ) . In implementations, the host system or device having the host unit 220 and the host memory 222 may further include a compiler (not shown) . The compiler may be a program or computer software configured to convert computer codes written in a certain programming language into instructions that are readable and executable by the heterogeneous computation unit 218. In machine learning applications, the compiler may perform a variety of operations, which may include, but are not limited to, pre‐processing, lexical analysis, parsing, semantic analysis, conversion of an input program to an intermediate representation, code optimization, and code generation, or any combination thereof.
FIG. 2C illustrates an example cloud system 240 that incorporates the neural network processing architecture 216 to implement the over‐parameterization system 102. The cloud system 240 may provide cloud services with machine learning and artificial intelligence (AI) capabilities, and may include a plurality of servers, e.g., servers 242‐1, 242‐2, and 242‐K (which are collectively referred to as the servers 242) , where K is a positive integer. In implementations, one or more of the servers 242 may include the neural network processing architecture 216. Using the neural network processing architecture 216, the cloud system 240 may provide part or all of the functionalities of the over‐parameterization system 102, and other machine learning and artificial intelligence capabilities such as image recognition, facial recognition, translations, 3D modeling, etc.
In implementations, although the cloud system 240 is described above, in some instances, the neural network processing architecture 216 that provides some or all of the functionalities of the over‐parameterization system 102 may be deployed in other types of computing devices, which may include, but are not limited to, a mobile device, a tablet computer, a wearable device, a desktop computer, etc.
Example Neural Network Model
FIG. 3 illustrates an example neural network model 300 (or simply a neural network 300) in which an over‐parameterization may be performed. Although a convolutional neural network model is described in this example, the present disclosure may also be applicable to other types of neural network models that involve convolution and/or neural network models having more or fewer types of layers that will be described hereinafter.
In implementations, the neural network 300 may include a plurality of building blocks or distinct types of layers. By way of example and not limitation, the plurality of building blocks or distinct types of layers may include, but are not limited to, one or more convolutional layers 302 (only one is shown for the sake of simplicity) , one or more pooling layers 304 (only one is shown for the sake of simplicity) , and a plurality of fully connected layers 306‐1, …, 306‐S (which are collectively referred to as the fully connected layers 306) , where S is an integer greater than one.
In implementations, an input and an output of a layer may be depicted as a feature map, which may be a tensor in R^(H×W×C) , where H, W, and C represent a height, a width, and a number of channels of the feature map. Dimensions of the height and the width may define a resolution of the feature map, and may be referred to as spatial dimensions. In implementations, an input feature map may be denoted as a tensor F_in ∈ R^(H_in×W_in×C_in) , where H_in, W_in, and C_in represent a height, a width, and a number of channels of the input feature map. An output feature map may be denoted as a tensor F_out ∈ R^(H_out×W_out×C_out) , where H_out, W_out, and C_out represent a height, a width, and a number of channels of the output feature map. In implementations, a layer of the neural network may be defined as a matrix which converts the input tensor F_in to the output tensor F_out, and such matrix may be denoted as a parameter matrix.
In implementations, the fully connected layer 306 may connect each element in an input tensor F_in to each element in an output tensor F_out. This may be achieved through a parameter matrix W ∈ R^((H_out·W_out·C_out)×(H_in·W_in·C_in)) . More specifically, f_out = W·f_in, where f_in ∈ R^(H_in·W_in·C_in) and f_out ∈ R^(H_out·W_out·C_out) are reshaped F_in and F_out, respectively. In implementations, a tensor can be reshaped without the content thereof being changed. For example, an original four‐dimensional tensor T ∈ R^(I×J×K×L) may be reshaped into another three‐dimensional tensor T′ ∈ R^(I×M×L) , where M=J×K can be deduced because T and T′ have the same number of elements, with T (i, j, k, l) in the original four‐dimensional tensor T and T′ (i, j×K+k, l) in the reshaped three‐dimensional tensor T′ corresponding to the same element. In implementations, a computation of a fully connected layer may be a matrix‐vector product. In implementations, a hyper‐parameter involved in a fully connected layer is the number of output elements, i.e., H_out×W_out×C_out.
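For illustration only, the matrix‐vector view of a fully connected layer described above may be sketched in NumPy; the feature‐map sizes below are hypothetical and chosen small for readability:

```python
import numpy as np

# Hypothetical small feature-map sizes for illustration.
H_in, W_in, C_in = 4, 4, 3     # input feature map
H_out, W_out, C_out = 2, 2, 5  # output feature map

rng = np.random.default_rng(0)
F_in = rng.standard_normal((H_in, W_in, C_in))

# Parameter matrix of the fully connected layer:
# shape (H_out*W_out*C_out) x (H_in*W_in*C_in).
W = rng.standard_normal((H_out * W_out * C_out, H_in * W_in * C_in))

f_in = F_in.reshape(-1)                      # reshape input tensor to a vector
f_out = W @ f_in                             # matrix-vector product
F_out = f_out.reshape(H_out, W_out, C_out)   # reshape back to a feature map

print(F_out.shape)  # (2, 2, 5)
```

Note how the parameter matrix size grows with both the input and the output resolutions, which is why such a layer is only applied after the spatial resolution has been reduced.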
In implementations, due to the full connection nature of a fully connected layer, a parameter matrix W ∈ R^((H_out·W_out·C_out)×(H_in·W_in·C_in)) for the fully connected layer can be very large, which normally requires a large amount of computing and memory resources. Therefore, the plurality of fully connected layers 306 may be applied at the end of the neural network 300, after a size (e.g., a spatial resolution) of an input feature map to a first fully connected layer of the plurality of fully connected layers 306 is significantly reduced from an initial value (e.g., an original size or resolution of an image in an image recognition application, etc.) at the beginning of the neural network 300.
In implementations, in order to reduce a size of a parameter matrix for each fully connected layer 306 at the end of the neural network 300, the one or more convolutional layers 302 may be applied at the beginning of the neural network 300. In a convolutional layer 302, instead of connecting each element in an input tensor F_in to each element in an output tensor F_out, patches of F_in may be connected to elements in F_out by a convolution operator. For instance, if a patch of a spatial size of M×N that is sampled from a spatial location (h_in, w_in) of F_in is denoted as P ∈ R^(M×N×C_in) , and a parameter matrix of a convolutional layer is denoted as W ∈ R^(C_out×(M×N×C_in)) , an output of the convolutional layer may be depicted as:

F_out (h_out, w_out, c_out) = Σ_(m, n, c_in) W (c_out, m, n, c_in) × P (m, n, c_in)

In implementations, each element in F_out may be connected to one patch in F_in, and thus a size of the parameter matrix may not depend on a spatial resolution of an input feature map of the convolutional layer, and may depend on the size of the patch and input and output channels. As compared with a parameter matrix for a fully connected layer, a parameter matrix of a convolutional layer is much smaller, thus reducing the number of parameters to be trained or optimized.
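For illustration only, the patch‐based computation above may be sketched in NumPy; the sizes are hypothetical, and the sketch assumes a stride of 1 with no padding:

```python
import numpy as np

rng = np.random.default_rng(0)
H_in, W_in, C_in = 6, 6, 3
M, N, C_out = 3, 3, 4                        # 3x3 patches, 4 output channels

F_in = rng.standard_normal((H_in, W_in, C_in))
W = rng.standard_normal((C_out, M, N, C_in))  # parameter tensor of the layer

H_out, W_out = H_in - M + 1, W_in - N + 1     # stride 1, no padding
F_out = np.zeros((H_out, W_out, C_out))
for h in range(H_out):
    for w in range(W_out):
        patch = F_in[h:h + M, w:w + N, :]     # M x N x C_in patch
        # Each output element is an inner product of the patch
        # with one of the C_out filters of W.
        F_out[h, w] = np.tensordot(W, patch, axes=([1, 2, 3], [0, 1, 2]))

print(F_out.shape)  # (4, 4, 4)
```

The same filters are applied at every spatial location, which is why the parameter count is independent of the input resolution.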
In implementations, hyper‐parameters involved in a convolutional layer may include a patch size or kernel size M×N, a dilation rate a×b, a stride s×t, and the number of output channels C_out. In implementations, the patch size or kernel size may define a shape and a size of a patch, whereas the dilation rate may refer to a gap between elements that are extracted from an input feature map to form a patch. A combination of the kernel size and the dilation rate may define a receptive field of the convolutional layer, i.e., (M+ (M-1) × (a-1) ) × (N+ (N-1) × (b-1) ) , which shows that a larger receptive field can be obtained by increasing the kernel size and/or the dilation rate. In implementations, the larger the receptive field is, the better the capability of the convolutional layer may be. The stride may define a gap between spatial locations from where patches are sampled. The stride may further define a relationship between H_in×W_in and H_out×W_out as H_out×W_out=H_in/s×W_in/t, or, with a padding size p taken into account, H_out=⌊ (H_in+2p- (M+ (M-1) × (a-1) ) ) /s⌋+1 and W_out=⌊ (W_in+2p- (N+ (N-1) × (b-1) ) ) /t⌋+1. In other words, the stride may define a mapping between (h_out, w_out) and (h_in, w_in) , i.e., to which patch each element in an output feature map is connected. In implementations, a convolutional layer having a stride that is larger than 1 may result in a lower spatial resolution of an output feature map of the convolutional layer as compared to that of an input feature map of the convolutional layer.
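For illustration only, the spatial‐size relationship above may be sketched as a small helper function; the function name and the example sizes are hypothetical, not part of the disclosure:

```python
def conv_output_size(in_size, kernel, stride, dilation=1, padding=0):
    """Output spatial size of a convolution along one dimension.

    Uses the effective (dilated) kernel size
    kernel + (kernel - 1) * (dilation - 1).
    """
    effective_kernel = kernel + (kernel - 1) * (dilation - 1)
    return (in_size + 2 * padding - effective_kernel) // stride + 1

# A 3x3 kernel with stride 2 and padding 1 halves a 224-wide input.
print(conv_output_size(224, kernel=3, stride=2, padding=1))  # 112
```

With a dilation rate of 2 and padding of 2, the same 3‐wide kernel at stride 1 preserves the input size, since the effective kernel width becomes 5.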
In implementations, the pooling layer 304 may be configured to  gradually reduce a size of a representation space and thus to reduce the number of parameters and an amount of computations in the neural network 300, thereby controlling or avoiding over‐fitting. In implementations, the pooling layer 304 may be a parameter‐free layer, and perform a calculation of a predefined or fixed function on its input according to a type of the pooling layer 304. For example, a max‐pooling layer may perform a maximum operation on its input, and an average‐pooling layer may perform an average (AVG) operation on its input. In implementations, the pooling layer 304 may operate on each feature map independently.
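For illustration only, parameter‐free max‐pooling and average‐pooling may be sketched in NumPy; the 2×2 window, the stride of 2, and the input values below are hypothetical:

```python
import numpy as np

def pool2x2(feature_map, mode="max"):
    """Parameter-free 2x2 pooling with stride 2, applied per channel.

    feature_map: array of shape (H, W, C) with even H and W.
    """
    H, W, C = feature_map.shape
    # Group each non-overlapping 2x2 window, then reduce over the window.
    windows = feature_map.reshape(H // 2, 2, W // 2, 2, C)
    if mode == "max":
        return windows.max(axis=(1, 3))
    return windows.mean(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4, 1)
print(pool2x2(x, "max")[:, :, 0].tolist())  # [[5.0, 7.0], [13.0, 15.0]]
print(pool2x2(x, "avg")[:, :, 0].tolist())  # [[2.5, 4.5], [10.5, 12.5]]
```

Either variant reduces a 4×4 feature map to 2×2, shrinking the representation without introducing any trainable parameters.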
In implementations, due to the non‐linear nature of a number of real‐world problems, the neural network 300 may further include one or more activation layers 308 (or called as one or more rectified linear unit layers 308) to enable the neural network 300 to model such non‐linear problems (FIG. 3 shows only one activation layer for the sake of simplicity) . In implementations, each activation layer 308 may apply a non‐linear activation function on its input to achieve a non‐linear input‐to‐output conversion. Examples of the non‐linear activation function that may be used by the activation layer 308 may include, but are not limited to, a sigmoid function (e.g., f (x) =1/ (1+e^ (-x) ) ) , a tanh function (e.g., f (x) = (e^x-e^ (-x) ) / (e^x+e^ (-x) ) ) , a relu function (e.g., f (x) =max (0, x) ) , etc.
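For illustration only, the three activation functions named above may be sketched directly from their formulas in NumPy:

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)), squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # f(x) = (e^x - e^(-x)) / (e^x + e^(-x)), squashes inputs into (-1, 1)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    # f(x) = max(0, x), zeroes out negative inputs
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))  # values in (0, 1), with sigmoid(0) = 0.5
print(tanh(x))     # values in (-1, 1), with tanh(0) = 0
print(relu(x))     # [0. 0. 2.]
```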
In implementations, the neural network 300 may further include one or more batch normalization layers 310 (FIG. 3 shows only one batch normalization layer for the sake of simplicity) . Each batch normalization layer 310 may be configured to normalize an output of a previous layer of the neural network 300 by subtracting a batch mean and dividing the output by a batch standard deviation, thus increasing the stability of the neural network 300. In implementations, the batch normalization layer 310 may add two trainable parameters (a standard deviation parameter and a mean parameter) to each layer, so that a normalized output may be multiplied by the standard deviation parameter and added with the mean parameter.
In implementations, the batch normalization layer 310 may be configured to normalize a distribution of each element of an input (e.g., an input vector) across batch data, and to reduce over‐fitting through regularization effects, which can alleviate internal covariate shift problems.
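For illustration only, the batch normalization computation above may be sketched in NumPy; the batch shape and the choice of gamma (the trainable scale) and beta (the trainable shift) are hypothetical:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each element across the batch dimension (axis 0),
    then scale by gamma and shift by beta (the two trainable parameters)."""
    mean = x.mean(axis=0)
    std = np.sqrt(x.var(axis=0) + eps)
    x_hat = (x - mean) / std
    return gamma * x_hat + beta

batch = np.random.default_rng(1).standard_normal((32, 8))  # 32 samples, 8 features
out = batch_norm(batch, gamma=np.ones(8), beta=np.zeros(8))

# With gamma = 1 and beta = 0, each feature becomes
# approximately zero-mean and unit-variance across the batch.
print(out.mean(axis=0))
```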
In implementations, the over‐parameterization system 102 may perform training of the neural network 300 to optimize model parameters of the neural network 300 in order to achieve a desired performance (e.g., a high accuracy of inference or prediction) of the neural network 300. In implementations, the training of the neural network 300 may include a process of finding desired or optimal parameters of the neural network 300 in a predefined hypothesis space to obtain a desired or optimal performance. In implementations, the over‐parameterization system 102 may select an initialization method, and initialize the parameters of the neural network 300 according to the selected initialization method. By way of example and not limitation, the initialization method may include, but is not limited to, initializing the parameters of the neural network 300 with constants (e.g., zeros, ones, or a specified constant) , initializing the parameters of the neural network 300 with random values from a predefined distribution (e.g., a normal distribution, a uniform distribution, etc.) , initializing the parameters of the neural network 300 based on a specified initialization (such as a Xavier initialization, a He initialization, etc.) , etc.
In implementations, after initializing the parameters of the neural network 300, the over‐parameterization system 102 may perform forward propagation or feed‐forward propagation to pass training inputs (such as a plurality of training images with known objects in image or object recognition, for example) through the neural network 300 and obtain respective estimated outputs from the neural network 300. The over‐parameterization system 102 may then compute the performance of the neural network 300 based on a loss function or an error function (e.g., an accuracy of the neural network 300 based on a comparison between the estimated outputs and the corresponding known objects of the plurality of training images in the image or object recognition in this example) .
In implementations, the over‐parameterization system 102 may compute a derivative of the loss function, for example, to determine error information of the training inputs that is obtained under current values of the parameters of the neural network 300. The over‐parameterization system 102 may perform backward propagation or back‐propagation to propagate the error information backward through the neural network 300, and adjust or update the values of the parameters of the neural network 300 according to a gradient descent algorithm, for example. The over‐parameterization system 102 may continue to iterate the foregoing operations (i.e., from performing the forward propagation to adjusting or updating the values of the parameters of the neural network 300) until the values of the parameters of the neural network 300 converge, or until a predefined number of iterations is reached.
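For illustration only, the iterate‐until‐convergence procedure above (forward propagation, loss computation, derivative of the loss, and gradient‐descent updates) may be sketched on a toy single‐layer model in NumPy; all sizes, the learning rate, and the iteration count are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: recover W_true from (input, target) pairs.
W_true = rng.standard_normal((3, 5))
X = rng.standard_normal((64, 5))           # training inputs
Y = X @ W_true.T                           # known targets

W = rng.standard_normal((3, 5)) * 0.1      # randomly initialized parameters
lr = 0.5                                   # learning rate
for step in range(200):
    pred = X @ W.T                         # forward propagation
    err = pred - Y                         # error under current parameters
    loss = (err ** 2).mean()               # loss function (mean squared error)
    grad = 2.0 * err.T @ X / err.size      # derivative of the loss w.r.t. W
    W -= lr * grad                         # gradient-descent update

print(round(loss, 6))  # 0.0 after convergence
```

A full network repeats the same loop, with back‐propagation distributing the derivative of the loss across all layers.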
In implementations, after the neural network 300 is trained and desired or optimal values of the parameters of the neural network 300 are obtained, the over‐parameterization system 102 may allow the neural network 300 to perform inferences or predictions for new inputs, e.g., new images with objects to be classified, for example.
Example Types of Over‐Parameterization
In implementations, the over‐parameterization system 102 may represent a single parameter matrix W of a certain layer (for example, a convolutional layer, or a fully connected layer, etc.) of a neural network as a multiplication of two parameter matrices, e.g., D and W′. In other words, W=D·W′. W′ denotes a vanilla matrix of the layer, and has a same shape (i.e., a same number of rows and a same number of columns) as that of W. The other parameter matrix D is an over‐parameterization matrix. In implementations, an over‐parameterization matrix may be a left‐multiplication matrix or a right‐multiplication matrix. Both W and D·W′ represent the same underlying linear transformation, transforming f_in to f_out. Representing W as D and W′ is considered as over‐parameterization because the total number of parameters included in D and W′ is apparently more than that of W.
In implementations, for a convolutional layer of a neural network, a parameter matrix
Figure PCTCN2020097221-appb-000055
may be reshaped from a tensor in 
Figure PCTCN2020097221-appb-000056
In implementations, different channels of a parameter matrix of a convolutional layer may be over-parameterized, leading to different types of channel-wise over-parameterization, such as an in-channel-wise over-parameterization, an out-channel-wise over-parameterization, and an all-channel-wise over-parameterization, for example.
In implementations, in an over-parameterized layer (such as an over-parameterized convolutional layer), the matrices (e.g., P and W') involved in over-parameterization may be optimized simultaneously in a training phase of the neural network, and may be combined together into a single parameter matrix (e.g., W = P × W') after the training. As a result, W, instead of P and W', may be used for performing inferences in an inference phase, thus resulting in a same amount of computations as that of a conventional layer without over-parameterization. In other words, over-parameterization does not increase any amount of computations in the inference phase.
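The train-with-two-matrices, fold-to-one-matrix-for-inference idea can be sketched as follows (a NumPy sketch with illustrative shapes, not values from the disclosure):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 18                          # illustrative shape of the parameter matrix W

# Training-time parameters: a vanilla matrix W' and an over-parameterization matrix P.
W_vanilla = rng.normal(size=(m, n))
P = rng.normal(size=(m, m))           # left-multiplication over-parameterization matrix

# More trainable parameters than a plain m-by-n matrix, but the same transformation.
assert P.size + W_vanilla.size > m * n

# After training, fold the two matrices into the single inference-time matrix W.
W_folded = P @ W_vanilla

# Inference with W_folded matches the training-time two-matrix computation exactly,
# so the inference-phase cost equals that of a layer without over-parameterization.
x = rng.normal(size=n)
assert np.allclose(W_folded @ x, P @ (W_vanilla @ x))
```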
In implementations, the over‐parameterization system 102 may perform a variety of different types of over‐parameterization for a layer (a fully connected layer or a convolutional layer) of the neural network. The variety of different types of over‐parameterization may include, but is not limited to, a full‐row over‐parameterization, a full‐column over‐parameterization, a channel‐wise over‐parameterization (which may include an in‐channel‐wise over‐parameterization, an out‐channel‐wise over‐parameterization, and an all‐channel‐wise over‐parameterization, for example), a depth‐wise over‐parameterization, etc.
For example, the full-row over-parameterization may include an over-parameterization operating on an entire row of a parameter matrix of a layer (e.g., a fully connected layer or a convolutional layer) of a neural network, i.e., W = W' × P, where W, W' ∈ R^(m×n) and the square over-parameterization matrix P ∈ R^(n×n) transforms each entire row of the vanilla matrix W'. The full-column over-parameterization may include an over-parameterization operating on an entire column of a parameter matrix of a layer (e.g., a fully connected layer or a convolutional layer) of a neural network, i.e., W = P × W', where the square over-parameterization matrix P ∈ R^(m×m) transforms each entire column of the vanilla matrix W'.
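One plausible reading of the two variants, sketched in NumPy (which side the square over-parameterization matrix multiplies on is an assumption of this sketch, not taken from the disclosure):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 6
W_vanilla = rng.normal(size=(m, n))   # vanilla matrix W'

# Full-row: each entire row of W' is transformed by one n-by-n square matrix.
P_row = rng.normal(size=(n, n))
W_full_row = W_vanilla @ P_row

# Full-column: each entire column of W' is transformed by one m-by-m square matrix.
P_col = rng.normal(size=(m, m))
W_full_col = P_col @ W_vanilla

# Both variants keep the shape of the original parameter matrix.
assert W_full_row.shape == W_full_col.shape == (m, n)
```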
For example, the in-channel-wise over-parameterization may include an over-parameterization that operates only on a C_in channel part of a parameter matrix (with dimension of C_out × (M×N×C_in)) of a convolutional layer. For example, the in-channel-wise over-parameterization may be expressed as follows:

W_(co, (m, n, ·)) = W'_(co, (m, n, ·)) × P^in, for each output channel co and each spatial position (m, n),

where W is over-parameterized with an over-parameterization matrix P^in ∈ R^(C_in×C_in) and a vanilla matrix W' ∈ R^(C_out×(M×N×C_in)).
In implementations, the out-channel-wise over-parameterization may include an over-parameterization that operates only on a C_out channel part of a parameter matrix (with dimension of C_out × (M×N×C_in)) of a convolutional layer. For example, the out-channel-wise over-parameterization may be expressed as follows:

W = P^out × W',

where W is over-parameterized with an over-parameterization matrix P^out ∈ R^(C_out×C_out) and a vanilla matrix W' ∈ R^(C_out×(M×N×C_in)).
In implementations, a parameter matrix of a convolutional layer may be over-parameterized by more than one over-parameterization matrix. By way of example and not limitation, the all-channel-wise over-parameterization may include an over-parameterization that operates on both C_in and C_out channel parts of a parameter matrix (with dimension of C_out × (M×N×C_in)) of a convolutional layer. For example, the all-channel-wise over-parameterization may be expressed as follows:

W_(·, (m, n, ·)) = P^out × W'_(·, (m, n, ·)) × P^in, for each spatial position (m, n),

where W is over-parameterized with an over-parameterization matrix P^out ∈ R^(C_out×C_out), an over-parameterization matrix P^in ∈ R^(C_in×C_in), and a vanilla matrix W' ∈ R^(C_out×(M×N×C_in)).
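The three channel-wise variants can be sketched with einsum over the kernel viewed as a (C_out, M, N, C_in) tensor (a NumPy sketch; the axis layout and multiplication order are assumptions, not taken from the disclosure):

```python
import numpy as np

rng = np.random.default_rng(2)
Cout, M, N, Cin = 4, 3, 3, 5

Wv = rng.normal(size=(Cout, M, N, Cin))     # vanilla parameters, viewed as a 4-D tensor
P_in = rng.normal(size=(Cin, Cin))          # in-channel-wise over-parameterization matrix
P_out = rng.normal(size=(Cout, Cout))       # out-channel-wise over-parameterization matrix

# In-channel-wise: operate only on the C_in part, per (co, m, n).
W_in = np.einsum('omnc,cd->omnd', Wv, P_in)

# Out-channel-wise: operate only on the C_out part.
W_out = np.einsum('po,omnc->pmnc', P_out, Wv)

# All-channel-wise: apply both over-parameterization matrices.
W_all = np.einsum('po,omnc,cd->pmnd', P_out, Wv, P_in)

# Every variant preserves the shape of the vanilla parameter tensor.
assert W_in.shape == W_out.shape == W_all.shape == (Cout, M, N, Cin)
```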
In implementations, the depth-wise over-parameterization may include an over-parameterization that operates on spatial dimensions (i.e., within a channel) of a parameter matrix of a convolutional layer. In implementations, the depth-wise over-parameterization may employ a same over-parameterization matrix for each input channel, i.e., P^dw ∈ R^((M×N)×(M×N)) is the same for each input channel. By way of example and not limitation, rather than applying a full-row or full-column over-parameterization (which over-parameterizes an entire row or column of a parameter matrix of a convolutional layer) or applying a channel-wise over-parameterization (which over-parameterizes channel dimension(s) of the parameter matrix of the convolutional layer), a parameter matrix W ∈ R^(C_out×(M×N×C_in)) for a convolutional layer may be over-parameterized on a part that corresponds to spatial dimensions of input patches, i.e., M×N dimensions, with P^dw ∈ R^((M×N)×(M×N)) and a vanilla matrix W' ∈ R^(C_out×(M×N×C_in)) as follows:

W_(co, (·, ci)) = P^dw × W'_(co, (·, ci)), for each output channel co and each input channel ci,

where over-parameterization is applied within each channel.
Alternatively, the depth-wise over-parameterization may employ different over-parameterization matrices for different input channels, i.e., an over-parameterization matrix (e.g., P^dw_ci) of one channel may be different from an over-parameterization matrix (e.g., P^dw_cj) of another channel. For example, a parameter matrix W ∈ R^(C_out×(M×N×C_in)) for a convolutional layer may be over-parameterized with per-channel over-parameterization matrices P^dw_ci ∈ R^((M×N)×(M×N)) and a vanilla matrix W' as follows:

W_(co, (·, ci)) = P^dw_ci × W'_(co, (·, ci)), for each output channel co and each input channel ci,

where over-parameterization is applied within each channel, and different (M×N)×(M×N) matrices are used for different input channels in this case. This type of depth-wise over-parameterization may be called an independent depth-wise over-parameterization.
Alternatively, in implementations, the independent depth‐wise over‐parameterization may employ different over‐parameterization matrices for at least two different input channels.
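A hedged NumPy sketch of the shared and the independent depth-wise variants, with the M×N spatial positions of each kernel slice flattened into one axis (shapes and axis layout are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
Cout, M, N, Cin = 4, 3, 3, 5
S = M * N                                    # spatial size of a kernel slice

Wv = rng.normal(size=(Cout, S, Cin))         # vanilla parameters, spatial dims flattened

# Shared depth-wise over-parameterization: one (M*N)-by-(M*N) matrix for all channels.
P_dw = rng.normal(size=(S, S))
W_shared = np.einsum('st,otc->osc', P_dw, Wv)

# Independent depth-wise over-parameterization: a different matrix per input channel.
P_dw_c = rng.normal(size=(Cin, S, S))
W_indep = np.einsum('cst,otc->osc', P_dw_c, Wv)

# Both variants only mix values within a channel, and keep the parameter shape.
assert W_shared.shape == W_indep.shape == (Cout, S, Cin)
```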
Example Depth‐wise Convolutional Layer
In implementations, given an input feature map, a convolutional layer of a neural network may process the input feature map in a sliding window manner, which may include applying a set of convolution kernels to a patch having a same size as that of the convolution kernels at each window position. If a patch is denoted as a 2-dimensional tensor P ∈ R^((M×N)×C_in), trainable kernels of a convolutional layer may be represented as a 3-dimensional tensor W ∈ R^((M×N)×C_in×C_out), where M and N are spatial dimensions of the patch, C_in is the number of channels in the input feature map of the convolutional layer, and C_out is the number of channels in an output feature map of the convolutional layer. In a convolutional layer, dot products may be computed between each of the C_out kernels and an entire input patch tensor P. FIG. 4 illustrates an example normal convolution. An output of a normal convolution operator (which is represented as *) may be a C_out-dimensional feature O = W * P ∈ R^(C_out), where O_co = Σ_((m, n), ci) W_((m, n), ci, co) P_((m, n), ci).
In depth-wise convolution, each of the C_in channels of the input patch tensor P may be involved in D_mul separate or individual dot products. In implementations, each input patch channel (i.e., an M×N-dimensional feature) may be transformed into a D_mul-dimensional feature. For the sake of description, D_mul is called a depth multiplier herein. FIG. 5 illustrates an example depth-wise convolution. As shown in FIG. 5, a trainable depth-wise convolution kernel may be represented as a 3-dimensional tensor D ∈ R^((M×N)×D_mul×C_in). Since each input channel may be converted into a D_mul-dimensional feature, an output of a depth-wise convolution operator (which is represented as ∘) may be a D_mul×C_in-dimensional feature O = D ∘ P ∈ R^(D_mul×C_in), where O_(d, ci) = Σ_((m, n)) D_((m, n), d, ci) P_((m, n), ci).
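At a single window position, the two operators reduce to dot products, which can be sketched in NumPy (illustrative shapes; * denotes the normal convolution operator and ∘ the depth-wise one from the text):

```python
import numpy as np

rng = np.random.default_rng(4)
M, N, Cin, Cout, Dmul = 3, 3, 5, 4, 9
S = M * N

patch = rng.normal(size=(S, Cin))            # input patch P, spatial dims flattened
W = rng.normal(size=(S, Cin, Cout))          # normal convolution kernel
D = rng.normal(size=(S, Dmul, Cin))          # depth-wise convolution kernel

# Normal convolution at one window position: one dot product per output channel,
# each taken against the entire input patch tensor.
O_conv = np.einsum('sc,sco->o', patch, W)    # shape (Cout,)

# Depth-wise convolution: each input channel yields Dmul separate dot products.
O_dw = np.einsum('sdc,sc->dc', D, patch)     # shape (Dmul, Cin)

assert O_conv.shape == (Cout,)
assert O_dw.shape == (Dmul, Cin)
```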
In implementations, a convolutional layer that is over-parameterized with a depth-wise over-parameterization (or simply called a depth-wise over-parameterized convolutional layer) may be a composition of a depth-wise convolution with a trainable kernel D ∈ R^((M×N)×D_mul×C_in) and a normal convolution with a trainable kernel W ∈ R^(D_mul×C_in×C_out), where D_mul ≥ M×N. In implementations, given an input patch P ∈ R^((M×N)×C_in), an output of a depth-wise over-parameterization operator (which is represented as ⊛) may be the same as that of a convolutional layer, a C_out-dimensional feature O = (D, W) ⊛ P ∈ R^(C_out). FIGS. 6A and 6B illustrate two example ways of a depth-wise over-parameterization operator. As shown in FIGS. 6A and 6B, the depth-wise over-parameterization operator may be applied in two mathematically equivalent ways as follows:

(D, W) ⊛ P = W * (D ∘ P) = (D^T ∘ W) * P,

where D^T ∈ R^(D_mul×(M×N)×C_in) is a transpose of D on a first axis and a second axis.
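The equivalence of the two application orders can be checked numerically (a NumPy sketch with illustrative shapes; the einsum subscripts encode an assumed axis layout):

```python
import numpy as np

rng = np.random.default_rng(5)
M, N, Cin, Cout, Dmul = 3, 3, 5, 4, 9         # Dmul >= M*N
S = M * N

patch = rng.normal(size=(S, Cin))             # input patch P
D = rng.normal(size=(S, Dmul, Cin))           # depth-wise kernel
W = rng.normal(size=(Dmul, Cin, Cout))        # normal-convolution kernel

# Feature composition: O = W * (D o P).
feat = np.einsum('sdc,sc->dc', D, patch)      # depth-wise conv, shape (Dmul, Cin)
O_feature = np.einsum('dc,dco->o', feat, W)

# Kernel composition: O = (D^T o W) * P.
D_T = D.transpose(1, 0, 2)                    # transpose on the first and second axes
W_folded = np.einsum('dsc,dco->sco', D_T, W)  # collapsed kernel, shape (S, Cin, Cout)
O_kernel = np.einsum('sc,sco->o', patch, W_folded)

# Both ways produce the same Cout-dimensional output feature.
assert np.allclose(O_feature, O_kernel)
```

The collapsed kernel `W_folded` is also the single matrix used at inference time, since it has the shape of a normal convolution kernel.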
In implementations, the first manner (i.e., O = W * (D ∘ P)) is called a feature composition as shown in FIG. 6A, and involves first applying the trainable kernel D to the input patch P by a depth-wise convolution operator ∘ to obtain a transformed feature P' = D ∘ P ∈ R^(D_mul×C_in), and then applying the trainable kernel W to the transformed feature P' by a normal convolution operator * to obtain O = W * P'. The second manner (i.e., O = (D^T ∘ W) * P) is called a kernel composition as shown in FIG. 6B, and involves first applying the trainable kernel D^T to transform W by a depth-wise convolution operator ∘ to obtain W' = D^T ∘ W ∈ R^((M×N)×C_in×C_out), and then applying a normal convolution operator * between W' and the input patch P to obtain O = W' * P.
In implementations, a receptive field of a depth-wise over-parameterized convolutional layer is M×N, and an interface of a depth-wise over-parameterized convolutional layer is the same as an interface of a normal convolutional layer. Therefore, a depth-wise over-parameterized convolutional layer may easily replace a normal convolutional layer in a neural network. Since a depth-wise over-parameterization operator is differentiable, both D and W of a depth-wise over-parameterized convolutional layer may be optimized using, for example, a gradient descent based optimizer that is commonly used for training neural networks (such as CNNs) with convolutional layers. After training of the neural network is completed, D and W may be combined to obtain a single matrix W' = D^T ∘ W, and this single matrix W' may then be used for making inferences. Since W' has the same shape as that of a kernel of a convolutional layer, computation of the depth-wise over-parameterized convolutional layer at an inference phase is the same as that of a normal convolutional layer.
In implementations, the feature composition and the kernel composition used for performing the depth-wise over-parameterization operator may lead to different training efficiencies of a depth-wise over-parameterized convolutional layer, and hence of a neural network that is involved. By way of example and not limitation, if the number of multiply and accumulate operations (MACC) is used as a metric for measuring or determining an amount of computations and serves as an efficiency indicator, respective MACC costs for the feature composition and the kernel composition, when being applied on a feature map in R^(H×W×C_in) (where H and W are the height and the width of the feature map), may be calculated as follows:

Feature composition: H×W×D_mul×C_in×(M×N + C_out)

Kernel composition: M×N×D_mul×C_in×C_out + H×W×M×N×C_in×C_out
As can be seen from the above, the MACC costs for the feature composition and the kernel composition depend on values of hyper-parameters that are involved. Since H×W >> C_out and D_mul >> M×N, the kernel composition may generally incur fewer MACC operations as compared to the feature composition, and an amount of memory consumed by the collapsed kernel D^T ∘ W in the kernel composition may normally be smaller than that consumed by the transformed features D ∘ P in the feature composition. Therefore, the kernel composition may be selected for performing the depth-wise over-parameterization operator when training the neural network.
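The comparison can be made concrete for one set of hyper-parameter values (the values, and the cost model itself, are illustrative assumptions of this sketch: the feature composition runs a depth-wise convolution at every window position, while the kernel composition folds the kernels once and then runs one normal convolution):

```python
# Illustrative hyper-parameter values (assumptions, not taken from the disclosure).
H, W_map, M, N, C_in, C_out, D_mul = 56, 56, 3, 3, 64, 64, 9

# Feature composition: a depth-wise convolution at every window position of the
# H x W feature map, followed by a normal convolution on the transformed features.
macc_feature = H * W_map * D_mul * C_in * (M * N + C_out)

# Kernel composition: fold the kernels once, then run one normal convolution.
macc_kernel = M * N * D_mul * C_in * C_out + H * W_map * M * N * C_in * C_out

# The one-off kernel folding is cheap relative to the per-position work,
# so the kernel composition incurs fewer MACC operations here.
assert macc_kernel < macc_feature
```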
In implementations, in addition to applying over a normal convolution to obtain a depth-wise over-parameterized convolutional layer, the depth-wise over-parameterization may further be allowed to apply over a depth-wise convolution, which leads to a depth-wise over-parameterized depth-wise convolutional layer. Similar to the principles used for obtaining a depth-wise over-parameterized convolutional layer, FIGS. 7A and 7B illustrate example ways of obtaining a depth-wise over-parameterized depth-wise convolutional layer. In implementations, an operator (which is represented as ⊛') of depth-wise over-parameterization over depth-wise convolution may be obtained or computed in two mathematically equivalent ways as follows:

(D, W) ⊛' P = W ∘ (D ∘ P) = (D^T ∘ W) ∘ P
In implementations, training of a neural network including a depth-wise over-parameterized depth-wise convolutional layer may be similar to training of a neural network including a depth-wise over-parameterized convolutional layer, and both D and W of the depth-wise over-parameterized depth-wise convolutional layer may be optimized using, for example, a gradient descent based optimizer that is commonly used for training neural networks (such as CNNs) with convolutional layers. After training of the neural network is completed, D and W may be combined to obtain a single matrix W' = D^T ∘ W, which may then be used for making inferences.
Example Methods
FIG. 8 shows a schematic diagram depicting an example method of over‐parameterization. The method of FIG. 8 may, but need not, be implemented in the environment of FIG. 1 and using the system of FIG. 2, with reference to the neural network of FIG. 3, and the convolutions of FIGS. 4‐7. For ease of explanation, method 800 is described with reference to FIGS. 1‐7. However, the method 800 may alternatively be implemented in other environments and/or using other systems.
The method 800 is described in the general context of computer‐executable instructions. Generally, computer‐executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, and the like that perform particular functions or implement particular abstract data types. Furthermore, each of the example methods is illustrated as a collection of blocks in a logical flow graph representing a sequence of operations that can be implemented in hardware, software, firmware, or a combination thereof. The order in which the method is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method, or alternate methods. Additionally, individual blocks may be omitted from the method without departing from the spirit and scope of the subject matter described herein. In the context of software, the blocks represent computer instructions that, when executed by one or more processors, perform the recited operations. In the context of hardware, some or all of the blocks may represent application specific integrated circuits (ASICs) or other physical components that perform the recited operations.
Referring back to FIG. 8, at block 802, the over‐parameterization system 102 may obtain information of a neural network that includes one or more convolutional layers and one or more other layers.
In implementations, the over‐parameterization system 102 may receive or obtain information of a neural network to be trained from a database (such as the parameter database 210) , or a client device (such as the client device 108) . In implementations, the information of the neural network may include, but is not limited to, a type of the neural network, hyper‐parameters of the neural network, initial values of the hyper‐parameters of the neural network, trainable parameters of the neural network, a structure (such as the number of layers, types of layers, etc. ) , etc.
In implementations, after receiving or obtaining the information of the neural network, the over‐parameterization system 102 may initialize the trainable parameters of the neural network randomly or based on a priori  knowledge. In implementations, the over‐parameterization system 102 or the database 210 may store information of one or more trained neural networks that are similar to the neural network to be trained. The over‐parameterization system 102 may initialize the trainable parameters of the neural network based at least in part on the information of the one or more trained neural networks that are similar to the neural network to be trained.
In implementations, the neural network may include, but is not limited to, one or more convolutional layers, one or more fully connected layers, one or more pooling layers, one or more activation layers, one or more batch normalization layers, etc. In implementations, examples of the neural network may include, but are not limited to, a convolutional neural network, or any neural network having one or more convolutional layers, etc.
At block 804, the over‐parameterization system 102 may perform depth‐wise over‐parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth‐wise over‐parameterization convolutional layer.
In implementations, the over‐parameterization system 102 may perform depth‐wise over‐parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth‐wise over‐parameterization convolutional layer. In implementations, a number of parameters that are associated with the depth‐wise over‐parameterization convolutional layer and are trainable is higher as compared to a number of parameters of the at least one convolutional layer. By way of example and not limitation, the over‐parameterization system 102 may transform a parameter matrix associated with the at least one convolutional layer into at least two separate matrices, and associate the at least two separate matrices with the depth‐wise over‐parameterization convolutional layer. In implementations, the parameter matrix associated with the at least one convolutional layer may include a plurality of channels representing a color space.
In implementations, the over‐parameterization system 102 may perform over‐parameterization on spatial dimensions of a parameter matrix associated with the at least one convolutional layer to obtain at least two separate matrices according to a depth‐wise over‐parameterization as described in the foregoing description.
Additionally, the over‐parameterization system 102 may perform the over‐parameterization independently on different channels of the plurality of channels to obtain different over‐parameterization matrices for the different channels. In implementations, the over‐parameterization system 102 may perform an identical over‐parameterization on the different channels to obtain a same over‐parameterization matrix for the different channels.
Additionally, the over‐parameterization system 102 may further perform channel‐wise over‐parameterization (such as in‐channel‐wise over‐parameterization, out‐channel‐wise over‐parameterization, or all‐channel‐wise over‐parameterization, etc. ) on the at least one convolutional layer and/or another convolutional layer of the one or more convolutional layers.
At block 806, the over‐parameterization system 102 may train the neural network using training data according to a training method.
After performing the over‐parameterization on the at least one convolutional layer, the over‐parameterization system 102 may obtain training data, and train the neural network using the training data according to a training method. In implementations, the over‐parameterization system 102 may obtain training data from the training database 212, or from a designated storage location indicated or provided by the client device. Depending on the application for which the neural network is intended, different training data may be used. Examples of the application may include, but are not limited to, an image classification, an object detection, or a semantic segmentation, etc.
For example, if the neural network is intended for performing inferences or predictions in an image classification application, the training data may include a plurality of images (which may be color images and/or grayscale images) with known results (e.g., known information of the images, such as respective classes of objects in the images, etc. ) .
In implementations, the training method may include a variety of training or learning algorithms that may be used for training neural networks. Examples of the training method may include, but are not limited to, a backward propagation algorithm, a gradient descent algorithm, or a combination thereof, etc.
In implementations, although additional trainable parameters are added to the depth‐wise over‐parameterized convolutional layer, the speed of convergence for obtaining optimal or desired trainable parameters of the neural network is actually higher, thus increasing the training speed of the neural network. Furthermore, given that initial values of hyper‐parameters of a neural network are the same, the accuracy of a neural network that is trained using a depth‐wise over‐parameterization is found to be higher as compared to the accuracy of a trained neural network without using the depth‐wise over‐parameterization.
At block 808, the over‐parameterization system 102 may selectively combine the parameters associated with the depth‐wise over‐parameterization layer to obtain a new convolutional layer, and replace the depth‐wise over‐parameterization layer by the new convolutional layer in the trained neural network, the trained neural network being configured to perform inferences for a specific application based on input data.
In implementations, after the neural network is trained, the over‐parameterization system 102 may selectively combine the parameters associated with the depth‐wise over‐parameterization layer to obtain a new convolutional layer, and replace the depth‐wise over‐parameterization layer by the new convolutional layer in the trained neural network. By way of example and not limitation, the over‐parameterization system 102 may combine the at least two separate matrices associated with the depth‐wise over‐parameterization convolutional layer into a single parameter matrix, and associate the single parameter matrix with the new convolutional layer. Since the at least two matrices are combined into a single matrix, the trained neural network that is obtained after such combination has lower computation and memory costs, and avoids extra computation and memory costs when the trained neural network is used for performing inferences or predictions in the intended application.
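The combine-and-replace step at block 808 can be sketched as follows (NumPy, with random stand-ins for the trained parameters and an assumed axis layout); folding via the kernel composition described above yields a kernel with the shape of a normal convolution kernel:

```python
import numpy as np

rng = np.random.default_rng(6)
M, N, Cin, Cout, Dmul = 3, 3, 5, 4, 9
S = M * N

# Stand-ins for the trained parameters of the depth-wise over-parameterized layer.
D = rng.normal(size=(S, Dmul, Cin))          # depth-wise kernel
W = rng.normal(size=(Dmul, Cin, Cout))       # normal-convolution kernel

# Combine into the single kernel used by the replacement convolutional layer.
D_T = D.transpose(1, 0, 2)                   # transpose on the first and second axes
W_new = np.einsum('dsc,dco->sco', D_T, W)

# The new kernel has the shape of a normal convolution kernel, so the trained
# network performs inference with no extra computation or memory cost.
assert W_new.shape == (S, Cin, Cout)
```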
For the sake of simplicity, for a detailed description of operations and algorithms (such as training a neural network, operations associated with a depth‐wise over‐parameterization, various types of over‐parameterization, etc.) that are involved in the above method blocks, reference may be made to the foregoing sections. Although the foregoing method blocks describe that one or more types of over‐parameterization (such as a depth‐wise over‐parameterization) are performed on a convolutional layer of a neural network, in some instances, one or more types of over‐parameterization (such as a depth‐wise over‐parameterization, or other types of over‐parameterization as described in the foregoing description) may additionally or alternatively be performed on one or more other types of layers (e.g., a fully connected layer, etc.) of the neural network.
In implementations, some or all of the above method blocks may be implemented or performed by one or more specific processing units of the over‐parameterization system 102. For example, due to a large number of tensor and matrix computations involved in training the neural network, the over‐parameterization system 102 may employ a tensor processing unit, a graphics processing unit, and/or a neural processing unit to perform tensor and matrix computations, thus further improving the performance of the over‐parameterization system 102, and improving the speed of training the neural network in a training phase. In implementations, if the trained neural network is also implemented or used in the over‐parameterization system 102, the over‐parameterization system 102 may further employ such specific processing units to perform tensor and matrix computations involved in making inferences or predictions by the trained neural network in an inference phase.
Although the above method blocks are described to be executed in a particular order, in some implementations, some or all of the method blocks can be executed in other orders, or in parallel.
Conclusion
Although implementations have been described in language specific to structural features and/or methodological acts, it is to be understood that the claims are not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the claimed subject matter. Additionally or alternatively, some or all of the operations may be implemented by one or more ASICs, FPGAs, or other hardware.
The present disclosure can be further understood using the following clauses.
Clause 1: A method implemented by one or more computing devices, the method comprising: obtaining information of a neural network that includes one or more convolutional layers and one or more other layers; performing depth‐wise over‐parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth‐wise over‐parameterization convolutional layer; training the neural network using training data according to a training method; and selectively combining the parameters associated with the depth‐wise over‐parameterization layer to obtain a new convolutional layer, and replacing the depth‐wise over‐parameterization layer by the new convolutional layer in the trained  neural network, the trained neural network being configured to perform inferences for a specific application based on input data.
Clause 2: The method of Clause 1, wherein performing the depth‐wise over‐parameterization comprises: transforming a parameter matrix associated with the at least one convolutional layer into at least two separate matrices; and associating the at least two separate matrices with the depth‐wise over‐parameterization convolutional layer.
Clause 3: The method of Clause 2, wherein selectively combining the parameters associated with the depth‐wise over‐parameterization layer to obtain the new convolutional layer comprises: combining the at least two separate matrices into a single parameter matrix; and associating the single parameter matrix with the new convolutional layer.
Clause 4: The method of Clause 1, wherein performing the depth‐wise over‐parameterization comprises performing over‐parameterization on spatial dimensions of a parameter matrix associated with the at least one convolutional layer to obtain at least two separate matrices.
Clause 5: The method of Clause 4, wherein the parameter matrix associated with the at least one convolutional layer comprises a plurality of channels representing a color space, and performing the over‐parameterization on the spatial dimensions of the parameter matrix comprises: performing the over‐parameterization independently on different channels of the plurality of channels to obtain different over‐parameterization matrices for the different channels; or performing an identical over‐parameterization on the different channels to obtain a same over‐parameterization matrix for the different channels.
Clause 6: The method of Clause 1, further comprising performing channel‐wise over‐parameterization on the at least one convolutional layer and/or another convolutional layer of the one or more convolutional layers.
Clause 7: The method of Clause 1, wherein the one or more other layers comprise at least one of: one or more fully connected layers, one or more pooling layers, one or more activation layers, or one or more batch normalization layers.
Clause 8: The method of Clause 1, wherein a number of parameters that are associated with the depth‐wise over‐parameterization convolutional layer and are trainable is higher as compared to a number of parameters of the at least one convolutional layer.
Clause 9: The method of Clause 1, wherein the specific application comprises an image classification, an object detection, or a semantic segmentation.
Clause 10: One or more computer readable media storing executable instructions that, when executed by one or more processors, cause the one or more processors to perform acts comprising: obtaining information of a neural network that includes one or more convolutional layers and one or more other layers; performing depth‐wise over‐parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth‐wise over‐parameterization convolutional layer; training the neural network using training data according to a training method; and selectively combining the parameters associated with the depth‐wise over‐parameterization layer to obtain a new convolutional layer, and  replacing the depth‐wise over‐parameterization layer by the new convolutional layer in the trained neural network, the trained neural network being configured to perform inferences for a specific application based on input data.
Clause 11: The one or more computer readable media of Clause 10, wherein performing the depth‐wise over‐parameterization comprises: transforming a parameter matrix associated with the at least one convolutional layer into at least two separate matrices; and associating the at least two separate matrices with the depth‐wise over‐parameterization convolutional layer.
Clause 12: The one or more computer readable media of Clause 11, wherein selectively combining the parameters associated with the depth‐wise over‐parameterization layer to obtain the new convolutional layer comprises: combining the at least two separate matrices into a single parameter matrix; and associating the single parameter matrix with the new convolutional layer.
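The transform/combine pair in Clauses 11 and 12 can be sketched in a few lines of NumPy. This is an illustration under assumed shapes (a 3x3 kernel slice for one channel, inner dimension M = 12), not the patent's implementation: a single kernel vector is replaced by two separately trainable factors, and after training the factors are multiplied back into one kernel so inference costs the same as an ordinary convolution.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 3, 12          # kernel size and inner dimension; both are assumptions

x = rng.normal(size=K * K)            # one input patch, flattened

# Transform (Clause 11): replace a single kernel vector by two
# separately trainable factors D (K*K x M) and v (M).
D = rng.normal(size=(K * K, M))
v = rng.normal(size=M)

# Over-parameterized forward pass: a depth-wise stage (project the
# patch through D) followed by a combining stage (weight by v).
y_two_stage = (D.T @ x) @ v

# Combine (Clause 12): fold the two factors into a single kernel
# vector, associating it with the new convolutional layer.
w_folded = D @ v
y_folded = x @ w_folded

# The folded kernel reproduces the two-stage computation exactly,
# because both are linear in the input patch.
assert np.allclose(y_two_stage, y_folded)
```

The equivalence is just associativity: x · (D v) = (Dᵀ x) · v, which is why the combination can be done once after training without changing the network's function.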
Clause 13: The one or more computer readable media of Clause 10, wherein performing the depth-wise over-parameterization comprises performing over-parameterization on spatial dimensions of a parameter matrix associated with the at least one convolutional layer to obtain at least two separate matrices.
Clause 14: The one or more computer readable media of Clause 13, wherein the parameter matrix associated with the at least one convolutional layer comprises a plurality of channels representing a color space, and performing the over-parameterization on the spatial dimensions of the parameter matrix comprises: performing the over-parameterization independently on different channels of the plurality of channels to obtain different over-parameterization matrices for the different channels; or performing an identical over-parameterization on the different channels to obtain a same over-parameterization matrix for the different channels.
Clause 15: The one or more computer readable media of Clause 10, wherein the acts further comprise performing channel-wise over-parameterization on the at least one convolutional layer and/or another convolutional layer of the one or more convolutional layers.
Clause 16: The one or more computer readable media of Clause 10, wherein the one or more other layers comprise at least one of: one or more fully connected layers, one or more pooling layers, one or more activation layers, or one or more batch normalization layers.
Clause 17: A system comprising: one or more processors; and memory storing executable instructions that, when executed by the one or more processors, cause the one or more processors to perform acts comprising: obtaining information of a neural network that includes one or more convolutional layers and one or more other layers; performing depth‐wise over‐parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth‐wise over‐parameterization convolutional layer; training the neural network using training data according to a training method; and selectively combining the parameters associated with the depth‐wise over‐parameterization layer to obtain a new convolutional layer, and replacing the depth‐wise over‐parameterization layer by the new convolutional layer in the trained neural network, the trained neural network being configured to perform inferences for a specific application based on input data.
Clause 18: The system of Clause 17, wherein performing the depth‐wise over‐parameterization comprises: transforming a parameter matrix associated with the at least one convolutional layer into at least two separate matrices; and associating the at least two separate matrices with the depth‐wise over‐parameterization convolutional layer.
Clause 19: The system of Clause 18, wherein selectively combining the parameters associated with the depth-wise over-parameterization layer to obtain the new convolutional layer comprises: combining the at least two separate matrices into a single parameter matrix; and associating the single parameter matrix with the new convolutional layer.

Claims (19)

  1. A method implemented by one or more computing devices, the method comprising:
    obtaining information of a neural network that includes one or more convolutional layers and one or more other layers;
    performing depth‐wise over‐parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth‐wise over‐parameterization convolutional layer;
    training the neural network using training data according to a training method; and
    selectively combining the parameters associated with the depth‐wise over‐parameterization layer to obtain a new convolutional layer, and replacing the depth‐wise over‐parameterization layer by the new convolutional layer in the trained neural network, the trained neural network being configured to perform inferences for a specific application based on input data.
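The full claim-1 pipeline (over-parameterize, train, then fold for inference) can be illustrated end to end with a toy NumPy example. A linear least-squares problem stands in for one channel of a convolutional layer, and the shapes, initialization scale, learning rate, and step count are all assumptions chosen for the sketch, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng(1)
K2, M = 9, 12                 # flattened 3x3 kernel, inner dimension (assumptions)

# Toy "training data": a linear regression standing in for a layer.
X = rng.normal(size=(64, K2))
y = X @ rng.normal(size=K2)

# Depth-wise over-parameterization: two trainable factors in place
# of a single kernel vector.
D = 0.1 * rng.normal(size=(K2, M))
v = 0.1 * rng.normal(size=M)

lr, losses = 0.01, []
for _ in range(500):
    w = D @ v                              # composed kernel, forward pass
    err = X @ w - y
    losses.append(float(err @ err) / len(X))
    grad_w = 2 * X.T @ err / len(X)
    # Backpropagate through the composition w = D @ v; compute both
    # factor gradients before updating either factor.
    gD, gv = np.outer(grad_w, v), D.T @ grad_w
    D -= lr * gD
    v -= lr * gv

# Fold the trained factors into a single kernel and use it as the
# new convolutional layer; it reproduces the trained composition.
w_new = D @ v
assert losses[-1] < losses[0]
assert np.allclose(X @ w_new, (X @ D) @ v)
```

The extra parameters exist only during training; at inference time the folded kernel has the same shape and cost as the original layer.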
  2. The method of claim 1, wherein performing the depth‐wise over‐parameterization comprises:
    transforming a parameter matrix associated with the at least one convolutional layer into at least two separate matrices; and
    associating the at least two separate matrices with the depth‐wise over‐parameterization convolutional layer.
  3. The method of claim 2, wherein selectively combining the parameters associated with the depth‐wise over‐parameterization layer to obtain the new convolutional layer comprises:
    combining the at least two separate matrices into a single parameter matrix; and
    associating the single parameter matrix with the new convolutional layer.
  4. The method of claim 1, wherein performing the depth‐wise over‐parameterization comprises performing over‐parameterization on spatial dimensions of a parameter matrix associated with the at least one convolutional layer to obtain at least two separate matrices.
  5. The method of claim 4, wherein the parameter matrix associated with the at least one convolutional layer comprises a plurality of channels representing a color space, and performing the over‐parameterization on the spatial dimensions of the parameter matrix comprises:
    performing the over‐parameterization independently on different channels of the plurality of channels to obtain different over‐parameterization matrices for the different channels; or
    performing an identical over-parameterization on the different channels to obtain a same over‐parameterization matrix for the different channels.
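The two branches of claim 5 differ only in whether each channel gets its own over-parameterization matrix or all channels share one. A NumPy sketch under assumed shapes (3 channels standing in for a color space, a flattened 3x3 kernel, inner dimension 12; none of these values come from the patent):

```python
import numpy as np

rng = np.random.default_rng(2)
C, K2, M = 3, 9, 12     # channels, flattened kernel size, inner dim (assumptions)

v = rng.normal(size=(C, M))            # combining factors, one per channel

# Branch 1: independent over-parameterization per channel, giving a
# different over-parameterization matrix for each channel.
D_indep = rng.normal(size=(C, K2, M))
w_indep = np.einsum('ckm,cm->ck', D_indep, v)

# Branch 2: an identical over-parameterization applied to every
# channel, i.e. a single matrix shared across channels.
D_shared = rng.normal(size=(K2, M))
w_shared = np.einsum('km,cm->ck', D_shared, v)

# Either branch folds back to one K*K kernel slice per channel.
assert w_indep.shape == w_shared.shape == (C, K2)
```

Branch 1 trades more trainable parameters for per-channel flexibility; branch 2 keeps the over-parameterization overhead independent of the channel count.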
  6. The method of claim 1, further comprising performing channel‐wise over‐parameterization on the at least one convolutional layer and/or another convolutional layer of the one or more convolutional layers.
  7. The method of claim 1, wherein the one or more other layers comprise at least one of: one or more fully connected layers, one or more pooling layers, one or more activation layers, or one or more batch normalization layers.
  8. The method of claim 1, wherein a number of trainable parameters associated with the depth‐wise over‐parameterization convolutional layer is higher than a number of trainable parameters associated with the at least one convolutional layer.
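The parameter-count relationship in claim 8 is simple arithmetic. For one K x K kernel slice factored with inner dimension M (both values are assumptions for the sketch):

```python
# Trainable parameters for one kernel slice, before and after
# depth-wise over-parameterization (claim 8). K and M are assumptions.
K, M = 3, 12

plain = K * K                 # ordinary convolutional kernel slice: 9
overparam = K * K * M + M     # factor D (K*K x M) plus factor v (M): 120

assert overparam > plain      # the over-parameterized layer trains more parameters

# After folding (claims 3, 12, and 19) the inference-time kernel is
# back to K*K parameters, so inference cost is unchanged.
folded = K * K
assert folded == plain
```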
  9. The method of claim 1, wherein the specific application comprises an image recognition, an image classification, an object detection, or a semantic segmentation.
  10. One or more computer readable media storing executable instructions that, when executed by one or more processors, cause the one or more processors to perform acts comprising:
    obtaining information of a neural network that includes one or more convolutional layers and one or more other layers;
    performing depth‐wise over‐parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth‐wise over‐parameterization convolutional layer;
    training the neural network using training data according to a training method; and
    selectively combining the parameters associated with the depth‐wise over‐parameterization layer to obtain a new convolutional layer, and replacing the depth‐wise over‐parameterization layer by the new convolutional layer in the trained neural network, the trained neural network being configured to perform inferences for a specific application based on input data.
  11. The one or more computer readable media of claim 10, wherein performing the depth‐wise over‐parameterization comprises:
    transforming a parameter matrix associated with the at least one convolutional layer into at least two separate matrices; and
    associating the at least two separate matrices with the depth‐wise over‐parameterization convolutional layer.
  12. The one or more computer readable media of claim 11, wherein selectively combining the parameters associated with the depth‐wise over‐parameterization layer to obtain the new convolutional layer comprises:
    combining the at least two separate matrices into a single parameter matrix; and
    associating the single parameter matrix with the new convolutional layer.
  13. The one or more computer readable media of claim 10, wherein performing the depth‐wise over‐parameterization comprises performing over‐parameterization on spatial dimensions of a parameter matrix associated with the at least one convolutional layer to obtain at least two separate matrices.
  14. The one or more computer readable media of claim 13, wherein the parameter matrix associated with the at least one convolutional layer comprises a plurality of channels representing a color space, and performing the over‐parameterization on the spatial dimensions of the parameter matrix comprises:
    performing the over‐parameterization independently on different channels of the plurality of channels to obtain different over‐parameterization matrices for the different channels; or
    performing an identical over‐parameterization on the different channels to obtain a same over‐parameterization matrix for the different channels.
  15. The one or more computer readable media of claim 10, wherein the acts further comprise performing channel‐wise over‐parameterization on the at least one convolutional layer and/or another convolutional layer of the one or more convolutional layers.
  16. The one or more computer readable media of claim 10, wherein the one or more other layers comprise at least one of: one or more fully connected layers, one or more pooling layers, one or more activation layers, or one or more batch normalization layers.
  17. A system comprising:
    one or more processors; and
    memory storing executable instructions that, when executed by the one or more processors, cause the one or more processors to perform acts comprising:
    obtaining information of a neural network that includes one or more convolutional layers and one or more other layers;
    performing depth‐wise over‐parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth‐wise over‐parameterization convolutional layer;
    training the neural network using training data according to a training method; and
    selectively combining the parameters associated with the depth‐wise over‐parameterization layer to obtain a new convolutional layer, and replacing the depth‐wise over‐parameterization layer by the new convolutional layer in the trained neural network, the trained neural network being configured to perform inferences for a specific application based on input data.
  18. The system of claim 17, wherein performing the depth‐wise over‐parameterization comprises:
    transforming a parameter matrix associated with the at least one convolutional layer into at least two separate matrices; and
    associating the at least two separate matrices with the depth‐wise over‐parameterization convolutional layer.
  19. The system of claim 18, wherein selectively combining the parameters associated with the depth‐wise over‐parameterization layer to obtain the new convolutional layer comprises:
    combining the at least two separate matrices into a single parameter matrix; and
    associating the single parameter matrix with the new convolutional layer.
PCT/CN2020/097221 2020-06-19 2020-06-19 Depth-wise over-parameterization WO2021253440A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080100158.XA CN115461754A (en) 2020-06-19 2020-06-19 Depth over-parameterization
PCT/CN2020/097221 WO2021253440A1 (en) 2020-06-19 2020-06-19 Depth-wise over-parameterization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/097221 WO2021253440A1 (en) 2020-06-19 2020-06-19 Depth-wise over-parameterization

Publications (1)

Publication Number Publication Date
WO2021253440A1 true WO2021253440A1 (en) 2021-12-23

Family

ID=79269041

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/097221 WO2021253440A1 (en) 2020-06-19 2020-06-19 Depth-wise over-parameterization

Country Status (2)

Country Link
CN (1) CN115461754A (en)
WO (1) WO2021253440A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2623551A (en) * 2022-10-20 2024-04-24 Continental Automotive Tech Gmbh Systems and methods for learning neural networks for embedded applications

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100272311A1 (en) * 2007-02-14 2010-10-28 Tal Nir Over-Parameterized Variational Optical Flow Method
CN110084356A (en) * 2018-01-26 2019-08-02 北京深鉴智能科技有限公司 A kind of deep neural network data processing method and device
CN110263909A (en) * 2018-03-30 2019-09-20 腾讯科技(深圳)有限公司 Image-recognizing method and device
CN111178626A (en) * 2019-12-30 2020-05-19 苏州科技大学 Building energy consumption prediction method and monitoring prediction system based on WGAN algorithm

Also Published As

Publication number Publication date
CN115461754A (en) 2022-12-09

Similar Documents

Publication Publication Date Title
JP7462623B2 (en) System and method for accelerating and embedding neural networks using activity sparsification
Habib et al. Optimization and acceleration of convolutional neural networks: A survey
US20210166112A1 (en) Method for neural network and apparatus performing same method
US11593658B2 (en) Processing method and device
EP4036803A1 (en) Neural network model processing method and apparatus, computer device, and storage medium
US20190278600A1 (en) Tiled compressed sparse matrix format
CN107622303B (en) Method for neural network and device for performing the method
CN111989697A (en) Neural hardware accelerator for parallel and distributed tensor computation
CN112673383A (en) Data representation of dynamic precision in neural network cores
US20210012178A1 (en) Systems, methods, and devices for early-exit from convolution
US11295236B2 (en) Machine learning in heterogeneous processing systems
US11934949B2 (en) Composite binary decomposition network
US20210125071A1 (en) Structured Pruning for Machine Learning Model
US11429394B2 (en) Efficient multiply-accumulation based on sparse matrix
de Prado et al. Automated design space exploration for optimized deployment of dnn on arm cortex-a cpus
WO2021253440A1 (en) Depth-wise over-parameterization
US11710042B2 (en) Shaping a neural network architecture utilizing learnable sampling layers
JP7150651B2 (en) Neural network model reducer
US11481604B2 (en) Apparatus and method for neural network processing
Sun et al. Computation on sparse neural networks: an inspiration for future hardware
Sun et al. Computation on sparse neural networks and its implications for future hardware
US11704562B1 (en) Architecture for virtual instructions
US11900239B2 (en) Systems and methods for accelerating sparse neural network execution
CN115601513A (en) Model hyper-parameter selection method and related device
WO2021120036A1 (en) Data processing apparatus and data processing method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20940842

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20940842

Country of ref document: EP

Kind code of ref document: A1