WO2021253440A1 - Depth-wise over-parameterization - Google Patents


Info

Publication number
WO2021253440A1
Authority
WO
WIPO (PCT)
Prior art keywords
parameterization
over
convolutional layer
depth
neural network
Application number
PCT/CN2020/097221
Other languages
French (fr)
Inventor
Yangyan LI
Ying Chen
Jinming CAO
Original Assignee
Alibaba Group Holding Limited
Priority date
Application filed by Alibaba Group Holding Limited filed Critical Alibaba Group Holding Limited
Priority to CN202080100158.XA priority Critical patent/CN115461754A/en
Priority to PCT/CN2020/097221 priority patent/WO2021253440A1/en
Publication of WO2021253440A1 publication Critical patent/WO2021253440A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Definitions

  • Convolutional neural networks are a type of deep learning method, which are capable of expressing highly complicated functions.
  • a convolutional neural network may receive an image as an input, assign importance (i.e., learnable weights and biases) to various aspects/objects in the image, and differentiate one aspect/object from other aspects/objects. Accordingly, convolutional neural networks have been widely used in a variety of computer vision applications, such as image classification, object detection, and semantic segmentation.
  • a convolutional neural network is made up of a plurality of distinct layers.
  • the accuracy of a convolutional neural network relies heavily on the depth of the network, i.e., the larger the number of layers, the higher the accuracy of the network will be.
  • increasing the number of layers in a convolutional neural network may increase the complexity of the network, and lead to a higher amount of computations, an increased cost (in terms of processing and storage resources) , and a longer delay in computations, etc.
  • the increased complexity of the convolutional neural network may increase the time taken for convergence of different variables or parameters of the convolutional neural network (i.e., the time taken for training the convolutional neural network) , and the time taken for computing a result when a prediction or inference is subsequently performed using the convolutional neural network.
  • an over-parameterization may be performed on spatial dimensions of an original parameter matrix of a convolutional layer of a neural network, to convert the original parameter matrix of the convolutional layer into two parameter matrices: a first parameter matrix and a second parameter matrix.
  • training may then be performed on the neural network to determine values for learnable parameters of the first parameter matrix and the second parameter matrix.
  • the first parameter matrix and the second parameter matrix may be combined into a single parameter matrix and treated as a single convolutional layer, which may then be used in subsequent inference applications of the neural network.
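The fold described above relies only on linearity, so it can be sketched in a few lines of NumPy. This is an illustrative sketch, not the patent's exact formulation: the names `D` (a per-input-channel spatial over-parameterization matrix, the "first parameter matrix") and `W` (the original kernel, the "second parameter matrix") are assumptions, and the patch is kept as a flattened spatial vector per channel.

```python
import numpy as np

# Hedged sketch of depth-wise over-parameterization on the spatial dimensions
# of a convolution kernel. Names (D, W, fold) are illustrative.
# W: original kernel, shape (C_out, C_in, M*N)  -- spatial dims flattened
# D: depth-wise over-parameterization, shape (C_in, M*N, M*N)
#    (one spatial mixing matrix per input channel)

def fold(D, W):
    """Combine the two parameter matrices into a single kernel with the
    same shape as W: W_folded[:, c, :] = W[:, c, :] @ D[c]."""
    W_folded = np.empty_like(W)
    for c in range(W.shape[1]):
        W_folded[:, c, :] = W[:, c, :] @ D[c]
    return W_folded

rng = np.random.default_rng(0)
C_out, C_in, MN = 4, 3, 9
W = rng.normal(size=(C_out, C_in, MN))
D = rng.normal(size=(C_in, MN, MN))

# Applying D to an input patch first and then W gives the same output as the
# folded kernel applied directly -- linearity lets the two layers collapse.
patch = rng.normal(size=(C_in, MN))        # one M*N patch per input channel
out_two_layer = sum(W[:, c, :] @ (D[c] @ patch[c]) for c in range(C_in))
out_folded = sum(fold(D, W)[:, c, :] @ patch[c] for c in range(C_in))
assert np.allclose(out_two_layer, out_folded)
```

Because the fold happens once after training, inference runs on a single kernel of the original size.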
  • FIG. 1 illustrates an example environment in which an example over-parameterization system may be used.
  • FIG. 2A illustrates the example over-parameterization system in more detail.
  • FIG. 2B illustrates an example neural network processing architecture that can be used for implementing the example over-parameterization system.
  • FIG. 2C illustrates an example cloud system that incorporates the example neural network processing architecture to implement the example over-parameterization system.
  • FIG. 3 illustrates an example neural network in which an over-parameterization may be performed.
  • FIG. 4 illustrates an example normal convolution.
  • FIG. 5 illustrates an example depth-wise convolution.
  • FIGS. 6A and 6B illustrate example ways of applying a depth-wise over-parameterization operator.
  • FIGS. 7A and 7B illustrate example ways of obtaining a depth-wise over-parameterized depth-wise convolutional layer.
  • FIG. 8 illustrates an example method of over-parameterization.
  • the over-parameterization system may virtually convert a convolutional layer of a neural network into at least two linear layers, which are combined back into a single layer after training.
  • the over-parameterization system may perform an over-parameterization on spatial dimensions of a parameter matrix that represents the convolutional layer to obtain a first parameter matrix and a second parameter matrix, which virtually correspond to the two linear layers.
  • the first parameter matrix may constitute an additional layer that is related to depth-wise over-parameterization.
  • the second parameter matrix may have a size that is the same as that of the original parameter matrix of the convolutional layer.
  • the over-parameterization system may initialize trainable parameters of the neural network randomly or based on a priori knowledge.
  • the trainable parameters may include, but are not limited to, additional parameters included in the first parameter matrix, parameters of the second parameter matrix, and model parameters of other layers of the neural network.
  • the over-parameterization system may then perform training of the entire neural network based on a training algorithm (e.g., a gradient descent based optimization) using training data to obtain resulting values of the trainable parameters, i.e., a trained neural network.
  • the over-parameterization system may combine the at least two linear layers back into the convolutional layer. For example, the over-parameterization system may linearly combine the first parameter matrix and the second parameter matrix into a single parameter matrix, which represents the convolutional layer. The over-parameterization system or another system may then employ the trained neural network to perform inferences or predictions according to an intended application (such as an image classification, etc.).
  • the over-parameterization system involves increasing the number of trainable parameters associated with a convolutional layer of a neural network by performing an over-parameterization on a parameter matrix of the convolutional layer to form at least two parameter matrices.
  • the over-parameterization (i.e., adding a linear layer) actually helps accelerate the convergence of trainable parameters of the neural network, and thus increases the speed of training.
  • because the at least two parameter matrices are combined into a single parameter matrix (i.e., the at least two linear layers represented by the at least two parameter matrices are combined back into a single convolutional layer) after training, an amount of computations associated with such a single convolutional layer is equivalent or similar to that of a convolutional layer without the over-parameterization, when the trained neural network is employed for performing inferences or predictions according to an intended application (such as an image classification, etc.).
  • the functions described herein to be performed by the over-parameterization system may be performed by multiple separate units or services.
  • while in some examples the over-parameterization system may be implemented as a combination of software and hardware implemented and distributed in multiple devices, in other examples the over-parameterization system may be implemented and distributed as services provided in one or more computing devices over a network and/or in a cloud computing architecture.
  • the application describes multiple and varied embodiments and implementations.
  • the following section describes an example framework that is suitable for practicing various implementations.
  • the application describes example systems, devices, and processes for implementing an over-parameterization system.
  • FIG. 1 illustrates an example environment 100 usable to implement an over-parameterization system.
  • the environment 100 may include an over-parameterization system 102.
  • the over-parameterization system 102 may include a plurality of servers 104-1, 104-2, ..., 104-N (which are collectively called the servers 104).
  • the servers 104 may communicate data with one another via a network 106.
  • each of the servers 104 may be implemented as any of a variety of computing devices, including, but not limited to, a desktop computer, a notebook or portable computer, a handheld device, a netbook, an Internet appliance, a tablet or slate computer, a mobile device (e.g., a mobile phone, a personal digital assistant, a smart phone, etc.), a server computer, etc., or a combination thereof.
  • the network 106 may be a wireless or a wired network, or a combination thereof.
  • the network 106 may be a collection of individual networks interconnected with each other and functioning as a single large network (e.g., the Internet or an intranet) . Examples of such individual networks include, but are not limited to, telephone networks, cable networks, Local Area Networks (LANs) , Wide Area Networks (WANs) , and Metropolitan Area Networks (MANs) . Further, the individual networks may be wireless or wired networks, or a combination thereof.
  • Wired networks may include an electrical carrier connection (such as a communication cable, etc.) and/or an optical carrier or connection (such as an optical fiber connection, etc.).
  • Wireless networks may include, for example, a WiFi network, other radio frequency networks (e.g., Zigbee, etc. ) , etc.
  • the environment 100 may further include a client device 108.
  • the client device 108 may be implemented as any of a variety of computing devices, including, but not limited to, a desktop computer, a notebook or portable computer, a handheld device, a netbook, an Internet appliance, a tablet or slate computer, a mobile device (e.g., a mobile phone, a personal digital assistant, a smart phone, etc.), a server computer, etc., or a combination thereof.
  • the over-parameterization system 102 may receive a request for training a neural network (such as a convolutional neural network) from the client device 108. The over-parameterization system 102 may then perform training of the neural network according to the request from the client device 108.
  • FIG. 2A illustrates the over-parameterization system 102 in more detail.
  • the over-parameterization system 102 may include, but is not limited to, one or more processors 202, an input/output (I/O) interface 204, and/or a network interface 206, and memory 208.
  • some of the functions of the over-parameterization system 102 may be implemented using hardware, for example, an ASIC (i.e., Application-Specific Integrated Circuit), an FPGA (i.e., Field-Programmable Gate Array), and/or other hardware.
  • the processors 202 may be configured to execute instructions that are stored in the memory 208, and/or received from the I/O interface 204 and/or the network interface 206.
  • the processors 202 may be implemented as one or more hardware processors including, for example, a microprocessor, an application-specific instruction-set processor, a physics processing unit (PPU), a central processing unit (CPU), a graphics processing unit, a digital signal processor, a tensor processing unit, a neural processing unit, etc. Additionally or alternatively, the functionality described herein can be performed, at least in part, by one or more hardware logic components, such as field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip systems (SOCs), and complex programmable logic devices (CPLDs).
  • the memory 208 may include computer readable media in a form of volatile memory, such as Random Access Memory (RAM), and/or non-volatile memory, such as read only memory (ROM) or flash RAM.
  • the computer readable media may include volatile or non-volatile, removable or non-removable media, which may achieve storage of information using any method or technology.
  • the information may include a computer readable instruction, a data structure, a program module or other data.
  • Examples of computer readable media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electronically erasable programmable read-only memory (EEPROM), quick flash memory or other internal storage technology, compact disk read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassette tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission media, which may be used to store information that may be accessed by a computing device.
  • the computer readable media does not include any transitory media, such as modulated data signals and carrier waves.
  • the over-parameterization system 102 may further include other hardware components and/or other software components, such as program units to execute instructions stored in the memory 208 for performing various operations.
  • the over-parameterization system 102 may further include a parameter database 210 for storing parameter data of one or more machine learning models (such as neural network models, etc.), a training database 212 for storing training data, and other program data 214.
  • FIG. 2B illustrates an example neural network processing architecture 216 that can be used for implementing the over-parameterization system 102.
  • the neural network processing architecture 216 (such as the architecture used for a neural processing unit) may include a heterogeneous computation unit (HCU) 218, a host unit 220, and a host memory 222.
  • the heterogeneous computation unit 218 may include a special-purpose computing device or hardware used for facilitating and performing neural network computing tasks.
  • the heterogeneous computation unit 218 may perform algorithmic operations, including operations associated with machine learning algorithms.
  • the heterogeneous computation unit 218 may be an accelerator, which may include, but is not limited to, a neural network processing unit (NPU), a graphics processing unit (GPU), a tensor processing unit (TPU), a microprocessor, an application-specific instruction-set processor, a physics processing unit (PPU), a digital signal processor, etc.
  • the heterogeneous computation unit 218 may include one or more computing units 224, a memory hierarchy 226, a controller 228, and an interconnect unit 230.
  • the computing unit 224 may access the memory hierarchy 226 to read and write data in the memory hierarchy 226, and may further perform operations, such as arithmetic operations (e.g., multiplication, addition, multiply-accumulate, etc.), on the data.
  • the computing unit 224 may further include a plurality of engines that are configured to perform various types of operations.
  • the computing unit 224 may include a scalar engine 232 and a vector engine 234.
  • the scalar engine 232 may perform scalar operations such as scalar product, convolution, etc.
  • the vector engine 234 may perform vector operations such as vector addition, vector product, etc.
  • the memory hierarchy 226 may include an on-chip memory (such as four blocks of 8 GB second-generation high bandwidth memory (HBM2)) to serve as a main memory.
  • the memory hierarchy 226 may be configured to store data and executable instructions, and allow other components of the neural network processing architecture 216 (e.g., the host unit 220), components of the heterogeneous computation unit 218 (e.g., the computing units 224 and the interconnect unit 230), and/or devices external to the neural network processing architecture 216 to access the stored data and/or the stored instructions with high speed.
  • the interconnect unit 230 may provide or facilitate communications of data and/or instructions between the heterogeneous computation unit 218 and other devices or units (e.g., the host unit 220, one or more other HCU(s)) that are external to the heterogeneous computation unit 218.
  • the interconnect unit 230 may include a peripheral component interconnect express (PCIe) interface 236 and an inter-chip connection 238.
  • the PCIe interface 236 may provide or facilitate communications of data and/or instructions between the heterogeneous computation unit 218 and the host unit 220.
  • the inter-chip connection 238 may serve as an inter-chip bus to connect the heterogeneous computation unit 218 with other devices, such as other HCUs, an off-chip memory, and/or peripheral devices.
  • the controller 228 may be configured to control and coordinate operations of other components included in the heterogeneous computation unit 218.
  • the controller 228 may control and coordinate different components in the heterogeneous computation unit 218 (such as the scalar engine 232, the vector engine 234, and/or the interconnect unit 230) to facilitate parallelism or synchronization among these components.
  • the host memory 222 may be an off-chip memory, such as a memory of one or more processing units of a host system or device that includes the neural network processing architecture 216.
  • the host memory 222 may include a DDR memory (e.g., DDR SDRAM) or the like, and may be configured to store a large amount of data with slower access speed, as compared to an on-chip memory that is integrated within the one or more processing units, to act as a higher-level cache.
  • the host unit 220 may include one or more processing units (e.g., an X86 central processing unit (CPU) ) .
  • the host system or device having the host unit 220 and the host memory 222 may further include a compiler (not shown) .
  • the compiler may be a program or computer software configured to convert computer codes written in a certain programming language into instructions that are readable and executable by the heterogeneous computation unit 218. In machine learning applications, the compiler may perform a variety of operations, which may include, but are not limited to, pre-processing, lexical analysis, parsing, semantic analysis, conversion of an input program to an intermediate representation, code optimization, and code generation, or any combination thereof.
  • FIG. 2C illustrates an example cloud system 240 that incorporates the neural network processing architecture 216 to implement the over-parameterization system 102.
  • the cloud system 240 may provide cloud services with machine learning and artificial intelligence (AI) capabilities, and may include a plurality of servers, e.g., servers 242-1, 242-2, ..., 242-K (which are collectively called the servers 242), where K is a positive integer.
  • one or more of the servers 242 may include the neural network processing architecture 216.
  • the cloud system 240 may provide part or all of the functionalities of the over-parameterization system 102, and other machine learning and artificial intelligence capabilities, such as image recognition, facial recognition, translation, 3D modeling, etc.
  • the neural network processing architecture 216 that provides some or all of the functionalities of the over-parameterization system 102 may be deployed in other types of computing devices, which may include, but are not limited to, a mobile device, a tablet computer, a wearable device, a desktop computer, etc.
  • FIG. 3 illustrates an example neural network model 300 (or simply a neural network 300) in which an over-parameterization may be performed.
  • although a convolutional neural network model is described in this example, the present disclosure may also be applicable to other types of neural network models that involve convolution, and/or neural network models having more or fewer types of layers than those described hereinafter.
  • the neural network 300 may include a plurality of building blocks or distinct types of layers.
  • the plurality of building blocks or distinct types of layers may include, but are not limited to, one or more convolutional layers 302 (only one is shown for the sake of simplicity), one or more pooling layers 304 (only one is shown for the sake of simplicity), and a plurality of fully connected layers 306-1, ..., 306-S (collectively called the fully connected layers 306), where S is an integer greater than one.
  • an input and an output of a layer may be depicted as a feature map, which may be a tensor in R^(H×W×C), where H, W, and C represent a height, a width, and a number of channels of the feature map. Dimensions of the height and the width may define a resolution of the feature map, and may be referred to as spatial dimensions.
  • an input feature map may be denoted as a tensor x ∈ R^(H_in×W_in×C_in), where H_in, W_in, and C_in represent a height, a width, and a number of channels of the input feature map.
  • an output feature map may be denoted as a tensor y ∈ R^(H_out×W_out×C_out), where H_out, W_out, and C_out represent a height, a width, and a number of channels of the output feature map.
  • a layer of the neural network may be defined as a matrix which converts the input tensor x into the output tensor y, and such a matrix may be denoted as a parameter matrix W.
  • the fully connected layer 306 may connect each element in an input tensor x to each element in an output tensor y. This may be achieved through a parameter matrix W ∈ R^((H_out·W_out·C_out)×(H_in·W_in·C_in)). More specifically, ŷ = W x̂, where x̂ and ŷ are reshaped x and y, respectively.
  • a tensor can be reshaped without the content thereof being changed.
  • a computation of a fully connected layer may be a matrix ⁇ vector product.
  • a hyper-parameter involved in a fully connected layer is the number of output elements, i.e., H_out × W_out × C_out.
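The fully connected layer above reduces to a reshape, a matrix-vector product, and a reshape back. A minimal NumPy sketch (sizes are illustrative):

```python
import numpy as np

# Hedged sketch of a fully connected layer: the input feature map x
# (H_in x W_in x C_in) is flattened into a vector, multiplied by the
# parameter matrix W, and the result reshaped into H_out x W_out x C_out.
rng = np.random.default_rng(0)
H_in, W_in, C_in = 4, 4, 3
H_out, W_out, C_out = 2, 2, 5

x = rng.normal(size=(H_in, W_in, C_in))
W = rng.normal(size=(H_out * W_out * C_out, H_in * W_in * C_in))

y = (W @ x.reshape(-1)).reshape(H_out, W_out, C_out)  # matrix-vector product
assert y.shape == (H_out, W_out, C_out)
```

Note the parameter count, H_out·W_out·C_out × H_in·W_in·C_in, grows with the spatial resolution of both feature maps, which motivates the convolutional alternative below.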
  • the plurality of fully connected layers 306 may be applied at the end of the neural network 300, after a size (e.g., a spatial resolution) of an input feature map to a first fully connected layer of the plurality of fully connected layers 306 is significantly reduced from an initial value (e.g., an original size or resolution of an image in an image recognition application, etc. ) at the beginning of the neural network 300.
  • the one or more convolutional layers 302 may be applied at the beginning of the neural network 300.
  • for a convolutional layer 302, instead of connecting each element in an input tensor x to each element in an output tensor y, patches of x may be connected to elements in y by a convolution operator. For instance, if a patch of a spatial size of M×N that is sampled from a spatial location (h_in, w_in) of x is denoted as P_(h_in, w_in) ∈ R^(M×N×C_in), and a parameter matrix of the convolutional layer is denoted as W ∈ R^(C_out×(M·N·C_in)), an output of the convolutional layer may be depicted as: y_(h_out, w_out) = W P̂_(h_in, w_in), where P̂_(h_in, w_in) is the patch reshaped into a vector.
  • each element in y may be connected to one patch in x, and thus a size of the parameter matrix does not depend on a spatial resolution of an input feature map of the convolutional layer; it depends only on the size of the patch and the numbers of input and output channels.
  • compared with a fully connected layer, a parameter matrix of a convolutional layer is much smaller, thus reducing the number of parameters to be trained or optimized.
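The patch-based view above is exactly the "im2col" formulation: gather every M×N patch, flatten it, and multiply by the shared parameter matrix. A hedged NumPy sketch (stride 1, no padding; sizes are illustrative):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hedged sketch of the convolution above: every M x N x C_in patch of the
# input is connected to one element of each output channel through a shared
# parameter matrix W of shape (C_out, M*N*C_in).
def conv2d(x, W, M, N):
    patches = sliding_window_view(x, (M, N), axis=(0, 1))  # (H_out, W_out, C_in, M, N)
    H_out, W_out = patches.shape[0], patches.shape[1]
    # reorder so each flattened patch is laid out as M*N*C_in
    cols = patches.transpose(0, 1, 3, 4, 2).reshape(H_out, W_out, -1)
    return cols @ W.T                                      # (H_out, W_out, C_out)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6, 3))
W = rng.normal(size=(4, 3 * 3 * 3))   # C_out = 4, kernel 3x3, C_in = 3
y = conv2d(x, W, 3, 3)
assert y.shape == (4, 4, 4)           # (6-3+1, 6-3+1, C_out)
```

Here W has 4·27 = 108 parameters regardless of the input resolution, illustrating why the convolutional parameter matrix is much smaller than a fully connected one.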
  • hyper-parameters involved in a convolutional layer may include a patch size or kernel size M×N, a dilation rate a×b, a stride s×t, and the number of output channels C_out.
  • the patch size or kernel size may define a shape and a size of a patch, whereas the dilation rate may refer to a gap between elements that are extracted from an input feature map to form a patch.
  • a combination of the kernel size and the dilation rate may define a receptive field of the convolutional layer, i.e., (M + (M−1)×(a−1)) × (N + (N−1)×(b−1)), which shows that a larger receptive field can be obtained by increasing the kernel size and/or the dilation rate.
  • the stride may define a gap between spatial locations from where patches are sampled.
  • the stride may define a mapping between (h_out, w_out) and (h_in, w_in), i.e., to which patch each element in an output feature map is connected.
  • a convolutional layer having a stride that is larger than 1 may result in a lower spatial resolution of an output feature map of the convolutional layer as compared to that of an input feature map of the convolutional layer.
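The hyper-parameter relations above can be captured in two small helpers. This is an illustrative sketch (padding is assumed to be zero; the function names are not from the patent):

```python
# Hedged helpers for the hyper-parameter relations above: the receptive field
# from the kernel size M x N and dilation rate a x b, and the output
# resolution from the stride s x t (no padding assumed).
def receptive_field(M, N, a, b):
    return (M + (M - 1) * (a - 1), N + (N - 1) * (b - 1))

def output_size(H_in, W_in, M, N, a, b, s, t):
    rf_h, rf_w = receptive_field(M, N, a, b)
    return ((H_in - rf_h) // s + 1, (W_in - rf_w) // t + 1)

# A 3x3 kernel with dilation rate 2x2 has a 5x5 receptive field, and a
# stride of 2x2 reduces the output resolution relative to stride 1x1.
assert receptive_field(3, 3, 2, 2) == (5, 5)
assert output_size(11, 11, 3, 3, 1, 1, 2, 2) == (5, 5)
assert output_size(11, 11, 3, 3, 1, 1, 1, 1) == (9, 9)
```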
  • the pooling layer 304 may be configured to gradually reduce a size of a representation space and thus to reduce the number of parameters and an amount of computations in the neural network 300, thereby controlling or avoiding over-fitting.
  • the pooling layer 304 may be a parameter-free layer, and perform a calculation of a predefined or fixed function on its input according to a type of the pooling layer 304. For example, a max-pooling layer may perform a maximum operation on its input, and an average-pooling layer may perform an average (AVG) operation on its input.
  • the pooling layer 304 may operate on each feature map independently.
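Both pooling variants above apply the same fixed, parameter-free reduction over windows; only the reduction function differs. A hedged sketch for non-overlapping 2×2 windows (the window size is an assumption for illustration):

```python
import numpy as np

# Hedged sketch of parameter-free pooling: a fixed function (max or mean)
# is applied to non-overlapping 2x2 windows of each feature map independently.
def pool2x2(x, op):
    H, W, C = x.shape
    windows = x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2, C)
    return op(windows, axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4, 1)
# first window covers {0, 1, 4, 5}
assert pool2x2(x, np.max)[0, 0, 0] == 5.0
assert pool2x2(x, np.mean)[0, 0, 0] == 2.5
assert pool2x2(x, np.max).shape == (2, 2, 1)   # spatial resolution halved
```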
  • the neural network 300 may further include one or more activation layers 308 (also called rectified linear unit layers 308) to enable the neural network 300 to model non-linear problems (only one activation layer is shown in FIG. 3 for the sake of simplicity).
  • each activation layer 308 may apply a non-linear activation function on its input to achieve a non-linear input-to-output conversion.
  • the neural network 300 may further include one or more batch normalization layers 310 (only one batch normalization layer is shown in FIG. 3 for the sake of simplicity).
  • Each batch normalization layer 310 may be configured to normalize an output of a previous layer of the neural network 300 by subtracting a batch mean and dividing the output by a batch standard deviation, thus increasing the stability of the neural network 300.
  • the batch normalization layer 310 may add two trainable parameters (a standard deviation parameter and a mean parameter) to each layer, so that a normalized output may be multiplied by the standard deviation parameter and added with the mean parameter.
  • the batch normalization layer 310 may be configured to normalize a distribution of each element of an input (e.g., an input vector) across batch data, and to reduce over-fitting through regularization effects, which can alleviate internal covariate shift problems.
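The normalize-then-rescale step described above can be sketched in a few lines. This is a hedged sketch of the forward pass only (the trainable parameters are conventionally called gamma and beta, names not used in the text above):

```python
import numpy as np

# Hedged sketch of batch normalization: subtract the batch mean, divide by
# the batch standard deviation, then scale by the trainable standard-deviation
# parameter (gamma) and shift by the trainable mean parameter (beta).
def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance per element
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 8))   # batch of 64 vectors
y = batch_norm(x, gamma=1.0, beta=0.0)
assert np.allclose(y.mean(axis=0), 0.0, atol=1e-6)
assert np.allclose(y.std(axis=0), 1.0, atol=1e-2)
```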
  • the over-parameterization system 102 may perform training of the neural network 300 to optimize model parameters of the neural network 300 in order to achieve a desired performance (e.g., a high accuracy of inference or prediction) of the neural network 300.
  • the training of the neural network 300 may include a process of finding desired or optimal parameters of the neural network 300 in a predefined hypothesis space to obtain a desired or optimal performance.
  • the over-parameterization system 102 may select an initialization method, and initialize the parameters of the neural network 300 according to the selected initialization method.
  • the initialization method may include, but is not limited to, initializing the parameters of the neural network 300 with constants (e.g., zeros, ones, or a specified constant), initializing the parameters of the neural network 300 with random values from a predefined distribution (e.g., a normal distribution, a uniform distribution, etc.), initializing the parameters of the neural network 300 based on a specified initialization scheme (such as Xavier initialization, He initialization, etc.), etc.
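Two of the initialization options listed above can be sketched as follows; the Xavier variant shown is the common uniform form scaled by fan-in and fan-out (an assumption, since the text does not specify which form is used):

```python
import numpy as np

# Hedged sketch of two initialization methods: constant initialization and
# Xavier (Glorot) uniform initialization for a weight matrix of shape
# (fan_out, fan_in).
def constant_init(shape, value=0.0):
    return np.full(shape, value)

def xavier_uniform(shape, rng):
    fan_out, fan_in = shape
    limit = np.sqrt(6.0 / (fan_in + fan_out))      # keeps activation variance stable
    return rng.uniform(-limit, limit, size=shape)

rng = np.random.default_rng(0)
W = xavier_uniform((256, 128), rng)
assert W.shape == (256, 128)
assert np.abs(W).max() <= np.sqrt(6.0 / (128 + 256))
assert np.all(constant_init((4, 4)) == 0.0)
```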
  • the over-parameterization system 102 may perform forward propagation or feed-forward propagation to pass training inputs (such as a plurality of training images with known objects in image or object recognition, for example) to the neural network 300 and obtain respective estimated outputs from the neural network 300.
  • the over-parameterization system 102 may then compute the performance of the neural network 300 based on a loss function or an error function (e.g., an accuracy of the neural network 300 based on a comparison between the estimated outputs and the corresponding known objects of the plurality of training images in the image or object recognition in this example).
  • the over-parameterization system 102 may compute a derivative of the loss function, for example, to determine error information of the training inputs that is obtained under current values of the parameters of the neural network 300.
  • the over-parameterization system 102 may perform backward propagation or back-propagation to propagate the error information backward through the neural network 300, and adjust or update the values of the parameters of the neural network 300 according to a gradient descent algorithm, for example.
  • the over-parameterization system 102 may continue to iterate the foregoing operations (i.e., from performing the forward propagation to adjusting or updating the values of the parameters of the neural network 300) until the values of the parameters of the neural network 300 converge, or until a predefined number of iterations is reached.
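The iteration described above (forward propagation, loss, derivative, gradient-descent update, repeat until convergence or an iteration cap) can be sketched for a single linear layer with a squared-error loss; the layer and loss are illustrative stand-ins for the full network:

```python
import numpy as np

# Hedged sketch of the training loop described above, for one linear layer.
rng = np.random.default_rng(0)
X = rng.normal(size=(128, 4))
W_true = rng.normal(size=(4, 1))
y = X @ W_true                             # synthetic training targets

W = np.zeros((4, 1))                       # initialization
lr, max_iters = 0.1, 500
for _ in range(max_iters):
    y_hat = X @ W                          # forward propagation
    loss = ((y_hat - y) ** 2).mean()       # loss / error function
    grad = 2 * X.T @ (y_hat - y) / len(X)  # derivative of the loss
    W -= lr * grad                         # gradient-descent update
    if loss < 1e-10:                       # convergence check
        break

assert np.allclose(W, W_true, atol=1e-3)   # parameters have converged
```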
  • the over-parameterization system 102 may allow the neural network 300 to perform inferences or predictions for new inputs, e.g., new images with objects to be classified.
  • the over-parameterization system 102 may represent a single parameter matrix W of a certain layer (for example, a convolutional layer, or a fully connected layer, etc.) of a neural network as a multiplication of two parameter matrices W1 and W2, i.e., W = W1W2. One of W1 and W2 denotes a vanilla matrix of the layer, and has a same shape (i.e., a same number of rows and a same number of columns) as that of W. The other parameter matrix is an over-parameterization matrix.
  • an over-parameterization matrix may be a left-multiplication matrix or a right-multiplication matrix. Both W and W1W2 represent the same underlying linear transformation. Transforming W to W1W2 is considered as over-parameterization because the total number of parameters included in W1 and W2 is apparently more than that of W.
  • a parameter matrix may be reshaped from a tensor, e.g., a convolution kernel tensor of a convolutional layer.
  • different channels of a parameter matrix of a convolutional layer may be over-parameterized, leading to different types of channel-wise over-parameterization, such as an in-channel-wise over-parameterization, an out-channel-wise over-parameterization, or an all-channel-wise over-parameterization, for example.
  • matrices (e.g., W1 and W2) involved in over-parameterization may be optimized simultaneously in a training phase of the neural network, and may be combined together into a single parameter matrix (e.g., W) after the training.
  • over-parameterization does not increase the amount of computation in the inference phase.
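The collapse of the two factors after training can be illustrated with plain matrices; the shapes and the left-multiplication form below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
rows, cols = 8, 16
W1 = rng.standard_normal((rows, rows))   # over-parameterization matrix (left-multiplication)
W2 = rng.standard_normal((rows, cols))   # vanilla matrix, same shape as the collapsed matrix
x = rng.standard_normal(cols)

# During training both factors are trainable; together they represent
# one and the same linear transformation.
y_train = W1 @ (W2 @ x)

# After training the factors are combined into a single matrix, so
# inference costs no more than the un-over-parameterized layer.
W = W1 @ W2
y_infer = W @ x

assert np.allclose(y_train, y_infer)
assert W1.size + W2.size > W.size        # more parameters only during training
```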
  • the over-parameterization system 102 may perform a variety of different types of over-parameterization for a layer (a fully connected layer or a convolutional layer) of the neural network.
  • the variety of different types of over-parameterization may include, but is not limited to, a full-row over-parameterization, a full-column over-parameterization, a channel-wise over-parameterization (which may include an in-channel-wise over-parameterization, an out-channel-wise over-parameterization, an all-channel-wise over-parameterization, for example), a depth-wise over-parameterization, etc.
  • the full-row over-parameterization may include an over-parameterization operating on an entire row of a parameter matrix of a layer (e.g., a fully connected layer or a convolutional layer) of a neural network.
  • the full-column over-parameterization may include an over-parameterization operating on an entire column of a parameter matrix of a layer (e.g., a fully connected layer or a convolutional layer) of a neural network.
  • the in-channel-wise over-parameterization may include an over-parameterization that operates only on a C_in channel part of a parameter matrix (with a dimension of C_out × (M×N×C_in)) of a convolutional layer.
  • the out-channel-wise over-parameterization may include an over-parameterization that operates only on a C_out channel part of a parameter matrix (with a dimension of C_out × (M×N×C_in)) of a convolutional layer.
  • a parameter matrix of a convolutional layer may be over-parameterized by more than one over-parameterization matrix.
  • the all-channel-wise over-parameterization may include an over-parameterization that operates on both C_in and C_out channel parts of a parameter matrix (with a dimension of C_out × (M×N×C_in)) of a convolutional layer.
  • the depth-wise over-parameterization may include an over-parameterization that operates on spatial dimensions (i.e., within a channel) of a parameter matrix of a convolutional layer.
  • the depth-wise over-parameterization may employ a same over-parameterization matrix for each input channel, i.e., the over-parameterization matrix is the same for each input channel.
  • a parameter matrix for a convolutional layer may be over-parameterized on a part that corresponds to spatial dimensions of input patches, i.e., the M×N dimensions, with a shared over-parameterization matrix applied within each channel.
  • the depth-wise over-parameterization may employ different over-parameterization matrices for different input channels, i.e., an over-parameterization matrix of one channel may be different from an over-parameterization matrix of another channel. For example, a parameter matrix for a convolutional layer may be over-parameterized with a separate over-parameterization matrix for each channel.
  • the independent depth-wise over-parameterization may employ different over-parameterization matrices for at least two different input channels.
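The shared and independent variants can be contrasted with a small NumPy sketch; the sizes and the einsum formulation (used here as a stand-in for the depth-wise operator) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
M, N, C_in, D_mul = 3, 3, 4, 9
P = rng.standard_normal((M * N, C_in))    # one input patch

# Shared variant: a single (M*N) x D_mul matrix reused for every channel.
D_shared = rng.standard_normal((M * N, D_mul))
out_shared = np.einsum('md,mc->dc', D_shared, P)

# Independent variant: a separate matrix per input channel.
D_indep = rng.standard_normal((M * N, D_mul, C_in))
out_indep = np.einsum('mdc,mc->dc', D_indep, P)

# Same output shape either way; the independent variant has C_in times
# as many trainable over-parameterization parameters.
assert out_shared.shape == out_indep.shape == (D_mul, C_in)
assert D_indep.size == C_in * D_shared.size
```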
  • a convolutional layer of a neural network may process the input feature map in a sliding window manner, which may include applying a set of convolution kernels to a patch having a same size as that of the convolution kernels at each window position.
  • a patch is denoted as a 2-dimensional tensor P with a dimension of (M×N) × C_in, and trainable kernels of a convolutional layer may be represented as a 3-dimensional tensor W with a dimension of C_out × (M×N) × C_in, where M and N are spatial dimensions of the patch, C_in is the number of channels in the input feature map of the convolutional layer, and C_out is the number of channels in an output feature map of the convolutional layer.
  • FIG. 4 illustrates an example normal convolution.
  • An output of a normal convolution operator (which is represented as *) may be a C_out-dimensional feature O = W * P.
  • each of the C_in channels of the input patch tensor may be involved in D_mul separate or individual dot products, such that each input patch channel, i.e., an (M×N)-dimensional feature, is converted into a D_mul-dimensional feature. D_mul is called a depth multiplier herein.
  • FIG. 5 illustrates an example depth-wise convolution.
  • a trainable depth-wise convolution kernel may be represented as a 3-dimensional tensor D with a dimension of (M×N) × D_mul × C_in. Since each input channel may be converted into a D_mul-dimensional feature, an output of a depth-wise convolution operator (which is represented as ∘) may be a (D_mul × C_in)-dimensional feature O = D ∘ P.
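A single-patch depth-wise convolution can be sketched with these shapes; the einsum formulation below is an illustrative stand-in for the ∘ operator:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, C_in, D_mul = 3, 3, 2, 4
P = rng.standard_normal((M * N, C_in))         # input patch
D = rng.standard_normal((M * N, D_mul, C_in))  # depth-wise kernel

# Each input channel, an (M*N)-dimensional feature, takes part in
# D_mul separate dot products, producing a D_mul-dimensional feature.
out = np.einsum('mdc,mc->dc', D, P)
assert out.shape == (D_mul, C_in)

# Channel c of the output depends only on channel c of the patch.
assert np.allclose(out[:, 0], D[:, :, 0].T @ P[:, 0])
```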
  • a convolutional layer that is over-parameterized with a depth-wise over-parameterization may be a composition of a depth-wise convolution with a trainable kernel D and a normal convolution with a trainable kernel W with a dimension of C_out × D_mul × C_in, where D_mul ≥ M×N.
  • an output of a depth-wise over-parameterization operator may be the same as that of a convolutional layer, i.e., a C_out-dimensional feature O.
  • FIGS. 6A and 6B illustrate two example ways of applying a depth-wise over-parameterization operator. As shown in FIGS. 6A and 6B, the depth-wise over-parameterization operator may be applied in two mathematically equivalent ways, i.e., O = W * (D ∘ P) = (D^T ∘ W) * P.
  • the first manner, i.e., O = W * (D ∘ P), is called a feature composition as shown in FIG. 6A, and involves first applying the trainable kernel D to the input patch P by a depth-wise convolution operator ∘ to obtain a transformed feature P' = D ∘ P, and then applying the trainable kernel W to the transformed feature P' by a normal convolution operator * to obtain O.
  • the second manner, i.e., O = (D^T ∘ W) * P, is called a kernel composition as shown in FIG. 6B, and involves first applying the trainable kernel D to transform W by a depth-wise convolution operator ∘ to obtain W' = D^T ∘ W, and then applying a normal convolution operator * between W' and P to obtain O.
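The mathematical equivalence of the feature composition and the kernel composition can be verified numerically on a single window position. The NumPy einsum calls below are illustrative stand-ins for the ∘ and * operators, with shapes assumed from the definitions above (patch (M×N) × C_in, depth-wise kernel (M×N) × D_mul × C_in, convolution kernel C_out × D_mul × C_in):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, C_in, C_out, D_mul = 3, 3, 4, 8, 9        # D_mul >= M*N
P = rng.standard_normal((M * N, C_in))          # input patch
W = rng.standard_normal((C_out, D_mul, C_in))   # normal-convolution kernel
D = rng.standard_normal((M * N, D_mul, C_in))   # depth-wise kernel

# Feature composition: transform the patch first, then convolve.
P_t = np.einsum('mdc,mc->dc', D, P)             # D o P
o_feature = np.einsum('odc,dc->o', W, P_t)      # W * (D o P)

# Kernel composition: fold D into W first, then convolve the raw patch.
W_t = np.einsum('mdc,odc->omc', D, W)           # D^T o W
o_kernel = np.einsum('omc,mc->o', W_t, P)       # (D^T o W) * P

assert o_feature.shape == (C_out,)
assert np.allclose(o_feature, o_kernel)
```

Both paths produce the same C_out-dimensional output; they differ only in where the extra depth-wise multiplication is spent.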
  • a receptive field of a depth-wise over-parameterized convolutional layer is M×N, and an interface of a depth-wise over-parameterized convolutional layer is the same as an interface of a normal convolutional layer. Therefore, a depth-wise over-parameterized convolutional layer may easily replace a normal convolutional layer in a neural network. Since a depth-wise over-parameterization operator is differentiable, both D and W of a depth-wise over-parameterized convolutional layer may be optimized using, for example, a gradient descent based optimizer that is commonly used for training neural networks (such as CNNs) with convolutional layers.
  • the feature composition and the kernel composition used for performing the depth-wise over-parameterization operator may lead to different training efficiencies of a depth-wise over-parameterized convolutional layer, and hence of the neural network that is involved.
  • the feature composition and the kernel composition may be compared in terms of the numbers of multiply and accumulate (MACC) operations that they incur. The MACC costs for the feature composition and the kernel composition depend on values of hyper-parameters that are involved. Since H×W >> C_out and D_mul >> M×N (where H and W are spatial dimensions of the input feature map), the kernel composition may generally incur fewer MACC operations as compared to the feature composition, and an amount of memory consumed by the transformed kernel W' in the kernel composition may normally be smaller than that consumed by the transformed feature D ∘ P in the feature composition. Therefore, the kernel composition may be selected for performing the depth-wise over-parameterization operator when training the neural network.
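The MACC comparison can be made concrete with back-of-the-envelope counts for one forward pass over an H×W feature map. The layer sizes below are hypothetical, and the cost formulas are simple operation counts of the two compositions: the feature composition pays for the depth-wise transform at every window position, while the kernel composition folds the kernels once per forward pass:

```python
# Hypothetical but typical layer sizes, for illustration only.
H, W_sp, M, N, C_in, C_out, D_mul = 56, 56, 3, 3, 64, 64, 9

# Feature composition: depth-wise transform of every patch, then the
# normal convolution on the transformed D_mul-sized features.
macc_feature = H * W_sp * C_in * D_mul * (M * N + C_out)

# Kernel composition: fold the kernels once, then a plain convolution
# reading the raw (M*N)-sized patches.
macc_kernel = C_out * M * N * D_mul * C_in + H * W_sp * C_out * M * N * C_in

assert macc_kernel < macc_feature
```

With these sizes the kernel composition needs roughly 116M MACCs versus roughly 132M for the feature composition, matching the preference stated above.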
  • the depth-wise over-parameterization may further be applied over a depth-wise convolution, which leads to a depth-wise over-parameterized depth-wise convolutional layer.
  • FIGS. 7A and 7B illustrate example ways of obtaining a depth-wise over-parameterized depth-wise convolutional layer.
  • an operator of depth-wise over-parameterization over depth-wise convolution may be obtained or computed in two mathematically equivalent ways, analogous to the feature composition and the kernel composition described above.
  • training of a neural network including a depth-wise over-parameterized depth-wise convolutional layer may be similar to training of a neural network including a depth-wise over-parameterized convolutional layer, and both D and W of the depth-wise over-parameterized depth-wise convolutional layer may be optimized using, for example, a gradient descent based optimizer that is commonly used for training neural networks (such as CNNs) with convolutional layers. After training of the neural network is completed, D and W may be combined to obtain a single matrix W', which may then be used for making inferences.
  • FIG. 8 shows a schematic diagram depicting an example method of over-parameterization.
  • the method of FIG. 8 may, but need not, be implemented in the environment of FIG. 1 and using the system of FIG. 2, with reference to the neural network of FIG. 3, and the convolutions of FIGS. 4-7.
  • the method 800 is described with reference to FIGS. 1-7. However, the method 800 may alternatively be implemented in other environments and/or using other systems.
  • the method 800 is described in the general context of computer-executable instructions.
  • computer-executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, and the like that perform particular functions or implement particular abstract data types.
  • each of the example methods is illustrated as a collection of blocks in a logical flow graph representing a sequence of operations that can be implemented in hardware, software, firmware, or a combination thereof.
  • the order in which the method is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method, or alternate methods. Additionally, individual blocks may be omitted from the method without departing from the spirit and scope of the subject matter described herein.
  • the blocks represent computer instructions that, when executed by one or more processors, perform the recited operations.
  • some or all of the blocks may represent application specific integrated circuits (ASICs) or other physical components that perform the recited operations.
  • the over-parameterization system 102 may obtain information of a neural network that includes one or more convolutional layers and one or more other layers.
  • the over-parameterization system 102 may receive or obtain information of a neural network to be trained from a database (such as the parameter database 210), or a client device (such as the client device 108).
  • the information of the neural network may include, but is not limited to, a type of the neural network, hyper-parameters of the neural network, initial values of the hyper-parameters of the neural network, trainable parameters of the neural network, a structure (such as the number of layers, types of layers, etc.), etc.
  • the over-parameterization system 102 may initialize the trainable parameters of the neural network randomly or based on a priori knowledge.
  • the over-parameterization system 102 or the database 210 may store information of one or more trained neural networks that are similar to the neural network to be trained.
  • the over-parameterization system 102 may initialize the trainable parameters of the neural network based at least in part on the information of the one or more trained neural networks that are similar to the neural network to be trained.
  • the neural network may include, but is not limited to, one or more convolutional layers, a plurality of fully connected layers, one or more pooling layers, one or more activation layers, one or more batch normalization layers, etc.
  • Examples of the neural network may include, but are not limited to, a convolutional neural network, or any neural networks having one or more convolutional layers, etc.
  • the over-parameterization system 102 may perform depth-wise over-parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth-wise over-parameterization convolutional layer.
  • a number of parameters that are associated with the depth-wise over-parameterization convolutional layer and are trainable is higher than a number of parameters of the at least one convolutional layer.
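The increase in trainable parameters during training (and only during training, since the matrices are later folded back) can be illustrated with a hypothetical layer size:

```python
# Hypothetical layer sizes, for illustration only.
M, N, C_in, C_out = 3, 3, 64, 64
D_mul = M * N                    # an assumed (minimum) depth multiplier

params_normal = C_out * M * N * C_in                          # plain convolutional layer
params_dw_op = (M * N) * D_mul * C_in + C_out * D_mul * C_in  # depth-wise kernel + conv kernel

assert params_dw_op > params_normal   # more trainable parameters while training
```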
  • the over-parameterization system 102 may transform a parameter matrix associated with the at least one convolutional layer into at least two separate matrices, and associate the at least two separate matrices with the depth-wise over-parameterization convolutional layer.
  • the parameter matrix associated with the at least one convolutional layer may include a plurality of channels representing a color space.
  • the over-parameterization system 102 may perform over-parameterization on spatial dimensions of a parameter matrix associated with the at least one convolutional layer to obtain at least two separate matrices according to a depth-wise over-parameterization as described in the foregoing description.
  • the over-parameterization system 102 may perform the over-parameterization independently on different channels of the plurality of channels to obtain different over-parameterization matrices for the different channels. In implementations, the over-parameterization system 102 may perform an identical over-parameterization on the different channels to obtain a same over-parameterization matrix for the different channels.
  • the over-parameterization system 102 may further perform channel-wise over-parameterization (such as in-channel-wise over-parameterization, out-channel-wise over-parameterization, or all-channel-wise over-parameterization, etc.) on the at least one convolutional layer and/or another convolutional layer of the one or more convolutional layers.
  • the over-parameterization system 102 may obtain training data, and train the neural network using the training data according to a training method.
  • the over-parameterization system 102 may obtain the training data from the training database 212, or from a designated storage location indicated or provided by the client device.
  • depending on an intended application, different training data may be used. Examples of the application may include, but are not limited to, an image classification, an object detection, or a semantic segmentation, etc.
  • the training data may include a plurality of images (which may be color images and/or grayscale images) with known results (e.g., known information of the images, such as respective classes of objects in the images, etc. ) .
  • the training method may include a variety of training or learning algorithms that may be used for training neural networks.
  • Examples of the training method may include, but are not limited to, a backward propagation algorithm, a gradient descent algorithm, or a combination thereof, etc.
  • with the depth-wise over-parameterization, the speed of convergence for obtaining optimal or desired trainable parameters of the neural network is higher, thus increasing the speed of training the neural network.
  • when initial values of hyper-parameters of a neural network are the same, the accuracy of a neural network that is trained using a depth-wise over-parameterization is found to be higher as compared to the accuracy of a trained neural network without using the depth-wise over-parameterization.
  • the over-parameterization system 102 may selectively combine the parameters associated with the depth-wise over-parameterization layer to obtain a new convolutional layer, and replace the depth-wise over-parameterization layer with the new convolutional layer in the trained neural network, the trained neural network being configured to perform inferences for a specific application based on input data.
  • the over-parameterization system 102 may combine the at least two separate matrices associated with the depth-wise over-parameterization convolutional layer into a single parameter matrix, and associate the single parameter matrix with the new convolutional layer. Since the at least two matrices are combined into a single matrix, the trained neural network that is obtained after such combination has lower computation and memory costs, and avoids extra computation and memory costs when the trained neural network is used for performing inferences or predictions in the intended application.
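The folding step can be sketched as follows, reusing the kernel-composition identity from the earlier description; the einsum calls and shapes are illustrative stand-ins, not the disclosure's notation:

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, C_in, C_out, D_mul = 3, 3, 4, 8, 9
W = rng.standard_normal((C_out, D_mul, C_in))   # trained conv kernel
D = rng.standard_normal((M * N, D_mul, C_in))   # trained depth-wise kernel

# Combine the two matrices into a single normal-convolution kernel.
W_fused = np.einsum('mdc,odc->omc', D, W)
assert W_fused.shape == (C_out, M * N, C_in)

# On any input patch, the fused kernel reproduces the over-parameterized
# layer's output at the cost of a plain convolution.
P = rng.standard_normal((M * N, C_in))
o_over = np.einsum('odc,dc->o', W, np.einsum('mdc,mc->dc', D, P))
o_fused = np.einsum('omc,mc->o', W_fused, P)
assert np.allclose(o_fused, o_over)
```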
  • some or all of the above method blocks may be implemented or performed by one or more specific processing units of the over-parameterization system 102.
  • the over-parameterization system 102 may employ a tensor processing unit, a graphics processing unit, and/or a neural processing unit to perform tensor and matrix computations, thus further improving the performance of the over-parameterization system 102, and improving the speed of training the neural network in a training phase.
  • the over-parameterization system 102 may further employ such specific processing units to perform tensor and matrix computations involved in making inferences or predictions by the trained neural network in an inference phase.
  • Clause 1 A method implemented by one or more computing devices, the method comprising: obtaining information of a neural network that includes one or more convolutional layers and one or more other layers; performing depth-wise over-parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth-wise over-parameterization convolutional layer; training the neural network using training data according to a training method; and selectively combining the parameters associated with the depth-wise over-parameterization layer to obtain a new convolutional layer, and replacing the depth-wise over-parameterization layer by the new convolutional layer in the trained neural network, the trained neural network being configured to perform inferences for a specific application based on input data.
  • Clause 2 The method of Clause 1, wherein performing the depth-wise over-parameterization comprises: transforming a parameter matrix associated with the at least one convolutional layer into at least two separate matrices; and associating the at least two separate matrices with the depth-wise over-parameterization convolutional layer.
  • Clause 3 The method of Clause 2, wherein selectively combining the parameters associated with the depth-wise over-parameterization layer to obtain the new convolutional layer comprises: combining the at least two separate matrices into a single parameter matrix; and associating the single parameter matrix with the new convolutional layer.
  • Clause 4 The method of Clause 1, wherein performing the depth-wise over-parameterization comprises performing over-parameterization on spatial dimensions of a parameter matrix associated with the at least one convolutional layer to obtain at least two separate matrices.
  • Clause 5 The method of Clause 1, wherein the parameter matrix associated with the at least one convolutional layer comprises a plurality of channels representing a color space, and performing the over-parameterization on the spatial dimensions of the parameter matrix comprises: performing the over-parameterization independently on different channels of the plurality of channels to obtain different over-parameterization matrices for the different channels; or performing an identical over-parameterization on the different channels to obtain a same over-parameterization matrix for the different channels.
  • Clause 6 The method of Clause 1, further comprising performing channel-wise over-parameterization on the at least one convolutional layer and/or another convolutional layer of the one or more convolutional layers.
  • Clause 7 The method of Clause 1, wherein the one or more other layers comprise at least one of: one or more fully connected layers, one or more pooling layers, one or more activation layers, or one or more batch normalization layers.
  • Clause 8 The method of Clause 1, wherein a number of parameters that are associated with the depth-wise over-parameterization convolutional layer and are trainable is higher than a number of parameters of the at least one convolutional layer.
  • Clause 9 The method of Clause 1, wherein the specific application comprises an image classification, an object detection, or a semantic segmentation.
  • Clause 10 One or more computer readable media storing executable instructions that, when executed by one or more processors, cause the one or more processors to perform acts comprising: obtaining information of a neural network that includes one or more convolutional layers and one or more other layers; performing depth-wise over-parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth-wise over-parameterization convolutional layer; training the neural network using training data according to a training method; and selectively combining the parameters associated with the depth-wise over-parameterization layer to obtain a new convolutional layer, and replacing the depth-wise over-parameterization layer by the new convolutional layer in the trained neural network, the trained neural network being configured to perform inferences for a specific application based on input data.
  • Clause 11 The one or more computer readable media of Clause 10, wherein performing the depth-wise over-parameterization comprises: transforming a parameter matrix associated with the at least one convolutional layer into at least two separate matrices; and associating the at least two separate matrices with the depth-wise over-parameterization convolutional layer.
  • Clause 12 The one or more computer readable media of Clause 11, wherein selectively combining the parameters associated with the depth-wise over-parameterization layer to obtain the new convolutional layer comprises: combining the at least two separate matrices into a single parameter matrix; and associating the single parameter matrix with the new convolutional layer.
  • Clause 13 The one or more computer readable media of Clause 10, wherein performing the depth-wise over-parameterization comprises performing over-parameterization on spatial dimensions of a parameter matrix associated with the at least one convolutional layer to obtain at least two separate matrices.
  • Clause 14 The one or more computer readable media of Clause 10, wherein the parameter matrix associated with the at least one convolutional layer comprises a plurality of channels representing a color space, and performing the over-parameterization on the spatial dimensions of the parameter matrix comprises: performing the over-parameterization independently on different channels of the plurality of channels to obtain different over-parameterization matrices for the different channels; or performing an identical over-parameterization on the different channels to obtain a same over-parameterization matrix for the different channels.
  • Clause 15 The one or more computer readable media of Clause 10, wherein the acts further comprise performing channel-wise over-parameterization on the at least one convolutional layer and/or another convolutional layer of the one or more convolutional layers.
  • Clause 16 The one or more computer readable media of Clause 10, wherein the one or more other layers comprise at least one of: one or more fully connected layers, one or more pooling layers, one or more activation layers, or one or more batch normalization layers.
  • Clause 17 A system comprising: one or more processors; and memory storing executable instructions that, when executed by the one or more processors, cause the one or more processors to perform acts comprising: obtaining information of a neural network that includes one or more convolutional layers and one or more other layers; performing depth-wise over-parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth-wise over-parameterization convolutional layer; training the neural network using training data according to a training method; and selectively combining the parameters associated with the depth-wise over-parameterization layer to obtain a new convolutional layer, and replacing the depth-wise over-parameterization layer by the new convolutional layer in the trained neural network, the trained neural network being configured to perform inferences for a specific application based on input data.
  • Clause 18 The system of Clause 17, wherein performing the depth-wise over-parameterization comprises: transforming a parameter matrix associated with the at least one convolutional layer into at least two separate matrices; and associating the at least two separate matrices with the depth-wise over-parameterization convolutional layer.
  • Clause 19 The system of Clause 17, wherein selectively combining the parameters associated with the depth-wise over-parameterization layer to obtain the new convolutional layer comprises: combining the at least two separate matrices into a single parameter matrix; and associating the single parameter matrix with the new convolutional layer.

Abstract

An over-parameterization may be performed on spatial dimensions of an original parameter matrix of a convolutional layer of a neural network, to convert the original parameter matrix of the convolutional layer into two parameter matrices: a first parameter matrix and a second parameter matrix. Training of the neural network may then be performed to determine values for learnable parameters of the first parameter matrix and the second parameter matrix. After training is completed, the first parameter matrix and the second parameter matrix may be combined as a single parameter matrix and treated as a single convolutional layer, which may then be used in subsequent inference applications of the neural network.

Description

DEPTH-WISE OVER-PARAMETERIZATION

BACKGROUND
Convolutional neural networks (CNNs) are a type of deep learning method capable of expressing highly complicated functions. A convolutional neural network may receive an image as an input, assign importance (i.e., learnable weights and biases) to various aspects/objects in the image, and differentiate one aspect/object from other aspects/objects. Accordingly, convolutional neural networks have been widely used in a variety of computer vision applications, such as image classification, object detection, and semantic segmentation.
A convolutional neural network is made up of a plurality of distinct layers. In general, an accuracy of a convolutional neural network relies heavily on the depth of the network, i.e., the larger the number of layers, the higher the accuracy of the network will be. However, increasing the number of layers in a convolutional neural network may increase the complexity of the network, and lead to a higher amount of computations, an increased cost (in terms of processing and storage resources) , and a longer delay in computations, etc. Furthermore, the increased complexity of the convolutional neural network may increase the time taken for convergence of different variables or parameters of the convolutional neural network (i.e., the time taken for training the convolutional neural network) , and the time taken for computing a result when a prediction or inference is subsequently performed using the convolutional neural network.
SUMMARY
This summary introduces simplified concepts of depth‐wise over‐parameterization, which will be further described below in the Detailed Description. This summary is not intended to identify essential features of the claimed subject matter, nor is it intended for use in limiting the scope of the claimed subject matter.
This disclosure describes example implementations of depth‐wise over‐parameterization. In implementations, an over‐parameterization may be performed on spatial dimensions of an original parameter matrix of a convolutional layer of a neural network, to convert the original parameter matrix of the convolutional layer into two parameter matrices: a first parameter matrix and a second parameter matrix. In implementations, training may then be performed on the neural network to determine values for learnable parameters of the first parameter matrix and the second parameter matrix. After training is completed, the first parameter matrix and the second parameter matrix may be combined as a single parameter matrix and treated as a single convolutional layer, which may then be used in subsequent inference applications of the neural network.
BRIEF DESCRIPTION OF THE DRAWINGS
The detailed description is set forth with reference to the accompanying figures. In the figures, the left‐most digit (s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items.
FIG. 1 illustrates an example environment in which an example over‐parameterization system may be used.
FIG. 2A illustrates the example over‐parameterization system in more detail.
FIG. 2B illustrates an example neural network processing architecture that can be used for implementing the example over‐parameterization system.
FIG. 2C illustrates an example cloud system that incorporates the example neural network processing architecture to implement the example over‐parameterization system.
FIG. 3 illustrates an example neural network in which an over‐parameterization may be performed.
FIG. 4 illustrates an example normal convolution.
FIG. 5 illustrates an example depth‐wise convolution.
FIGS. 6A and 6B illustrate example ways of obtaining a depth‐wise over‐parameterization operator.
FIGS. 7A and 7B illustrate example ways of obtaining a depth‐wise over‐parameterized depth‐wise convolutional layer.
FIG. 8 illustrates an example method of over‐parameterization.
DETAILED DESCRIPTION
Overview
As noted above, although existing convolutional neural networks can model complicated functions, and are useful in a variety of applications such as computer vision applications, the accuracies of the existing convolutional neural networks rely heavily on corresponding depths of the neural networks. An existing convolutional neural network therefore suffers severely from limitations due to tradeoffs between the accuracy and the speed of making an inference or prediction by the convolutional neural network.
This disclosure describes an example over‐parameterization system. In implementations, the over‐parameterization system may virtually convert a convolutional layer of a neural network into at least two linear layers, which are combined back into a single layer after training. In implementations, the over‐parameterization system may perform an over‐parameterization on spatial dimensions of a parameter matrix that represents the convolutional layer to obtain a first parameter matrix and a second parameter matrix, which virtually correspond to the two linear layers. In implementations, the first parameter matrix may constitute an additional layer that is related to depth‐wise over‐parameterization, whereas the second parameter matrix may have a size that is the same as that of the original parameter matrix of the convolutional layer.
In implementations, after obtaining the at least two linear layers, the over‐parameterization system may initialize trainable parameters of the neural network randomly or based on a priori knowledge. Examples of the trainable parameters may include, but are not limited to, additional parameters included in the first parameter matrix, parameters of the second parameter matrix, and model parameters of other layers of the neural network. In implementations, after initializing the trainable parameters of the neural network, the over‐parameterization system may then perform training of the entire neural network based on a training algorithm (e.g., a gradient descent based optimization) using training data to obtain resulting values of the trainable parameters, i.e., a trained neural network.
In implementations, after obtaining the trained neural network, the over‐parameterization system may combine the at least two linear layers back into the convolutional layer. For example, the over‐parameterization system may linearly combine the first parameter matrix and the second parameter matrix into a single parameter matrix, which represents the convolutional layer. The over‐parameterization system or another system may then employ the trained neural network to perform inferences or predictions according to an intended application (such as an image classification, etc. ) .
As described above, the over‐parameterization system involves increasing the number of trainable parameters associated with a convolutional layer of a neural network by performing an over‐parameterization on a parameter matrix of the convolutional layer to form at least two parameter matrices. Although additional trainable parameters are added, the over‐parameterization (i.e., adding a linear layer) actually helps accelerate the convergence of trainable parameters of the neural network, and thus increases the speed of training. Furthermore, since the at least two parameter matrices are combined into a single parameter matrix (i.e., at least two linear layers represented by the at least two parameter matrices are combined back into a single convolutional layer) after training, an amount of computations associated with such single convolutional layer is equivalent or similar to that of a convolutional layer without the over‐parameterization being performed, when the trained neural network is employed for performing inferences or predictions according to an intended application (such as an image classification, etc.).
In implementations, functions described herein to be performed by the over‐parameterization system may be performed by multiple separate units or services. Moreover, although in the examples described herein, the over‐parameterization system may be implemented as a combination of software and hardware implemented and distributed in multiple devices, in other examples, the over‐parameterization system may be implemented and distributed as services provided in one or more computing devices over a network and/or in a cloud computing architecture.
The application describes multiple and varied embodiments and implementations. The following section describes an example framework that is suitable for practicing various implementations. Next, the application describes example systems, devices, and processes for implementing an over‐parameterization system.
Example Environment
FIG. 1 illustrates an example environment 100 usable to implement an over‐parameterization system. The environment 100 may include an over‐parameterization system 102. In implementations, the over‐parameterization system 102 may include a plurality of servers 104‐1, 104‐2, …, 104‐N (which are collectively referred to as the servers 104) . The servers 104 may communicate data with one another via a network 106.
In implementations, each of the servers 104 may be implemented as any of a variety of computing devices, including, but not limited to, a desktop computer, a notebook or portable computer, a handheld device, a netbook, an Internet appliance, a tablet or slate computer, a mobile device (e.g., a mobile phone, a personal digital assistant, a smart phone, etc.) , a server computer, etc., or a combination thereof.
The network 106 may be a wireless or a wired network, or a combination thereof. The network 106 may be a collection of individual networks interconnected with each other and functioning as a single large network (e.g., the Internet or an intranet) . Examples of such individual networks include, but are not limited to, telephone networks, cable networks, Local Area Networks (LANs) , Wide Area Networks (WANs) , and Metropolitan Area Networks (MANs) . Further, the individual networks may be wireless or wired networks, or a combination thereof. Wired networks may include an electrical carrier connection (such as a communication cable, etc.) and/or an optical carrier or connection (such as an optical fiber connection, etc.) . Wireless networks may include, for example, a WiFi network, other radio frequency networks (e.g., Zigbee, etc.) , etc.
In implementations, the environment 100 may further include a client device 108. The client device 108 may be implemented as any of a variety of computing devices, including, but not limited to, a desktop computer, a notebook or portable computer, a handheld device, a netbook, an Internet appliance, a tablet or slate computer, a mobile device (e.g., a mobile phone, a personal digital assistant, a smart phone, etc.) , a server computer, etc., or a combination thereof.
In implementations, the over‐parameterization system 102 may receive a request for training a neural network (such as a convolutional neural network) from the client device 108. The over‐parameterization system 102 may then perform training of the neural network according to the request from the client device 108.
Example Over‐Parameterization System
FIG. 2A illustrates the over‐parameterization system 102 in more detail. In implementations, the over‐parameterization system 102 may include, but is not limited to, one or more processors 202, an input/output (I/O) interface 204, and/or a network interface 206, and memory 208. In implementations, some of the functions of the over‐parameterization system 102 may be implemented using hardware, for example, an ASIC (i.e., Application‐Specific Integrated Circuit) , an FPGA (i.e., Field‐Programmable Gate Array) , and/or other hardware.
In implementations, the processors 202 may be configured to execute instructions that are stored in the memory 208, and/or received from the I/O interface 204, and/or the network interface 206. In implementations, the processors 202 may be implemented as one or more hardware processors including, for example, a microprocessor, an application‐specific instruction‐set processor, a physics processing unit (PPU) , a central processing unit (CPU) , a graphics processing unit, a digital signal processor, a tensor processing unit, a neural processing unit, etc. Additionally or alternatively, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without  limitation, illustrative types of hardware logic components that can be used include field‐programmable gate arrays (FPGAs) , application‐specific integrated circuits (ASICs) , application‐specific standard products (ASSPs) , system‐on‐a‐chip systems (SOCs) , complex programmable logic devices (CPLDs) , etc.
The memory 208 may include computer readable media in a form of volatile memory, such as Random Access Memory (RAM) and/or non‐volatile memory, such as read only memory (ROM) or flash RAM. The memory 208 is an example of computer readable media.
The computer readable media may include a volatile or non‐volatile type, a removable or non‐removable media, which may achieve storage of information using any method or technology. The information may include a computer readable instruction, a data structure, a program module or other data. Examples of computer readable media include, but are not limited to, phase‐change memory (PRAM) , static random access memory (SRAM) , dynamic random access memory (DRAM) , other types of random‐access memory (RAM) , read‐only memory (ROM) , electronically erasable programmable read‐only memory (EEPROM) , quick flash memory or other internal storage technology, compact disk read‐only memory (CD‐ROM) , digital versatile disc (DVD) or other optical storage, magnetic cassette tape, magnetic disk storage or other magnetic storage devices, or any other non‐transmission media, which may be used to store information that may be accessed by a computing device. As defined herein, the computer readable media does not include any transitory media, such as modulated data signals and carrier waves.
Although in this example, only hardware components are described in  the over‐parameterization system 102, in other instances, the over‐parameterization system 102 may further include other hardware components and/or other software components such as program units to execute instructions stored in the memory 208 for performing various operations. For example, the over‐parameterization system 102 may further include a parameter database 210 for storing parameter data of one or more machine learning models (such as neural network models, etc. ) , a training database 212 for storing training data, and other program data 214.
FIG. 2B illustrates an example neural network processing architecture 216 that can be used for implementing the over‐parameterization system 102. In implementations, the neural network processing architecture 216 (such as the architecture used for a neural processing unit) may include a heterogeneous computation unit (HCU) 218, a host unit 220, and a host memory 222. The heterogeneous computation unit 218 may include a special‐purpose computing device or hardware used for facilitating and performing neural network computing tasks. By way of example and not limitation, the heterogeneous computation unit 218 may perform algorithmic operations including operations associated with machine learning algorithms. In implementations, the heterogeneous computation unit 218 may be an accelerator, which may include, but is not limited to, a neural network processing unit (NPU) , a graphic processing unit (GPU) , a tensor processing unit (TPU) , a microprocessor, an application‐specific instruction‐set processor, a physics processing unit (PPU) , a digital signal processor, etc.
In implementations, the heterogeneous computation unit 218 may include one or more computing units 224, a memory hierarchy 226, a controller 228, and an interconnect unit 230. The computing unit 224 may access the memory hierarchy 226 to read and write data in the memory hierarchy 226, and may further perform operations, such as arithmetic operations (e.g., multiplication, addition, multiply‐accumulate, etc. ) on the data. In implementations, the computing unit 224 may further include a plurality of engines that are configured to perform various types of operations. By way of example and not limitation, the computing unit 224 may include a scalar engine 232 and a vector engine 234. The scalar engine 232 may perform scalar operations such as scalar product, convolution, etc. The vector engine 234 may perform vector operations such as vector addition, vector product, etc.
In implementations, the memory hierarchy 226 may include an on‐chip memory (such as 4 blocks of 8GB second generation of high bandwidth memory (HBM2) ) to serve as a main memory. The memory hierarchy 226 may be configured to store data and executable instructions, and allow other components of the neural network processing architecture 216 (e.g., the heterogeneous computation unit (HCU) 218, and the host unit 220) , the heterogeneous computation unit 218 (e.g., the computing units 224 and the interconnect unit 230) , and/or a device external to the neural network processing architecture 216 to access the stored data and/or the stored instructions with high speed, for example.
In implementations, the interconnect unit 230 may provide or facilitate communications of data and/or instructions between the heterogeneous  computation unit 218 and other devices or units (e.g., the host unit 220, one or more other HCU (s) ) that are external to the heterogeneous computation unit 218. In implementations, the interconnect unit 230 may include a peripheral component interconnect express (PCIe) interface 236 and an inter‐chip connection 238. The PCIe interface 236 may provide or facilitate communications of data and/or instructions between the heterogeneous computation unit 218 and the host unit 220. The inter‐chip connection 238 may serve as an inter‐chip bus to connect the heterogeneous computation unit 218 with other devices, such as other HCUs, an off‐chip memory, and/or peripheral devices.
In implementations, the controller 228 may be configured to control and coordinate operations of other components included in the heterogeneous computation unit 218. For example, the controller 228 may control and coordinate different components in the heterogeneous computation unit 218 (such as the scalar engine 232, the vector engine 234, and/or the interconnect unit 230) to facilitate parallelism or synchronization among these components.
In implementations, the host memory 222 may be an off‐chip memory such as a memory of one or more processing units of a host system or device that includes the neural network processing architecture 216. In implementations, the host memory 222 may include a DDR memory (e.g., DDR SDRAM) or the like, and may be configured to store a large amount of data with slower access speed, as compared to an on‐chip memory that is integrated within the one or more processing units, to act as a higher‐level cache.
In implementations, the host unit 220 may include one or more processing units (e.g., an X86 central processing unit (CPU) ) . In implementations, the host system or device having the host unit 220 and the host memory 222 may further include a compiler (not shown) . The compiler may be a program or computer software configured to convert computer codes written in a certain programming language into instructions that are readable and executable by the heterogeneous computation unit 218. In machine learning applications, the compiler may perform a variety of operations, which may include, but are not limited to, pre‐processing, lexical analysis, parsing, semantic analysis, conversion of an input program to an intermediate representation, code optimization, and code generation, or any combination thereof.
FIG. 2C illustrates an example cloud system 240 that incorporates the neural network processing architecture 216 to implement the over‐parameterization system 102. The cloud system 240 may provide cloud services with machine learning and artificial intelligence (AI) capabilities, and may include a plurality of servers, e.g., servers 242‐1, 242‐2, and 242‐K (which are collectively referred to as the servers 242) , where K is a positive integer. In implementations, one or more of the servers 242 may include the neural network processing architecture 216. Using the neural network processing architecture 216, the cloud system 240 may provide part or all of the functionalities of the over‐parameterization system 102, and other machine learning and artificial intelligence capabilities such as image recognition, facial recognition, translations, 3D modeling, etc.
In implementations, although the cloud system 240 is described above, in some instances, the neural network processing architecture 216 that provides some or all of the functionalities of the over‐parameterization system 102 may be deployed in other types of computing devices, which may include, but are not limited to, a mobile device, a tablet computer, a wearable device, a desktop computer, etc.
Example Neural Network Model
FIG. 3 illustrates an example neural network model 300 (or simply a neural network 300) in which an over‐parameterization may be performed. Although a convolutional neural network model is described in this example, the present disclosure may also be applicable to other types of neural network models that involve convolution and/or neural network models having more or fewer types of layers that will be described hereinafter.
In implementations, the neural network 300 may include a plurality of building blocks or distinct types of layers. By way of example and not limitation, the plurality of building blocks or distinct types of layers may include, but are not limited to, one or more convolutional layers 302 (only one is shown for the sake of simplicity) , one or more pooling layers 304 (only one is shown for the sake of simplicity) , and a plurality of fully connected layers 306‐1, …, 306‐S (which are collectively referred to as the fully connected layers 306) , where S is an integer greater than one.
In implementations, an input and an output of a layer may be depicted as a feature map, which may be a tensor in R^(H×W×C) , where H, W, and C represent a height, a width, and a number of channels of the feature map. Dimensions of the height and the width may define a resolution of the feature map, and may be referred to as spatial dimensions. In implementations, an input feature map may be denoted as a tensor F_in ∈ R^(H_in×W_in×C_in) , where H_in, W_in, and C_in represent a height, a width, and a number of channels of the input feature map. An output feature map may be denoted as a tensor F_out ∈ R^(H_out×W_out×C_out) , where H_out, W_out, and C_out represent a height, a width, and a number of channels of the output feature map. In implementations, a layer of the neural network may be defined as a matrix which converts the input tensor F_in to the output tensor F_out, and such matrix may be denoted as a parameter matrix.
In implementations, the fully connected layer 306 may connect each element in an input tensor F_in to each element in an output tensor F_out. This may be achieved through a parameter matrix W ∈ R^((H_out·W_out·C_out)×(H_in·W_in·C_in)) . More specifically, f_out = W·f_in, where f_in ∈ R^(H_in·W_in·C_in) and f_out ∈ R^(H_out·W_out·C_out) are reshaped F_in and F_out, respectively. In implementations, a tensor can be reshaped without the content thereof being changed. For example, an original four‐dimensional tensor T ∈ R^(I×J×K×L) may be reshaped into another three‐dimensional tensor T′ ∈ R^(I×M×L) , where M=J×K can be deduced because T and T′ have the same number of elements, with T (i, j, k, l) in the original four‐dimensional tensor T and T′ (i, j×K+k, l) in the reshaped three‐dimensional tensor T′ corresponding to the same element. In implementations, a computation of a fully connected layer may be a matrix‐vector product. In implementations, a hyper‐parameter involved in a fully connected layer is the number of output elements, i.e., H_out×W_out×C_out.
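For illustration only, the matrix‐vector view of a fully connected layer described above may be sketched in NumPy; the feature‐map sizes below are hypothetical and chosen small for readability:

```python
import numpy as np

# Hypothetical small feature-map sizes for illustration.
H_in, W_in, C_in = 4, 4, 3     # input feature map
H_out, W_out, C_out = 2, 2, 5  # output feature map

rng = np.random.default_rng(0)
F_in = rng.standard_normal((H_in, W_in, C_in))

# Parameter matrix of the fully connected layer:
# shape (H_out*W_out*C_out) x (H_in*W_in*C_in).
W = rng.standard_normal((H_out * W_out * C_out, H_in * W_in * C_in))

f_in = F_in.reshape(-1)                      # reshape input tensor to a vector
f_out = W @ f_in                             # matrix-vector product
F_out = f_out.reshape(H_out, W_out, C_out)   # reshape back to a feature map

print(F_out.shape)  # (2, 2, 5)
```

Note how the parameter matrix size grows with both the input and the output resolutions, which is why such a layer is only applied after the spatial resolution has been reduced.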
In implementations, due to the full connection nature of a fully connected layer, a parameter matrix W ∈ R^((H_out·W_out·C_out)×(H_in·W_in·C_in)) for the fully connected layer can be very large, which normally requires a large amount of computing and memory resources. Therefore, the plurality of fully connected layers 306 may be applied at the end of the neural network 300, after a size (e.g., a spatial resolution) of an input feature map to a first fully connected layer of the plurality of fully connected layers 306 is significantly reduced from an initial value (e.g., an original size or resolution of an image in an image recognition application, etc.) at the beginning of the neural network 300.
In implementations, in order to reduce a size of a parameter matrix for each fully connected layer 306 at the end of the neural network 300, the one or more convolutional layers 302 may be applied at the beginning of the neural network 300. In a convolutional layer 302, instead of connecting each element in an input tensor F_in to each element in an output tensor F_out, patches of F_in may be connected to elements in F_out by a convolution operator. For instance, if a patch of a spatial size of M×N that is sampled from a spatial location (h_in, w_in) of F_in is denoted as P ∈ R^(M×N×C_in) , and a parameter matrix of a convolutional layer is denoted as W ∈ R^(C_out×(M×N×C_in)) , an output of the convolutional layer may be depicted as:

F_out (h_out, w_out, c_out) = Σ_(m, n, c_in) W (c_out, m, n, c_in) × P (m, n, c_in)

In implementations, each element in F_out may be connected to one patch in F_in, and thus a size of the parameter matrix may not depend on a spatial resolution of an input feature map of the convolutional layer, and may depend on the size of the patch and input and output channels. As compared with a parameter matrix for a fully connected layer, a parameter matrix of a convolutional layer is much smaller, thus reducing the number of parameters to be trained or optimized.
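For illustration only, the patch‐based computation above may be sketched in NumPy; the sizes are hypothetical, and the sketch assumes a stride of 1 with no padding:

```python
import numpy as np

rng = np.random.default_rng(0)
H_in, W_in, C_in = 6, 6, 3
M, N, C_out = 3, 3, 4                        # 3x3 patches, 4 output channels

F_in = rng.standard_normal((H_in, W_in, C_in))
W = rng.standard_normal((C_out, M, N, C_in))  # parameter tensor of the layer

H_out, W_out = H_in - M + 1, W_in - N + 1     # stride 1, no padding
F_out = np.zeros((H_out, W_out, C_out))
for h in range(H_out):
    for w in range(W_out):
        patch = F_in[h:h + M, w:w + N, :]     # M x N x C_in patch
        # Each output element is an inner product of the patch
        # with one of the C_out filters of W.
        F_out[h, w] = np.tensordot(W, patch, axes=([1, 2, 3], [0, 1, 2]))

print(F_out.shape)  # (4, 4, 4)
```

The same filters are applied at every spatial location, which is why the parameter count is independent of the input resolution.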
In implementations, hyper‐parameters involved in a convolutional layer may include a patch size or kernel size M×N, a dilation rate a×b, a stride s×t, and the number of output channels C_out. In implementations, the patch size or kernel size may define a shape and a size of a patch, whereas the dilation rate may refer to a gap between elements that are extracted from an input feature map to form a patch. A combination of the kernel size and the dilation rate may define a receptive field of the convolutional layer, i.e., (M+ (M-1) × (a-1) ) × (N+ (N-1) × (b-1) ) , which shows that a larger receptive field can be obtained by increasing the kernel size and/or the dilation rate. In implementations, the larger the receptive field is, the better the capability of the convolutional layer may be. The stride may define a gap between spatial locations from where patches are sampled. The stride may further define a relationship between H_in×W_in and H_out×W_out as H_out×W_out=H_in/s×W_in/t, or, with a padding size p taken into account, H_out=⌊ (H_in+2p- (M+ (M-1) × (a-1) ) ) /s⌋+1 and W_out=⌊ (W_in+2p- (N+ (N-1) × (b-1) ) ) /t⌋+1. In other words, the stride may define a mapping between (h_out, w_out) and (h_in, w_in) , i.e., to which patch each element in an output feature map is connected. In implementations, a convolutional layer having a stride that is larger than 1 may result in a lower spatial resolution of an output feature map of the convolutional layer as compared to that of an input feature map of the convolutional layer.
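For illustration only, the spatial‐size relationship above may be sketched as a small helper function; the function name and the example sizes are hypothetical, not part of the disclosure:

```python
def conv_output_size(in_size, kernel, stride, dilation=1, padding=0):
    """Output spatial size of a convolution along one dimension.

    Uses the effective (dilated) kernel size
    kernel + (kernel - 1) * (dilation - 1).
    """
    effective_kernel = kernel + (kernel - 1) * (dilation - 1)
    return (in_size + 2 * padding - effective_kernel) // stride + 1

# A 3x3 kernel with stride 2 and padding 1 halves a 224-wide input.
print(conv_output_size(224, kernel=3, stride=2, padding=1))  # 112
```

With a dilation rate of 2 and padding of 2, the same 3‐wide kernel at stride 1 preserves the input size, since the effective kernel width becomes 5.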
In implementations, the pooling layer 304 may be configured to  gradually reduce a size of a representation space and thus to reduce the number of parameters and an amount of computations in the neural network 300, thereby controlling or avoiding over‐fitting. In implementations, the pooling layer 304 may be a parameter‐free layer, and perform a calculation of a predefined or fixed function on its input according to a type of the pooling layer 304. For example, a max‐pooling layer may perform a maximum operation on its input, and an average‐pooling layer may perform an average (AVG) operation on its input. In implementations, the pooling layer 304 may operate on each feature map independently.
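For illustration only, parameter‐free max‐pooling and average‐pooling may be sketched in NumPy; the 2×2 window, the stride of 2, and the input values below are hypothetical:

```python
import numpy as np

def pool2x2(feature_map, mode="max"):
    """Parameter-free 2x2 pooling with stride 2, applied per channel.

    feature_map: array of shape (H, W, C) with even H and W.
    """
    H, W, C = feature_map.shape
    # Group each non-overlapping 2x2 window, then reduce over the window.
    windows = feature_map.reshape(H // 2, 2, W // 2, 2, C)
    if mode == "max":
        return windows.max(axis=(1, 3))
    return windows.mean(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4, 1)
print(pool2x2(x, "max")[:, :, 0].tolist())  # [[5.0, 7.0], [13.0, 15.0]]
print(pool2x2(x, "avg")[:, :, 0].tolist())  # [[2.5, 4.5], [10.5, 12.5]]
```

Either variant reduces a 4×4 feature map to 2×2, shrinking the representation without introducing any trainable parameters.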
In implementations, due to the non‐linear nature of a number of real‐world problems, the neural network 300 may further include one or more activation layers 308 (or called as one or more rectified linear unit layers 308) to enable the neural network 300 to model such non‐linear problems (FIG. 3 shows only one activation layer for the sake of simplicity) . In implementations, each activation layer 308 may apply a non‐linear activation function on its input to achieve a non‐linear input‐to‐output conversion. Examples of the non‐linear activation function that may be used by the activation layer 308 may include, but are not limited to, a sigmoid function (e.g., f (x) =1/ (1+e^ (-x) ) ) , a tanh function (e.g., f (x) = (e^x-e^ (-x) ) / (e^x+e^ (-x) ) ) , a relu function (e.g., f (x) =max (0, x) ) , etc.
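For illustration only, the three activation functions named above may be sketched directly from their formulas in NumPy:

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)), squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # f(x) = (e^x - e^(-x)) / (e^x + e^(-x)), squashes inputs into (-1, 1)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    # f(x) = max(0, x), zeroes out negative inputs
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))  # values in (0, 1), with sigmoid(0) = 0.5
print(tanh(x))     # values in (-1, 1), with tanh(0) = 0
print(relu(x))     # [0. 0. 2.]
```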
In implementations, the neural network 300 may further include one or more batch normalization layers 310 (FIG. 3 shows only one batch normalization layer for the sake of simplicity) . Each batch normalization layer 310 may be configured to normalize an output of a previous layer of the neural network 300 by subtracting a batch mean and dividing the output by a batch standard deviation, thus increasing the stability of the neural network 300. In implementations, the batch normalization layer 310 may add two trainable parameters (a standard deviation parameter and a mean parameter) to each layer, so that a normalized output may be multiplied by the standard deviation parameter and added with the mean parameter.
In implementations, the batch normalization layer 310 may be configured to normalize a distribution of each element of an input (e.g., an input vector) across batch data, and to reduce over‐fitting through regularization effects, which can alleviate internal covariate shift problems.
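For illustration only, the batch normalization computation above may be sketched in NumPy; the batch shape and the choice of gamma (the trainable scale) and beta (the trainable shift) are hypothetical:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each element across the batch dimension (axis 0),
    then scale by gamma and shift by beta (the two trainable parameters)."""
    mean = x.mean(axis=0)
    std = np.sqrt(x.var(axis=0) + eps)
    x_hat = (x - mean) / std
    return gamma * x_hat + beta

batch = np.random.default_rng(1).standard_normal((32, 8))  # 32 samples, 8 features
out = batch_norm(batch, gamma=np.ones(8), beta=np.zeros(8))

# With gamma = 1 and beta = 0, each feature becomes
# approximately zero-mean and unit-variance across the batch.
print(out.mean(axis=0))
```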
In implementations, the over‐parameterization system 102 may perform training of the neural network 300 to optimize model parameters of the neural network 300 in order to achieve a desired performance (e.g., a high accuracy of inference or prediction) of the neural network 300. In implementations, the training of the neural network 300 may include a process of finding desired or optimal parameters of the neural network 300 in a predefined hypothesis space to obtain a desired or optimal performance. In implementations, the over‐parameterization system 102 may select an initialization method, and initialize the parameters of the neural network 300 according to the selected initialization method. By way of example and not limitation, the initialization method may include, but is not limited to, initializing the parameters of the neural network 300 with constants (e.g., zeros, ones, or a specified constant) , initializing the parameters of the neural network 300 with random values from a predefined distribution (e.g., a normal distribution, a uniform distribution, etc.) , initializing the parameters of the neural network 300 based on a specified initialization (such as a Xavier initialization, a He initialization, etc.) , etc.
In implementations, after initializing the parameters of the neural network 300, the over‐parameterization system 102 may perform forward propagation or feed‐forward propagation to pass training inputs (such as a plurality of training images with known objects in image or object recognition, for example) through the neural network 300 and obtain respective estimated outputs from the neural network 300. The over‐parameterization system 102 may then compute the performance of the neural network 300 based on a loss function or an error function (e.g., an accuracy of the neural network 300 based on a comparison between the estimated outputs and the corresponding known objects of the plurality of training images in the image or object recognition in this example) .
In implementations, the over‐parameterization system 102 may compute a derivative of the loss function, for example, to determine error information of the training inputs that is obtained under current values of the parameters of the neural network 300. The over‐parameterization system 102 may perform backward propagation or back‐propagation to propagate the error information backward through the neural network 300, and adjust or update the values of the parameters of the neural network 300 according to a gradient descent algorithm, for example. The over‐parameterization system 102 may continue to iterate the foregoing operations (i.e., from performing the forward propagation to adjusting or updating the values of the parameters of the neural network 300) until the values of the parameters of the neural network 300 converge, or until a predefined number of iterations is reached.
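For illustration only, the iterate‐until‐convergence procedure above (forward propagation, loss computation, derivative of the loss, and gradient‐descent updates) may be sketched on a toy single‐layer model in NumPy; all sizes, the learning rate, and the iteration count are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: recover W_true from (input, target) pairs.
W_true = rng.standard_normal((3, 5))
X = rng.standard_normal((64, 5))           # training inputs
Y = X @ W_true.T                           # known targets

W = rng.standard_normal((3, 5)) * 0.1      # randomly initialized parameters
lr = 0.5                                   # learning rate
for step in range(200):
    pred = X @ W.T                         # forward propagation
    err = pred - Y                         # error under current parameters
    loss = (err ** 2).mean()               # loss function (mean squared error)
    grad = 2.0 * err.T @ X / err.size      # derivative of the loss w.r.t. W
    W -= lr * grad                         # gradient-descent update

print(round(loss, 6))  # 0.0 after convergence
```

A full network repeats the same loop, with back‐propagation distributing the derivative of the loss across all layers.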
In implementations, after the neural network 300 is trained and desired or optimal values of the parameters of the neural network 300 are obtained, the over‐parameterization system 102 may allow the neural network 300 to perform inferences or predictions for new inputs, e.g., new images with objects to be classified, for example.
Example Types of Over‐Parameterization
In implementations, the over‐parameterization system 102 may represent a single parameter matrix W of a certain layer (for example, a convolutional layer, or a fully connected layer, etc.) of a neural network as a multiplication of two parameter matrices, e.g., D and W′. In other words, W=D·W′. W′ denotes a vanilla matrix of the layer, and has a same shape (i.e., a same number of rows and a same number of columns) as that of W. The other parameter matrix D is an over‐parameterization matrix. In implementations, an over‐parameterization matrix may be a left‐multiplication matrix or a right‐multiplication matrix. Both W and D·W′ represent the same underlying linear transformation, transforming f_in to f_out. Representing W as D and W′ is considered as over‐parameterization because the total number of parameters included in D and W′ is apparently more than that of W.
In implementations, for a convolutional layer of a neural network, a parameter matrix
Figure PCTCN2020097221-appb-000055
may be reshaped from a tensor in 
Figure PCTCN2020097221-appb-000056
In implementations, different channels of a parameter matrix of a convolutional layer may be over-parameterized, leading to different types of channel-wise over-parameterization, such as an in-channel-wise over-parameterization, an out-channel-wise over-parameterization, and an all-channel-wise over-parameterization, for example.
In implementations, in an over-parameterized layer (such as an over-parameterized convolutional layer), the matrices (e.g., P and W') involved in over-parameterization may be optimized simultaneously in a training phase of the neural network, and may be combined together into a single parameter matrix (e.g., W = P × W') after the training. As a result, W, instead of P and W', may be used for performing inferences in an inference phase, thus resulting in a same amount of computations as that of a conventional layer without over-parameterization. In other words, over-parameterization does not increase any amount of computations in the inference phase.
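The train-with-two-matrices, fold-to-one-matrix-for-inference idea can be sketched as follows (a NumPy sketch with illustrative shapes, not values from the disclosure):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 18                          # illustrative shape of the parameter matrix W

# Training-time parameters: a vanilla matrix W' and an over-parameterization matrix P.
W_vanilla = rng.normal(size=(m, n))
P = rng.normal(size=(m, m))           # left-multiplication over-parameterization matrix

# More trainable parameters than a plain m-by-n matrix, but the same transformation.
assert P.size + W_vanilla.size > m * n

# After training, fold the two matrices into the single inference-time matrix W.
W_folded = P @ W_vanilla

# Inference with W_folded matches the training-time two-matrix computation exactly,
# so the inference-phase cost equals that of a layer without over-parameterization.
x = rng.normal(size=n)
assert np.allclose(W_folded @ x, P @ (W_vanilla @ x))
```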
In implementations, the over‐parameterization system 102 may perform a variety of different types of over‐parameterization for a layer (a fully connected layer or a convolutional layer) of the neural network. The variety of different types of over‐parameterization may include, but is not limited to, a full‐row over‐parameterization, a full‐column over‐parameterization, a channel‐wise over‐parameterization (which may include an in‐channel‐wise over‐parameterization, an out‐channel‐wise over‐parameterization, and an all‐channel‐wise over‐parameterization, for example), a depth‐wise over‐parameterization, etc.
For example, the full-row over-parameterization may include an over-parameterization operating on an entire row of a parameter matrix of a layer (e.g., a fully connected layer or a convolutional layer) of a neural network, i.e., W = W' × P, where W, W' ∈ R^(m×n) and the square over-parameterization matrix P ∈ R^(n×n) transforms each entire row of the vanilla matrix W'. The full-column over-parameterization may include an over-parameterization operating on an entire column of a parameter matrix of a layer (e.g., a fully connected layer or a convolutional layer) of a neural network, i.e., W = P × W', where the square over-parameterization matrix P ∈ R^(m×m) transforms each entire column of the vanilla matrix W'.
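One plausible reading of the two variants, sketched in NumPy (which side the square over-parameterization matrix multiplies on is an assumption of this sketch, not taken from the disclosure):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 6
W_vanilla = rng.normal(size=(m, n))   # vanilla matrix W'

# Full-row: each entire row of W' is transformed by one n-by-n square matrix.
P_row = rng.normal(size=(n, n))
W_full_row = W_vanilla @ P_row

# Full-column: each entire column of W' is transformed by one m-by-m square matrix.
P_col = rng.normal(size=(m, m))
W_full_col = P_col @ W_vanilla

# Both variants keep the shape of the original parameter matrix.
assert W_full_row.shape == W_full_col.shape == (m, n)
```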
For example, the in-channel-wise over-parameterization may include an over-parameterization that operates only on a C_in channel part of a parameter matrix (with dimension of C_out × (M×N×C_in)) of a convolutional layer. For example, the in-channel-wise over-parameterization may be expressed as follows:

W_(co, (m, n, ·)) = W'_(co, (m, n, ·)) × P^in, for each output channel co and each spatial position (m, n),

where W is over-parameterized with an over-parameterization matrix P^in ∈ R^(C_in×C_in) and a vanilla matrix W' ∈ R^(C_out×(M×N×C_in)).
In implementations, the out-channel-wise over-parameterization may include an over-parameterization that operates only on a C_out channel part of a parameter matrix (with dimension of C_out × (M×N×C_in)) of a convolutional layer. For example, the out-channel-wise over-parameterization may be expressed as follows:

W = P^out × W',

where W is over-parameterized with an over-parameterization matrix P^out ∈ R^(C_out×C_out) and a vanilla matrix W' ∈ R^(C_out×(M×N×C_in)).
In implementations, a parameter matrix of a convolutional layer may be over-parameterized by more than one over-parameterization matrix. By way of example and not limitation, the all-channel-wise over-parameterization may include an over-parameterization that operates on both C_in and C_out channel parts of a parameter matrix (with dimension of C_out × (M×N×C_in)) of a convolutional layer. For example, the all-channel-wise over-parameterization may be expressed as follows:

W_(·, (m, n, ·)) = P^out × W'_(·, (m, n, ·)) × P^in, for each spatial position (m, n),

where W is over-parameterized with an over-parameterization matrix P^out ∈ R^(C_out×C_out), an over-parameterization matrix P^in ∈ R^(C_in×C_in), and a vanilla matrix W' ∈ R^(C_out×(M×N×C_in)).
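The three channel-wise variants can be sketched with einsum over the kernel viewed as a (C_out, M, N, C_in) tensor (a NumPy sketch; the axis layout and multiplication order are assumptions, not taken from the disclosure):

```python
import numpy as np

rng = np.random.default_rng(2)
Cout, M, N, Cin = 4, 3, 3, 5

Wv = rng.normal(size=(Cout, M, N, Cin))     # vanilla parameters, viewed as a 4-D tensor
P_in = rng.normal(size=(Cin, Cin))          # in-channel-wise over-parameterization matrix
P_out = rng.normal(size=(Cout, Cout))       # out-channel-wise over-parameterization matrix

# In-channel-wise: operate only on the C_in part, per (co, m, n).
W_in = np.einsum('omnc,cd->omnd', Wv, P_in)

# Out-channel-wise: operate only on the C_out part.
W_out = np.einsum('po,omnc->pmnc', P_out, Wv)

# All-channel-wise: apply both over-parameterization matrices.
W_all = np.einsum('po,omnc,cd->pmnd', P_out, Wv, P_in)

# Every variant preserves the shape of the vanilla parameter tensor.
assert W_in.shape == W_out.shape == W_all.shape == (Cout, M, N, Cin)
```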
In implementations, the depth-wise over-parameterization may include an over-parameterization that operates on spatial dimensions (i.e., within a channel) of a parameter matrix of a convolutional layer. In implementations, the depth-wise over-parameterization may employ a same over-parameterization matrix for each input channel, i.e., P^dw ∈ R^((M×N)×(M×N)) is the same for each input channel. By way of example and not limitation, rather than applying a full-row or full-column over-parameterization (which over-parameterizes an entire row or column of a parameter matrix of a convolutional layer) or applying a channel-wise over-parameterization (which over-parameterizes channel dimension(s) of the parameter matrix of the convolutional layer), a parameter matrix W ∈ R^(C_out×(M×N×C_in)) for a convolutional layer may be over-parameterized on a part that corresponds to spatial dimensions of input patches, i.e., M×N dimensions, with P^dw ∈ R^((M×N)×(M×N)) and a vanilla matrix W' ∈ R^(C_out×(M×N×C_in)) as follows:

W_(co, (·, ci)) = P^dw × W'_(co, (·, ci)), for each output channel co and each input channel ci,

where over-parameterization is applied within each channel.
Alternatively, the depth-wise over-parameterization may employ different over-parameterization matrices for different input channels, i.e., an over-parameterization matrix (e.g., P^dw_ci) of one channel may be different from an over-parameterization matrix (e.g., P^dw_cj) of another channel. For example, a parameter matrix W ∈ R^(C_out×(M×N×C_in)) for a convolutional layer may be over-parameterized with per-channel over-parameterization matrices P^dw_ci ∈ R^((M×N)×(M×N)) and a vanilla matrix W' as follows:

W_(co, (·, ci)) = P^dw_ci × W'_(co, (·, ci)), for each output channel co and each input channel ci,

where over-parameterization is applied within each channel, and different (M×N)×(M×N) matrices are used for different input channels in this case. This type of depth-wise over-parameterization may be called an independent depth-wise over-parameterization.
Alternatively, in implementations, the independent depth‐wise over‐parameterization may employ different over‐parameterization matrices for at least two different input channels.
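A hedged NumPy sketch of the shared and the independent depth-wise variants, with the M×N spatial positions of each kernel slice flattened into one axis (shapes and axis layout are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
Cout, M, N, Cin = 4, 3, 3, 5
S = M * N                                    # spatial size of a kernel slice

Wv = rng.normal(size=(Cout, S, Cin))         # vanilla parameters, spatial dims flattened

# Shared depth-wise over-parameterization: one (M*N)-by-(M*N) matrix for all channels.
P_dw = rng.normal(size=(S, S))
W_shared = np.einsum('st,otc->osc', P_dw, Wv)

# Independent depth-wise over-parameterization: a different matrix per input channel.
P_dw_c = rng.normal(size=(Cin, S, S))
W_indep = np.einsum('cst,otc->osc', P_dw_c, Wv)

# Both variants only mix values within a channel, and keep the parameter shape.
assert W_shared.shape == W_indep.shape == (Cout, S, Cin)
```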
Example Depth‐wise Convolutional Layer
In implementations, given an input feature map, a convolutional layer of a neural network may process the input feature map in a sliding window manner, which may include applying a set of convolution kernels to a patch having a same size as that of the convolution kernels at each window position. If a patch is denoted as a 2-dimensional tensor P ∈ R^((M×N)×C_in), trainable kernels of a convolutional layer may be represented as a 3-dimensional tensor W ∈ R^((M×N)×C_in×C_out), where M and N are spatial dimensions of the patch, C_in is the number of channels in the input feature map of the convolutional layer, and C_out is the number of channels in an output feature map of the convolutional layer. In a convolutional layer, dot products may be computed between each of the C_out kernels and an entire input patch tensor P. FIG. 4 illustrates an example normal convolution. An output of a normal convolution operator (which is represented as *) may be a C_out-dimensional feature O = W * P ∈ R^(C_out), where O_co = Σ_((m, n), ci) W_((m, n), ci, co) P_((m, n), ci).
In depth-wise convolution, each of the C_in channels of the input patch tensor P may be involved in D_mul separate or individual dot products. In implementations, each input patch channel (i.e., an M×N-dimensional feature) may be transformed into a D_mul-dimensional feature. For the sake of description, D_mul is called a depth multiplier herein. FIG. 5 illustrates an example depth-wise convolution. As shown in FIG. 5, a trainable depth-wise convolution kernel may be represented as a 3-dimensional tensor D ∈ R^((M×N)×D_mul×C_in). Since each input channel may be converted into a D_mul-dimensional feature, an output of a depth-wise convolution operator (which is represented as ∘) may be a D_mul×C_in-dimensional feature O = D ∘ P ∈ R^(D_mul×C_in), where O_(d, ci) = Σ_((m, n)) D_((m, n), d, ci) P_((m, n), ci).
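At a single window position, the two operators reduce to dot products, which can be sketched in NumPy (illustrative shapes; * denotes the normal convolution operator and ∘ the depth-wise one from the text):

```python
import numpy as np

rng = np.random.default_rng(4)
M, N, Cin, Cout, Dmul = 3, 3, 5, 4, 9
S = M * N

patch = rng.normal(size=(S, Cin))            # input patch P, spatial dims flattened
W = rng.normal(size=(S, Cin, Cout))          # normal convolution kernel
D = rng.normal(size=(S, Dmul, Cin))          # depth-wise convolution kernel

# Normal convolution at one window position: one dot product per output channel,
# each taken against the entire input patch tensor.
O_conv = np.einsum('sc,sco->o', patch, W)    # shape (Cout,)

# Depth-wise convolution: each input channel yields Dmul separate dot products.
O_dw = np.einsum('sdc,sc->dc', D, patch)     # shape (Dmul, Cin)

assert O_conv.shape == (Cout,)
assert O_dw.shape == (Dmul, Cin)
```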
In implementations, a convolutional layer that is over-parameterized with a depth-wise over-parameterization (or simply called a depth-wise over-parameterized convolutional layer) may be a composition of a depth-wise convolution with a trainable kernel D ∈ R^((M×N)×D_mul×C_in) and a normal convolution with a trainable kernel W ∈ R^(D_mul×C_in×C_out), where D_mul ≥ M×N. In implementations, given an input patch P ∈ R^((M×N)×C_in), an output of a depth-wise over-parameterization operator (which is represented as ⊛) may be the same as that of a convolutional layer, a C_out-dimensional feature O = (D, W) ⊛ P ∈ R^(C_out). FIGS. 6A and 6B illustrate two example ways of a depth-wise over-parameterization operator. As shown in FIGS. 6A and 6B, the depth-wise over-parameterization operator may be applied in two mathematically equivalent ways as follows:

(D, W) ⊛ P = W * (D ∘ P) = (D^T ∘ W) * P,

where D^T ∈ R^(D_mul×(M×N)×C_in) is a transpose of D on a first axis and a second axis.
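The equivalence of the two application orders can be checked numerically (a NumPy sketch with illustrative shapes; the einsum subscripts encode an assumed axis layout):

```python
import numpy as np

rng = np.random.default_rng(5)
M, N, Cin, Cout, Dmul = 3, 3, 5, 4, 9         # Dmul >= M*N
S = M * N

patch = rng.normal(size=(S, Cin))             # input patch P
D = rng.normal(size=(S, Dmul, Cin))           # depth-wise kernel
W = rng.normal(size=(Dmul, Cin, Cout))        # normal-convolution kernel

# Feature composition: O = W * (D o P).
feat = np.einsum('sdc,sc->dc', D, patch)      # depth-wise conv, shape (Dmul, Cin)
O_feature = np.einsum('dc,dco->o', feat, W)

# Kernel composition: O = (D^T o W) * P.
D_T = D.transpose(1, 0, 2)                    # transpose on the first and second axes
W_folded = np.einsum('dsc,dco->sco', D_T, W)  # collapsed kernel, shape (S, Cin, Cout)
O_kernel = np.einsum('sc,sco->o', patch, W_folded)

# Both ways produce the same Cout-dimensional output feature.
assert np.allclose(O_feature, O_kernel)
```

The collapsed kernel `W_folded` is also the single matrix used at inference time, since it has the shape of a normal convolution kernel.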
In implementations, the first manner (i.e., O = W * (D ∘ P)) is called a feature composition as shown in FIG. 6A, and involves first applying the trainable kernel D to the input patch P by a depth-wise convolution operator ∘ to obtain a transformed feature P' = D ∘ P ∈ R^(D_mul×C_in), and then applying the trainable kernel W to the transformed feature P' by a normal convolution operator * to obtain O = W * P'. The second manner (i.e., O = (D^T ∘ W) * P) is called a kernel composition as shown in FIG. 6B, and involves first applying the trainable kernel D^T to transform W by a depth-wise convolution operator ∘ to obtain W' = D^T ∘ W ∈ R^((M×N)×C_in×C_out), and then applying a normal convolution operator * between W' and the input patch P to obtain O = W' * P.
In implementations, a receptive field of a depth-wise over-parameterized convolutional layer is M×N, and an interface of a depth-wise over-parameterized convolutional layer is the same as an interface of a normal convolutional layer. Therefore, a depth-wise over-parameterized convolutional layer may easily replace a normal convolutional layer in a neural network. Since a depth-wise over-parameterization operator is differentiable, both D and W of a depth-wise over-parameterized convolutional layer may be optimized using, for example, a gradient descent based optimizer that is commonly used for training neural networks (such as CNNs) with convolutional layers. After training of the neural network is completed, D and W may be combined to obtain a single matrix W' = D^T ∘ W, and this single matrix W' may then be used for making inferences. Since W' has the same shape as that of a kernel of a convolutional layer, computation of the depth-wise over-parameterized convolutional layer at an inference phase is the same as that of a normal convolutional layer.
In implementations, the feature composition and the kernel composition used for performing the depth-wise over-parameterization operator may lead to different training efficiencies of a depth-wise over-parameterized convolutional layer, and hence of a neural network that is involved. By way of example and not limitation, if the number of multiply and accumulate operations (MACC) is used as a metric for measuring or determining an amount of computations and serves as an efficiency indicator, respective MACC costs for the feature composition and the kernel composition, when being applied on a feature map in R^(H×W×C_in) (where H and W are the height and the width of the feature map), may be calculated as follows:

Feature composition: H×W×D_mul×C_in×(M×N + C_out)

Kernel composition: M×N×D_mul×C_in×C_out + H×W×M×N×C_in×C_out
As can be seen from the above, the MACC costs for the feature composition and the kernel composition depend on values of hyper-parameters that are involved. Since H×W >> C_out and D_mul >> M×N, the kernel composition may generally incur fewer MACC operations as compared to the feature composition, and an amount of memory consumed by the collapsed kernel D^T ∘ W in the kernel composition may normally be smaller than that consumed by the transformed features D ∘ P in the feature composition. Therefore, the kernel composition may be selected for performing the depth-wise over-parameterization operator when training the neural network.
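The comparison can be made concrete for one set of hyper-parameter values (the values, and the cost model itself, are illustrative assumptions of this sketch: the feature composition runs a depth-wise convolution at every window position, while the kernel composition folds the kernels once and then runs one normal convolution):

```python
# Illustrative hyper-parameter values (assumptions, not taken from the disclosure).
H, W_map, M, N, C_in, C_out, D_mul = 56, 56, 3, 3, 64, 64, 9

# Feature composition: a depth-wise convolution at every window position of the
# H x W feature map, followed by a normal convolution on the transformed features.
macc_feature = H * W_map * D_mul * C_in * (M * N + C_out)

# Kernel composition: fold the kernels once, then run one normal convolution.
macc_kernel = M * N * D_mul * C_in * C_out + H * W_map * M * N * C_in * C_out

# The one-off kernel folding is cheap relative to the per-position work,
# so the kernel composition incurs fewer MACC operations here.
assert macc_kernel < macc_feature
```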
In implementations, in addition to applying over a normal convolution to obtain a depth-wise over-parameterized convolutional layer, the depth-wise over-parameterization may further be allowed to apply over a depth-wise convolution, which leads to a depth-wise over-parameterized depth-wise convolutional layer. Similar to the principles used for obtaining a depth-wise over-parameterized convolutional layer, FIGS. 7A and 7B illustrate example ways of obtaining a depth-wise over-parameterized depth-wise convolutional layer. In implementations, an operator (which is represented as ⊛') of depth-wise over-parameterization over depth-wise convolution may be obtained or computed in two mathematically equivalent ways as follows:

(D, W) ⊛' P = W ∘ (D ∘ P) = (D^T ∘ W) ∘ P
In implementations, training of a neural network including a depth-wise over-parameterized depth-wise convolutional layer may be similar to training of a neural network including a depth-wise over-parameterized convolutional layer, and both D and W of the depth-wise over-parameterized depth-wise convolutional layer may be optimized using, for example, a gradient descent based optimizer that is commonly used for training neural networks (such as CNNs) with convolutional layers. After training of the neural network is completed, D and W may be combined to obtain a single matrix W' = D^T ∘ W, which may then be used for making inferences.
Example Methods
FIG. 8 shows a schematic diagram depicting an example method of over‐parameterization. The method of FIG. 8 may, but need not, be implemented in the environment of FIG. 1 and using the system of FIG. 2, with reference to the neural network of FIG. 3, and the convolutions of FIGS. 4‐7. For ease of explanation, method 800 is described with reference to FIGS. 1‐7. However, the method 800 may alternatively be implemented in other environments and/or using other systems.
The method 800 is described in the general context of computer‐executable instructions. Generally, computer‐executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, and the like that perform particular functions or implement particular abstract data types. Furthermore, each of the example methods is illustrated as a collection of blocks in a logical flow graph representing a sequence of operations that can be implemented in hardware, software, firmware, or a combination thereof. The order in which the method is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method, or alternate methods. Additionally, individual blocks may be omitted from the method without departing from the spirit and scope of the subject matter described herein. In the context of software, the blocks represent computer instructions that, when executed by one or more processors, perform the recited operations. In the context of hardware, some or all of the blocks may represent application specific integrated circuits (ASICs) or other physical components that perform the recited operations.
Referring back to FIG. 8, at block 802, the over‐parameterization system 102 may obtain information of a neural network that includes one or more convolutional layers and one or more other layers.
In implementations, the over‐parameterization system 102 may receive or obtain information of a neural network to be trained from a database (such as the parameter database 210) , or a client device (such as the client device 108) . In implementations, the information of the neural network may include, but is not limited to, a type of the neural network, hyper‐parameters of the neural network, initial values of the hyper‐parameters of the neural network, trainable parameters of the neural network, a structure (such as the number of layers, types of layers, etc. ) , etc.
In implementations, after receiving or obtaining the information of the neural network, the over‐parameterization system 102 may initialize the trainable parameters of the neural network randomly or based on a priori  knowledge. In implementations, the over‐parameterization system 102 or the database 210 may store information of one or more trained neural networks that are similar to the neural network to be trained. The over‐parameterization system 102 may initialize the trainable parameters of the neural network based at least in part on the information of the one or more trained neural networks that are similar to the neural network to be trained.
In implementations, the neural network may include, but is not limited to, one or more convolutional layers, one or more fully connected layers, one or more pooling layers, one or more activation layers, one or more batch normalization layers, etc. In implementations, examples of the neural network may include, but are not limited to, a convolutional neural network, or any neural network having one or more convolutional layers, etc.
At block 804, the over‐parameterization system 102 may perform depth‐wise over‐parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth‐wise over‐parameterization convolutional layer.
In implementations, the over‐parameterization system 102 may perform depth‐wise over‐parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth‐wise over‐parameterization convolutional layer. In implementations, a number of parameters that are associated with the depth‐wise over‐parameterization convolutional layer and are trainable is higher as compared to a number of parameters of the at least one convolutional layer. By way of example and not limitation, the over‐parameterization system 102 may transform a parameter matrix associated with the at least one convolutional layer into at least two separate matrices, and associate the at least two separate matrices with the depth‐wise over‐parameterization convolutional layer. In implementations, the parameter matrix associated with the at least one convolutional layer may include a plurality of channels representing a color space.
In implementations, the over‐parameterization system 102 may perform over‐parameterization on spatial dimensions of a parameter matrix associated with the at least one convolutional layer to obtain at least two separate matrices according to a depth‐wise over‐parameterization as described in the foregoing description.
Additionally, the over‐parameterization system 102 may perform the over‐parameterization independently on different channels of the plurality of channels to obtain different over‐parameterization matrices for the different channels. In implementations, the over‐parameterization system 102 may perform an identical over‐parameterization on the different channels to obtain a same over‐parameterization matrix for the different channels.
Additionally, the over‐parameterization system 102 may further perform channel‐wise over‐parameterization (such as in‐channel‐wise over‐parameterization, out‐channel‐wise over‐parameterization, or all‐channel‐wise over‐parameterization, etc. ) on the at least one convolutional layer and/or another convolutional layer of the one or more convolutional layers.
At block 806, the over‐parameterization system 102 may train the neural network using training data according to a training method.
After performing the over‐parameterization on the at least one convolutional layer, the over‐parameterization system 102 may obtain training data, and train the neural network using the training data according to a training method. In implementations, the over‐parameterization system 102 may obtain training data from the training database 212, or from a designated storage location indicated or provided by the client device. Depending on the application for which the neural network is intended, different training data may be used. Examples of the application may include, but are not limited to, an image classification, an object detection, or a semantic segmentation, etc.
For example, if the neural network is intended for performing inferences or predictions in an image classification application, the training data may include a plurality of images (which may be color images and/or grayscale images) with known results (e.g., known information of the images, such as respective classes of objects in the images, etc. ) .
In implementations, the training method may include a variety of training or learning algorithms that may be used for training neural networks. Examples of the training method may include, but are not limited to, a backward propagation algorithm, a gradient descent algorithm, or a combination thereof, etc.
In implementations, although additional trainable parameters are added to the depth‐wise over‐parameterized convolutional layer, the speed of convergence for obtaining optimal or desired trainable parameters of the neural network is actually higher, thus increasing the training speed of the neural network. Furthermore, given that initial values of hyper‐parameters of a neural network are the same, the accuracy of a neural network that is trained using a depth‐wise over‐parameterization is found to be higher as compared to the accuracy of a trained neural network without using the depth‐wise over‐parameterization.
At block 808, the over‐parameterization system 102 may selectively combine the parameters associated with the depth‐wise over‐parameterization layer to obtain a new convolutional layer, and replace the depth‐wise over‐parameterization layer by the new convolutional layer in the trained neural network, the trained neural network being configured to perform inferences for a specific application based on input data.
In implementations, after the neural network is trained, the over‐parameterization system 102 may selectively combine the parameters associated with the depth‐wise over‐parameterization layer to obtain a new convolutional layer, and replace the depth‐wise over‐parameterization layer by the new convolutional layer in the trained neural network. By way of example and not limitation, the over‐parameterization system 102 may combine the at least two separate matrices associated with the depth‐wise over‐parameterization convolutional layer into a single parameter matrix, and associate the single parameter matrix with the new convolutional layer. Since the at least two matrices are combined into a single matrix, the trained neural network that is obtained after such combination has lower computation and memory costs, and avoids extra computation and memory costs when the trained neural network is used for performing inferences or predictions in the intended application.
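The combine-and-replace step at block 808 can be sketched as follows (NumPy, with random stand-ins for the trained parameters and an assumed axis layout); folding via the kernel composition described above yields a kernel with the shape of a normal convolution kernel:

```python
import numpy as np

rng = np.random.default_rng(6)
M, N, Cin, Cout, Dmul = 3, 3, 5, 4, 9
S = M * N

# Stand-ins for the trained parameters of the depth-wise over-parameterized layer.
D = rng.normal(size=(S, Dmul, Cin))          # depth-wise kernel
W = rng.normal(size=(Dmul, Cin, Cout))       # normal-convolution kernel

# Combine into the single kernel used by the replacement convolutional layer.
D_T = D.transpose(1, 0, 2)                   # transpose on the first and second axes
W_new = np.einsum('dsc,dco->sco', D_T, W)

# The new kernel has the shape of a normal convolution kernel, so the trained
# network performs inference with no extra computation or memory cost.
assert W_new.shape == (S, Cin, Cout)
```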
For the sake of simplicity, for a detailed description of operations and algorithms (such as training a neural network, operations associated with a depth‐wise over‐parameterization, various types of over‐parameterization, etc.) that are involved in the above method blocks, reference may be made to the foregoing sections. Although the foregoing method blocks describe that one or more types of over‐parameterization (such as a depth‐wise over‐parameterization) are performed on a convolutional layer of a neural network, in some instances, one or more types of over‐parameterization (such as a depth‐wise over‐parameterization, or other types of over‐parameterization as described in the foregoing description) may additionally or alternatively be performed on one or more other types of layers (e.g., a fully connected layer, etc.) of the neural network.
In implementations, some or all of the above method blocks may be implemented or performed by one or more specific processing units of the over‐parameterization system 102. For example, due to a large number of tensor and matrix computations involved in training the neural network, the over‐parameterization system 102 may employ a tensor processing unit, a graphics processing unit, and/or a neural processing unit to perform tensor and matrix computations, thus further improving the performance of the over‐parameterization system 102, and improving the speed of training the neural network in a training phase. In implementations, if the trained neural network is also implemented or used in the over‐parameterization system 102, the over‐parameterization system 102 may further employ such specific processing units to perform tensor and matrix computations involved in making inferences or predictions by the trained neural network in an inference phase.
Although the above method blocks are described to be executed in a particular order, in some implementations, some or all of the method blocks can be executed in other orders, or in parallel.
Conclusion
Although implementations have been described in language specific to structural features and/or methodological acts, it is to be understood that the claims are not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the claimed subject matter. Additionally or alternatively, some or all of the operations may be implemented by one or more ASICs, FPGAs, or other hardware.
The present disclosure can be further understood using the following clauses.
Clause 1: A method implemented by one or more computing devices, the method comprising: obtaining information of a neural network that includes one or more convolutional layers and one or more other layers; performing depth‐wise over‐parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth‐wise over‐parameterization convolutional layer; training the neural network using training data according to a training method; and selectively combining the parameters associated with the depth‐wise over‐parameterization layer to obtain a new convolutional layer, and replacing the depth‐wise over‐parameterization layer by the new convolutional layer in the trained  neural network, the trained neural network being configured to perform inferences for a specific application based on input data.
Clause 2: The method of Clause 1, wherein performing the depth‐wise over‐parameterization comprises: transforming a parameter matrix associated with the at least one convolutional layer into at least two separate matrices; and associating the at least two separate matrices with the depth‐wise over‐parameterization convolutional layer.
Clause 3: The method of Clause 2, wherein selectively combining the parameters associated with the depth‐wise over‐parameterization layer to obtain the new convolutional layer comprises: combining the at least two separate matrices into a single parameter matrix; and associating the single parameter matrix with the new convolutional layer.
Clause 4: The method of Clause 1, wherein performing the depth‐wise over‐parameterization comprises performing over‐parameterization on spatial dimensions of a parameter matrix associated with the at least one convolutional layer to obtain at least two separate matrices.
Clause 5: The method of Clause 4, wherein the parameter matrix associated with the at least one convolutional layer comprises a plurality of channels representing a color space, and performing the over‐parameterization on the spatial dimensions of the parameter matrix comprises: performing the over‐parameterization independently on different channels of the plurality of channels to obtain different over‐parameterization matrices for the different channels; or performing an identical over‐parameterization on the different channels to obtain a same over‐parameterization matrix for the different channels.
Clause 6: The method of Clause 1, further comprising performing channel‐wise over‐parameterization on the at least one convolutional layer and/or another convolutional layer of the one or more convolutional layers.
Clause 7: The method of Clause 1, wherein the one or more other layers comprise at least one of: one or more fully connected layers, one or more pooling layers, one or more activation layers, or one or more batch normalization layers.
Clause 8: The method of Clause 1, wherein a number of parameters that are associated with the depth‐wise over‐parameterization convolutional layer and are trainable is higher as compared to a number of parameters of the at least one convolutional layer.
Clause 9: The method of Clause 1, wherein the specific application comprises an image classification, an object detection, or a semantic segmentation.
Clause 10: One or more computer readable media storing executable instructions that, when executed by one or more processors, cause the one or more processors to perform acts comprising: obtaining information of a neural network that includes one or more convolutional layers and one or more other layers; performing depth‐wise over‐parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth‐wise over‐parameterization convolutional layer; training the neural network using training data according to a training method; and selectively combining the parameters associated with the depth‐wise over‐parameterization layer to obtain a new convolutional layer, and  replacing the depth‐wise over‐parameterization layer by the new convolutional layer in the trained neural network, the trained neural network being configured to perform inferences for a specific application based on input data.
Clause 11: The one or more computer readable media of Clause 10, wherein performing the depth‐wise over‐parameterization comprises: transforming a parameter matrix associated with the at least one convolutional layer into at least two separate matrices; and associating the at least two separate matrices with the depth‐wise over‐parameterization convolutional layer.
Clause 12: The one or more computer readable media of Clause 11, wherein selectively combining the parameters associated with the depth‐wise over‐parameterization layer to obtain the new convolutional layer comprises: combining the at least two separate matrices into a single parameter matrix; and associating the single parameter matrix with the new convolutional layer.
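The transform/combine pair in Clauses 11 and 12 can be sketched in a few lines of NumPy. This is an illustration under assumed shapes (a 3x3 kernel slice for one channel, inner dimension M = 12), not the patent's implementation: a single kernel vector is replaced by two separately trainable factors, and after training the factors are multiplied back into one kernel so inference costs the same as an ordinary convolution.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 3, 12          # kernel size and inner dimension; both are assumptions

x = rng.normal(size=K * K)            # one input patch, flattened

# Transform (Clause 11): replace a single kernel vector by two
# separately trainable factors D (K*K x M) and v (M).
D = rng.normal(size=(K * K, M))
v = rng.normal(size=M)

# Over-parameterized forward pass: a depth-wise stage (project the
# patch through D) followed by a combining stage (weight by v).
y_two_stage = (D.T @ x) @ v

# Combine (Clause 12): fold the two factors into a single kernel
# vector, associating it with the new convolutional layer.
w_folded = D @ v
y_folded = x @ w_folded

# The folded kernel reproduces the two-stage computation exactly,
# because both are linear in the input patch.
assert np.allclose(y_two_stage, y_folded)
```

The equivalence is just associativity: x · (D v) = (Dᵀ x) · v, which is why the combination can be done once after training without changing the network's function.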
Clause 13: The one or more computer readable media of Clause 10, wherein performing the depth-wise over-parameterization comprises performing over-parameterization on spatial dimensions of a parameter matrix associated with the at least one convolutional layer to obtain at least two separate matrices.
Clause 14: The one or more computer readable media of Clause 13, wherein the parameter matrix associated with the at least one convolutional layer comprises a plurality of channels representing a color space, and performing the over-parameterization on the spatial dimensions of the parameter matrix comprises: performing the over-parameterization independently on different channels of the plurality of channels to obtain different over-parameterization matrices for the different channels; or performing an identical over-parameterization on the different channels to obtain a same over-parameterization matrix for the different channels.
Clause 15: The one or more computer readable media of Clause 10, wherein the acts further comprise performing channel-wise over-parameterization on the at least one convolutional layer and/or another convolutional layer of the one or more convolutional layers.
Clause 16: The one or more computer readable media of Clause 10, wherein the one or more other layers comprise at least one of: one or more fully connected layers, one or more pooling layers, one or more activation layers, or one or more batch normalization layers.
Clause 17: A system comprising: one or more processors; and memory storing executable instructions that, when executed by the one or more processors, cause the one or more processors to perform acts comprising: obtaining information of a neural network that includes one or more convolutional layers and one or more other layers; performing depth‐wise over‐parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth‐wise over‐parameterization convolutional layer; training the neural network using training data according to a training method; and selectively combining the parameters associated with the depth‐wise over‐parameterization layer to obtain a new convolutional layer, and replacing the depth‐wise over‐parameterization layer by the new convolutional layer in the trained neural network, the trained neural network being configured to perform inferences for a specific application based on input data.
Clause 18: The system of Clause 17, wherein performing the depth‐wise over‐parameterization comprises: transforming a parameter matrix associated with the at least one convolutional layer into at least two separate matrices; and associating the at least two separate matrices with the depth‐wise over‐parameterization convolutional layer.
Clause 19: The system of Clause 18, wherein selectively combining the parameters associated with the depth-wise over-parameterization layer to obtain the new convolutional layer comprises: combining the at least two separate matrices into a single parameter matrix; and associating the single parameter matrix with the new convolutional layer.

Claims (19)

  1. A method implemented by one or more computing devices, the method comprising:
    obtaining information of a neural network that includes one or more convolutional layers and one or more other layers;
    performing depth‐wise over‐parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth‐wise over‐parameterization convolutional layer;
    training the neural network using training data according to a training method; and
    selectively combining the parameters associated with the depth‐wise over‐parameterization layer to obtain a new convolutional layer, and replacing the depth‐wise over‐parameterization layer by the new convolutional layer in the trained neural network, the trained neural network being configured to perform inferences for a specific application based on input data.
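The full claim-1 pipeline (over-parameterize, train, then fold for inference) can be illustrated end to end with a toy NumPy example. A linear least-squares problem stands in for one channel of a convolutional layer, and the shapes, initialization scale, learning rate, and step count are all assumptions chosen for the sketch, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng(1)
K2, M = 9, 12                 # flattened 3x3 kernel, inner dimension (assumptions)

# Toy "training data": a linear regression standing in for a layer.
X = rng.normal(size=(64, K2))
y = X @ rng.normal(size=K2)

# Depth-wise over-parameterization: two trainable factors in place
# of a single kernel vector.
D = 0.1 * rng.normal(size=(K2, M))
v = 0.1 * rng.normal(size=M)

lr, losses = 0.01, []
for _ in range(500):
    w = D @ v                              # composed kernel, forward pass
    err = X @ w - y
    losses.append(float(err @ err) / len(X))
    grad_w = 2 * X.T @ err / len(X)
    # Backpropagate through the composition w = D @ v; compute both
    # factor gradients before updating either factor.
    gD, gv = np.outer(grad_w, v), D.T @ grad_w
    D -= lr * gD
    v -= lr * gv

# Fold the trained factors into a single kernel and use it as the
# new convolutional layer; it reproduces the trained composition.
w_new = D @ v
assert losses[-1] < losses[0]
assert np.allclose(X @ w_new, (X @ D) @ v)
```

The extra parameters exist only during training; at inference time the folded kernel has the same shape and cost as the original layer.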
  2. The method of claim 1, wherein performing the depth‐wise over‐parameterization comprises:
    transforming a parameter matrix associated with the at least one convolutional layer into at least two separate matrices; and
    associating the at least two separate matrices with the depth‐wise over‐parameterization convolutional layer.
  3. The method of claim 2, wherein selectively combining the parameters associated with the depth‐wise over‐parameterization layer to obtain the new convolutional layer comprises:
    combining the at least two separate matrices into a single parameter matrix; and
    associating the single parameter matrix with the new convolutional layer.
  4. The method of claim 1, wherein performing the depth‐wise over‐parameterization comprises performing over‐parameterization on spatial dimensions of a parameter matrix associated with the at least one convolutional layer to obtain at least two separate matrices.
  5. The method of claim 4, wherein the parameter matrix associated with the at least one convolutional layer comprises a plurality of channels representing a color space, and performing the over‐parameterization on the spatial dimensions of the parameter matrix comprises:
    performing the over‐parameterization independently on different channels of the plurality of channels to obtain different over‐parameterization matrices for the different channels; or
    performing an identical over-parameterization on the different channels to obtain a same over‐parameterization matrix for the different channels.
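The two branches of claim 5 differ only in whether each channel gets its own over-parameterization matrix or all channels share one. A NumPy sketch under assumed shapes (3 channels standing in for a color space, a flattened 3x3 kernel, inner dimension 12; none of these values come from the patent):

```python
import numpy as np

rng = np.random.default_rng(2)
C, K2, M = 3, 9, 12     # channels, flattened kernel size, inner dim (assumptions)

v = rng.normal(size=(C, M))            # combining factors, one per channel

# Branch 1: independent over-parameterization per channel, giving a
# different over-parameterization matrix for each channel.
D_indep = rng.normal(size=(C, K2, M))
w_indep = np.einsum('ckm,cm->ck', D_indep, v)

# Branch 2: an identical over-parameterization applied to every
# channel, i.e. a single matrix shared across channels.
D_shared = rng.normal(size=(K2, M))
w_shared = np.einsum('km,cm->ck', D_shared, v)

# Either branch folds back to one K*K kernel slice per channel.
assert w_indep.shape == w_shared.shape == (C, K2)
```

Branch 1 trades more trainable parameters for per-channel flexibility; branch 2 keeps the over-parameterization overhead independent of the channel count.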
  6. The method of claim 1, further comprising performing channel‐wise over‐parameterization on the at least one convolutional layer and/or another convolutional layer of the one or more convolutional layers.
  7. The method of claim 1, wherein the one or more other layers comprise at least one of: one or more fully connected layers, one or more pooling layers, one or more activation layers, or one or more batch normalization layers.
  8. The method of claim 1, wherein a number of trainable parameters associated with the depth‐wise over‐parameterization convolutional layer is higher than a number of trainable parameters associated with the at least one convolutional layer.
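The parameter-count relationship in claim 8 is simple arithmetic. For one K x K kernel slice factored with inner dimension M (both values are assumptions for the sketch):

```python
# Trainable parameters for one kernel slice, before and after
# depth-wise over-parameterization (claim 8). K and M are assumptions.
K, M = 3, 12

plain = K * K                 # ordinary convolutional kernel slice: 9
overparam = K * K * M + M     # factor D (K*K x M) plus factor v (M): 120

assert overparam > plain      # the over-parameterized layer trains more parameters

# After folding (claims 3, 12, and 19) the inference-time kernel is
# back to K*K parameters, so inference cost is unchanged.
folded = K * K
assert folded == plain
```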
  9. The method of claim 1, wherein the specific application comprises an image recognition, an image classification, an object detection, or a semantic segmentation.
  10. One or more computer readable media storing executable instructions that, when executed by one or more processors, cause the one or more processors to perform acts comprising:
    obtaining information of a neural network that includes one or more convolutional layers and one or more other layers;
    performing depth‐wise over‐parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth‐wise over‐parameterization convolutional layer;
    training the neural network using training data according to a training method; and
    selectively combining the parameters associated with the depth‐wise over‐parameterization layer to obtain a new convolutional layer, and replacing the depth‐wise over‐parameterization layer by the new convolutional layer in the trained neural network, the trained neural network being configured to perform inferences for a specific application based on input data.
  11. The one or more computer readable media of claim 10, wherein performing the depth‐wise over‐parameterization comprises:
    transforming a parameter matrix associated with the at least one convolutional layer into at least two separate matrices; and
    associating the at least two separate matrices with the depth‐wise over‐parameterization convolutional layer.
  12. The one or more computer readable media of claim 11, wherein selectively combining the parameters associated with the depth‐wise over‐parameterization layer to obtain the new convolutional layer comprises:
    combining the at least two separate matrices into a single parameter matrix; and
    associating the single parameter matrix with the new convolutional layer.
  13. The one or more computer readable media of claim 10, wherein performing the depth‐wise over‐parameterization comprises performing over‐parameterization on spatial dimensions of a parameter matrix associated with the at least one convolutional layer to obtain at least two separate matrices.
  14. The one or more computer readable media of claim 13, wherein the parameter matrix associated with the at least one convolutional layer comprises a plurality of channels representing a color space, and performing the over‐parameterization on the spatial dimensions of the parameter matrix comprises:
    performing the over‐parameterization independently on different channels of the plurality of channels to obtain different over‐parameterization matrices for the different channels; or
    performing an identical over‐parameterization on the different channels to obtain a same over‐parameterization matrix for the different channels.
  15. The one or more computer readable media of claim 10, wherein the acts further comprise performing channel‐wise over‐parameterization on the at least one convolutional layer and/or another convolutional layer of the one or more convolutional layers.
  16. The one or more computer readable media of claim 10, wherein the one or more other layers comprise at least one of: one or more fully connected layers, one or more pooling layers, one or more activation layers, or one or more batch normalization layers.
  17. A system comprising:
    one or more processors; and
    memory storing executable instructions that, when executed by the one or more processors, cause the one or more processors to perform acts comprising:
    obtaining information of a neural network that includes one or more convolutional layers and one or more other layers;
    performing depth‐wise over‐parameterization on at least one convolutional layer of the one or more convolutional layers to obtain a depth‐wise over‐parameterization convolutional layer;
    training the neural network using training data according to a training method; and
    selectively combining the parameters associated with the depth‐wise over‐parameterization layer to obtain a new convolutional layer, and replacing the depth‐wise over‐parameterization layer by the new convolutional layer in the trained neural network, the trained neural network being configured to perform inferences for a specific application based on input data.
  18. The system of claim 17, wherein performing the depth‐wise over‐parameterization comprises:
    transforming a parameter matrix associated with the at least one convolutional layer into at least two separate matrices; and
    associating the at least two separate matrices with the depth‐wise over‐parameterization convolutional layer.
  19. The system of claim 18, wherein selectively combining the parameters associated with the depth‐wise over‐parameterization layer to obtain the new convolutional layer comprises:
    combining the at least two separate matrices into a single parameter matrix; and
    associating the single parameter matrix with the new convolutional layer.
PCT/CN2020/097221 2020-06-19 2020-06-19 Depth-wise over-parameterization WO2021253440A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080100158.XA CN115461754A (en) 2020-06-19 2020-06-19 Depth over-parameterization
PCT/CN2020/097221 WO2021253440A1 (en) 2020-06-19 2020-06-19 Depth-wise over-parameterization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/097221 WO2021253440A1 (en) 2020-06-19 2020-06-19 Depth-wise over-parameterization

Publications (1)

Publication Number Publication Date
WO2021253440A1 true WO2021253440A1 (en) 2021-12-23

Family

ID=79269041

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/097221 WO2021253440A1 (en) 2020-06-19 2020-06-19 Depth-wise over-parameterization

Country Status (2)

Country Link
CN (1) CN115461754A (en)
WO (1) WO2021253440A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2623551A (en) * 2022-10-20 2024-04-24 Continental Automotive Tech Gmbh Systems and methods for learning neural networks for embedded applications

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100272311A1 (en) * 2007-02-14 2010-10-28 Tal Nir Over-Parameterized Variational Optical Flow Method
CN110084356A (en) * 2018-01-26 2019-08-02 北京深鉴智能科技有限公司 A kind of deep neural network data processing method and device
CN110263909A (en) * 2018-03-30 2019-09-20 腾讯科技(深圳)有限公司 Image-recognizing method and device
CN111178626A (en) * 2019-12-30 2020-05-19 苏州科技大学 Building energy consumption prediction method and monitoring prediction system based on WGAN algorithm

Also Published As

Publication number Publication date
CN115461754A (en) 2022-12-09

Similar Documents

Publication Publication Date Title
JP7462623B2 (en) System and method for accelerating and embedding neural networks using activity sparsification
Habib et al. Optimization and acceleration of convolutional neural networks: A survey
US20210166112A1 (en) Method for neural network and apparatus performing same method
US11593658B2 (en) Processing method and device
EP4036803A1 (en) Neural network model processing method and apparatus, computer device, and storage medium
US20190278600A1 (en) Tiled compressed sparse matrix format
CN107622303B (en) Method for neural network and device for performing the method
CN111989697A (en) Neural hardware accelerator for parallel and distributed tensor computation
CN112673383A (en) Data representation of dynamic precision in neural network cores
US20210012178A1 (en) Systems, methods, and devices for early-exit from convolution
US11295236B2 (en) Machine learning in heterogeneous processing systems
US11934949B2 (en) Composite binary decomposition network
US20210125071A1 (en) Structured Pruning for Machine Learning Model
US11429394B2 (en) Efficient multiply-accumulation based on sparse matrix
de Prado et al. Automated design space exploration for optimized deployment of dnn on arm cortex-a cpus
WO2021253440A1 (en) Depth-wise over-parameterization
US11710042B2 (en) Shaping a neural network architecture utilizing learnable sampling layers
JP7150651B2 (en) Neural network model reducer
US11481604B2 (en) Apparatus and method for neural network processing
Sun et al. Computation on sparse neural networks: an inspiration for future hardware
Sun et al. Computation on sparse neural networks and its implications for future hardware
US11704562B1 (en) Architecture for virtual instructions
US11900239B2 (en) Systems and methods for accelerating sparse neural network execution
CN115601513A (en) Model hyper-parameter selection method and related device
WO2021120036A1 (en) Data processing apparatus and data processing method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20940842

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20940842

Country of ref document: EP

Kind code of ref document: A1