US20200134400A1 - Fast computation of a convolutional neural network - Google Patents

Fast computation of a convolutional neural network

Info

Publication number
US20200134400A1
Authority
US
United States
Prior art keywords
convolutional
computer
output
input data
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US16/390,042
Other versions
US10635951B1 (en)
Inventor
Yongchao LIU
Qiyin Huang
Guozhen Pan
Sizhong Li
Jianguo Xu
Haitao Zhang
Lin Wang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Assigned to ALIBABA GROUP HOLDING LIMITED reassignment ALIBABA GROUP HOLDING LIMITED ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: HUANG, QIYIN, LI, Sizhong, LIU, Yongchao, PAN, Guozhen, WANG, LIN, XU, JIANGUO, ZHANG, HAITAO
Application granted granted Critical
Publication of US10635951B1 publication Critical patent/US10635951B1/en
Publication of US20200134400A1 publication Critical patent/US20200134400A1/en
Assigned to ADVANTAGEOUS NEW TECHNOLOGIES CO., LTD. reassignment ADVANTAGEOUS NEW TECHNOLOGIES CO., LTD. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: ALIBABA GROUP HOLDING LIMITED
Assigned to Advanced New Technologies Co., Ltd. reassignment Advanced New Technologies Co., Ltd. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: ADVANTAGEOUS NEW TECHNOLOGIES CO., LTD.
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification
    • G06K 9/66
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/19 Recognition using electronic means
    • G06V 30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V 30/19173 Classification techniques

Definitions

  • CNN convolutional neural network
  • ConvNet convolutional neural network
  • a CNN: in a training phase of a machine learning process, a CNN can be trained or learned based on training data. Then, in a prediction phase of the machine learning process, the trained CNN serves as a model that receives input data and outputs predictions or decisions based on processing and analyzing the input data.
  • a computer-implemented method includes obtaining a trained convolutional neural network including one or more convolutional layers, each of the one or more convolutional layers including a number of filters with known filter parameters; pre-computing a reusable factor for each of the one or more convolutional layers based on the known filter parameters of the trained convolutional neural network; receiving input data to the trained convolutional neural network; computing an output of the each of the one or more convolutional layers using a Winograd convolutional operator based on the pre-computed reusable factor and the input data; and determining output data of the trained convolutional network based on the output of the each of the one or more convolutional layers.
  • the previously described implementation is implementable using a computer-implemented method; a non-transitory, computer-readable medium storing computer-readable instructions to perform the computer-implemented method; and a computer-implemented system including a computer memory interoperably coupled with a hardware processor configured to perform the computer-implemented method/the instructions stored on the non-transitory, computer-readable medium.
  • the described subject matter improves a CNN model by performing equivalent transformation or conversion of a computational graph to streamline the network structure of the CNN, and thus allows the implementation of the CNN to be optimized in both device-independent and device-dependent manners.
  • Example device-independent optimizations can include redundant operation removal (e.g., removing identity operators in a deep learning framework, which only transmit data from the input tensor to the output tensor without changing the data content), and layer/operation merging to reduce computational complexity (e.g., merging a convolutional layer with a batch normalization layer).
  • Example device-dependent optimizations can include improving hardware efficiency by merging operations (e.g., merging a convolutional layer with a bias addition operation that immediately follows it) to better exploit hardware computing capability, and flexibly deploying operations on different underlying architectures to maximize CNN throughput.
  • the described subject matter reduces the computation load of data prediction based on a trained CNN and improves the prediction speed, and thus reduces network latency and improves throughput of the CNN.
  • the described subject matter requires less computation power.
  • the described subject matter can combine multiple operators into a new operator to take advantage of underlying hardware accelerators (such as GPUs, FPGAs or ASIC chips).
  • a convolutional layer is typically followed by a bias addition operation or a batch normalization layer.
  • the convolution and bias addition operations can be combined into a single operation in some implementations. In this combined operation, the corresponding bias value can be added to the result of the convolution directly in the same GPU kernel launch, thus leading to better speed.
  • the combination of a convolutional layer and a batch normalization layer can be equivalently transformed offline into a combination of a convolutional layer and a bias addition, and then the aforementioned device-dependent optimizations can be applied to the convolutional layers and bias addition operations to further take advantage of GPU power and improve computation speed. This folding is sketched below.
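  • As an illustration of this offline transformation, the following minimal NumPy sketch folds a batch-normalization layer that follows a convolution into the convolution's weights and a plain bias addition. The function and variable names are illustrative, not taken from the patent; per-output-channel batch-normalization parameters are assumed.

```python
import numpy as np

def fold_batch_norm(W, gamma, beta, mean, var, eps=1e-5):
    """Fold a batch-normalization layer that follows a convolution into
    the convolution's weights and a bias addition, offline.

    W:     convolution filter tensor of shape [C_out, C_in, R, S]
    gamma, beta, mean, var: per-output-channel BN parameters, shape [C_out]

    Returns (W_folded, bias) such that, per output channel,
        batch_norm(conv(X, W)) == conv(X, W_folded) + bias.
    """
    scale = gamma / np.sqrt(var + eps)         # per-channel BN scaling
    W_folded = W * scale[:, None, None, None]  # scale each output channel's filter
    bias = beta - mean * scale                 # absorbed additive term
    return W_folded, bias
```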
  • These examples are applicable to processors or accelerators other than GPUs as well. Other advantages will be apparent to those of ordinary skill in the art.
  • FIG. 1 is a block diagram illustrating an example equivalent transformation of two computational graphs of a convolutional neural network (CNN), according to an implementation of the present disclosure.
  • FIG. 2 is a screenshot illustrating example pseudo code of fast computation of a CNN, according to an implementation of the present disclosure.
  • FIG. 3 is a flow chart illustrating an example method for fast computation of a CNN, according to an implementation of the present disclosure.
  • FIG. 4 is a block diagram illustrating an example computer system used to provide computational functionalities associated with described algorithms, methods, functions, processes, flows, and procedures as described in the instant disclosure, according to an implementation of the present disclosure.
  • a convolutional neural network (CNN or ConvNet) is one of the most representative network structures and technological innovations for deep learning. It has achieved great success in the field of imagery and is also widely used to solve problems in other fields.
  • a CNN can include one or more of an input layer, a convolutional layer, an activation function, a batch normalization, a pooling layer, a fully connected layer, and an output layer. Among them, the convolutional layer typically involves the most computational load and may consume the longest computation time of the entire CNN.
  • a convolutional layer can include one or more filters (also referred to as kernels). The convolutional layer can receive input data, perform a convolution operation of the input data with each of one or more filters of the convolutional layer, and generate output data of the convolutional layer.
  • a CNN can include tens of convolutional layers.
  • the described techniques can help accelerate the convolution operation, which is the core operation of the convolutional layer. In turn, the described techniques can improve the computational efficiency and reduce the computational load of a CNN.
  • the described techniques can have a variety of applications.
  • the described techniques can be applied in face recognition in areas such as unmanned supermarkets, unmanned banks, security protection, and smart cities.
  • deep convolutional neural networks have been used in face recognition.
  • the described techniques can help deal with tasks of face identification, especially among a large population.
  • the described techniques can improve response time of face recognition based on a trained CNN model.
  • the described techniques can reduce the processing time of each request of face recognition. In turn, system throughput can be increased and operation costs can be reduced without increasing computing resources.
  • the described techniques can be used in auto insurance.
  • the described techniques can automatically identify surface damage of a vehicle based on deep convolutional neural network image technology. For example, after a car accident, the damage to the car's surface can be photographed and uploaded to an insurance company's claim server, which can perform automatic damage identification and compensation quota valuation.
  • the described techniques can be used by the insurance claim server to perform automatic damage identification based on deep convolutional neural network technology.
  • the described techniques can allow the insurance company to provide faster responses and higher throughput without increasing its equipment budget.
  • the described techniques can be implemented using a computational graph.
  • Computational graphs can be used to represent machine learning algorithms, especially in the field of deep learning.
  • a computational graph can be a directed graph that includes multiple nodes, where the nodes correspond to operations or variables. Variables can feed their values into operations, and operations can feed their outputs into other operations. In this way, each node in the computational graph defines a function of the variables.
  • the values that are fed into the nodes (i.e., input) and come out of the nodes (i.e., output) can be represented by tensors.
  • a tensor can be regarded as a multi-dimensional array.
  • a tensor can encompass scalars, vectors, matrices, and tensors of a higher rank or dimensions. Using tensors to represent input and output data of the variable and operator can help achieve automatic differentiation, dynamic memory management, and other functionalities.
  • the computational graph only needs to define the function of the operator. It is not necessary to specify the specific implementation of each operator. Therefore, the computational graph provides the flexibility such that the operator can be executed or otherwise implemented on one or more of a CPU or a hardware accelerator such as GPU, FPGA, or AI chip.
  • Computational graphs can be used for model training, model prediction or inference, or other phases of a machine learning process.
  • FIG. 1 is a block diagram illustrating an example equivalent transformation 101 of two computational graphs 100 and 105 of a CNN, according to an implementation of the present disclosure.
  • the two computational graphs 100 and 105 can represent the same CNN.
  • the two computational graphs 100 and 105 illustrate operations or functions of a single convolutional layer of the CNN.
  • a CNN can include multiple layers and can be represented by a computational graph accordingly.
  • tensor X 110 is an input and tensor Z 170 is an output of the convolutional layer of the CNN represented by the computational graph 100 , respectively.
  • the computational graph 100 includes two nodes 130 and 150 . Each node corresponds to a variable or an operator, which can be regarded as a fine-grained basic operation of a neural network.
  • the node 130 represents a convolution operator, Conv2D.
  • the Conv2D 130 can represent the convolution operation of the input data tensor X 110 with a tensor W 120 , where the tensor W 120 represents weights or parameters of filters of the convolutional layer of the CNN.
  • the Conv2D 130 can output tensor Y 140 .
  • the output tensor Y 140 and a bias 160 can serve as the input to the node 150 , which represents the operation of adding the bias 160 to the output data Y 140 , resulting in the output Z 170 of the convolutional layer of the CNN.
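  • For illustration, the two nodes of the computational graph 100 can be sketched as two plain functions. This is a minimal NumPy/SciPy sketch with illustrative names, assuming a stride of 1 and 'valid' boundary handling; it is not the patent's implementation.

```python
import numpy as np
from scipy.signal import correlate2d

def conv2d(X, W):
    """Node 130 (Conv2D): direct convolution (cross-correlation) by definition.
    X: [C_in, H, Wi] single input; W: [C_out, C_in, R, S]."""
    C_out, C_in, R, S = W.shape
    H_out, W_out = X.shape[1] - R + 1, X.shape[2] - S + 1
    Y = np.zeros((C_out, H_out, W_out))
    for o in range(C_out):
        for c in range(C_in):
            Y[o] += correlate2d(X[c], W[o, c], mode="valid")
    return Y

def bias_add(Y, b):
    """Node 150: add a per-output-channel bias to the convolution result."""
    return Y + b[:, None, None]

# Z = bias_add(conv2d(X, W), b)   # tensor Z 170 of the computational graph 100
```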
  • tensor X′ 115 is an input and tensor Z′ 175 is an output of the convolutional layer of the CNN represented by the computational graph 105 , respectively.
  • the computational graph 105 includes two nodes 135 and 155 .
  • the node 135 represents a convolution operator, Wino-Conv2D.
  • the Wino-Conv2D 135 can represent a modified convolution operation of the input data tensor X′ 115 with a tensor U 125 , where the tensor U 125 represents modified weights or parameters of filters of the convolutional layer of the CNN.
  • the Wino-Conv2D 135 can output tensor Y′ 145 .
  • the output tensor Y′ 145 and a bias 165 can serve as the input to the node 155 , which represents the operation of adding the bias 165 to the output data Y′ 145 , resulting in the output Z′ 175 of the convolutional layer of the CNN.
  • the computational graph 105 can represent an improved or optimized convolution operation for fast computation of the CNN.
  • after the CNN is trained, the parameters of the nodes (e.g., the values of the filter or kernel parameters in the tensor W 120 or tensor U 125 ) are known, and the values of these parameters remain unchanged during prediction.
  • the tensor U 125 can be computed after the training, before performing prediction based on any input data X′ 115 .
  • the computational load of the CNN can be reduced, especially for prediction of multiple inputs to a CNN.
  • the new convolution operator Wino-Conv2D 135 can replace the original convolution operator Conv2D 130 in the original computational graph 100 .
  • the new convolution operator Wino-Conv2D 135 uses the tensor U 125 instead of the original tensor W 120 as the filter or kernel parameter.
  • the computational graph 105 can achieve improved computational efficiency while preserving the equivalence with the original computational graph 100 .
  • FIG. 2 is a screenshot illustrating an example pseudo code of an example algorithm 200 for fast computation of a CNN, according to an implementation of the present disclosure.
  • the example algorithm 200 is based on the Winograd minimal filtering algorithm, a fast algorithm for CNNs.
  • the convolution operation of a convolutional layer of the CNN can be defined as a specified correlation between an input (e.g., represented by an input tensor X) and a filter or kernel of the convolutional layer (represented by a filter or kernel tensor W), resulting in an output (e.g., represented by an output tensor Y).
  • the input tensor X can have a size of [N; C in ; H; Wi], where N represents a batch size of input data to be processed (e.g., a batch operation of N images to be convolved); C in represents the number of channels (e.g., an image from a standard digital camera has three channels—red, green, and blue, and each channel can include a 2d matrix having pixel values (e.g., in the range of 0 to 255)); H and Wi represent the height and width of each channel (e.g., 255×255), respectively.
  • the filter or kernel tensor W can have a size of [C out ; C in ; R; S], wherein C out represents the number of output channels of the convolution operation; C in represents the number of input channels of the convolution operation; R and S represent the height and width of each filter or kernel of the convolutional layer, respectively.
  • R and S can be set to have the same value.
  • the output tensor Y can have a size of [N; C out ; H; Wi], with element

    Y i,o = Σ c=0 C in −1 ( X i,c ⊛ W o,c )  (1)

    where i is an integer from the range of [0, N), o is an integer from the range of [0, C out ), and ⊛ denotes the two-dimensional correlation of an input channel with a filter.
  • there are several ways to implement a convolution operation of a convolutional layer, such as calculation by definition (e.g., as given by Equation (1)), conversion to matrix multiplication, or using the Winograd fast algorithm. For example, when the size of each convolution kernel or filter is 1×1 (that is, R and S are both equal to 1), the above convolution operation can be converted to a matrix multiplication, as sketched below. In some implementations, for example, in cases where R>1 or S>1, the convolution operation can be implemented based on the Winograd fast algorithm.
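  • A minimal sketch of the 1×1 case, with illustrative names and the same shape conventions as above: since each output pixel is the [C out, C in] weight matrix applied to that pixel's length-C in channel vector, the whole convolution collapses to one matrix multiplication.

```python
import numpy as np

def conv1x1_as_matmul(X, W):
    """1x1 convolution (R == S == 1) expressed as a matrix multiplication.
    X: [C_in, H, Wi]; W: [C_out, C_in, 1, 1]."""
    C_in, H, Wi = X.shape
    C_out = W.shape[0]
    M = W.reshape(C_out, C_in)           # drop the 1x1 spatial dimensions
    Y = M @ X.reshape(C_in, H * Wi)      # one matmul covers every pixel
    return Y.reshape(C_out, H, Wi)
```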
  • for an output tile of size m×n computed with a filter of size r×s, a minimal filtering algorithm F(m×n, r×s) can be found that only requires (m+r−1)(n+s−1) multiplications to complete the convolution operation.
  • a given input channel of size H×Wi can be divided into overlapping sub-blocks, each sub-block having a size of m×n. A convolution calculation can then be performed on each sub-block separately to generate the corresponding output channel (assuming, as above, a stride of 1).
  • the width R and height S of the convolution filter or kernel in the convolution layer are generally set to the same value.
  • m and n are set to the same values as well.
  • the steps 1 - 23 of the example algorithm 200 show an example implementation of the Winograd algorithm for a convolution operation of a single convolutional layer of a CNN.
  • the CNN is characterized by W o,c ∈ R r×r , the filter or kernel between the c-th input channel and the o-th output channel, and a tensor W including the parameters or weights of the filters or kernels of the CNN.
  • Tensor X represents the input data to the CNN.
  • X c,p ∈ R (m+r−1)×(m+r−1) represents the p-th sub-block of the c-th input channel.
  • Y o,p ∈ R m×m represents the p-th sub-block of the o-th output channel.
  • B T and A T are the input and output transform matrices of the Winograd algorithm (with G the corresponding filter transform matrix), determined by the chosen F(m×m, r×r). For example, for F(2×2, 3×3) they are given by:

    B T = [[1, 0, −1, 0], [0, 1, 1, 0], [0, −1, 1, 0], [0, 1, 0, −1]], G = [[1, 0, 0], [1/2, 1/2, 1/2], [1/2, −1/2, 1/2], [0, 0, 1]], A T = [[1, 1, 1, 0], [0, 1, −1, −1]].
  • the example algorithm 200 (including steps 1 - 23 ) is executed for each input channel (e.g., a channel of an image) for each convolutional layer of the CNN. For example, if a CNN includes 5 convolutional layers, for each input channel, the example algorithm 200 is executed 5 times. If 10 channels are input for prediction, the example algorithm 200 will be executed 50 times.
  • the described techniques for fast computation of a CNN can reduce the computation load, especially given multiple input channels.
  • the described techniques for fast computation of a CNN can reuse factors that are common and unchanged during the prediction phase of a machine learning process.
  • the described techniques for fast computation of a CNN identify such factors and pre-compute them.
  • the described fast computation algorithm can pre-compute the tensor U according to steps 1 - 6 because all the parameters needed for such a computation (e.g., W and W o,c ) are known given a trained CNN and remain unchanged during the prediction phase based on the trained CNN.
  • the tensor U can be extracted from the tensor W by offline processing.
  • the steps 1 - 6 (collectively denoted as 210 ) of the example algorithm 200 only need to be executed once for each convolutional layer of the trained CNN.
  • the steps 1 - 6 of the example algorithm 200 need be executed only once, reducing the amount of computation in real-time model prediction and thereby reducing runtime, while the remaining steps 7 - 23 (collectively denoted as 220 ) are executed once per input (10 times for each convolutional layer of the trained CNN in the preceding example). This split is sketched below.
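  • The following minimal NumPy sketch illustrates this split for F(2×2, 3×3). The transform matrices are the standard ones for this tile size; the function names and the assumption of stride 1 with even output dimensions are illustrative, not the patent's implementation.

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) transform matrices.
G  = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]])
BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]])
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]])

def precompute_U(W):
    """Steps 1-6 (210): transform the known 3x3 filters once, offline.
    W: [C_out, C_in, 3, 3] -> U: [C_out, C_in, 4, 4], U[o,c] = G @ W[o,c] @ G.T"""
    return np.einsum('ij,ocjk,lk->ocil', G, W, G)

def wino_conv2d(X, U):
    """Steps 7-23 (220): per-input convolution reusing the pre-computed U.
    X: [C_in, H, Wi] with H-2 and Wi-2 even; stride 1, 'valid' output."""
    C_out = U.shape[0]
    H_out, W_out = X.shape[1] - 2, X.shape[2] - 2
    Y = np.zeros((C_out, H_out, W_out))
    for ty in range(0, H_out, 2):              # each 2x2 output tile comes from
        for tx in range(0, W_out, 2):          # an overlapping 4x4 input tile
            d = X[:, ty:ty + 4, tx:tx + 4]                  # [C_in, 4, 4]
            V = np.einsum('ij,cjk,lk->cil', BT, d, BT)      # input transform B^T d B
            M = np.einsum('ocij,cij->oij', U, V)            # elementwise mults, summed over c
            Y[:, ty:ty + 2, tx:tx + 2] = np.einsum('ij,ojk,lk->oil', AT, M, AT)
    return Y
```

  • In this sketch, precompute_U corresponds to producing the tensor U 125 once per convolutional layer, and wino_conv2d corresponds to the Wino-Conv2D 135 operator executed for every input.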
  • the computational graph of the trained CNN can be modified, for example, according to the equivalent transform 101 , to the computational graph 105 , as shown in FIG. 1 .
  • the pre-computed tensor U based on the example Winograd algorithm 200 can be an example of the tensor U 125 , which replaces the original weight tensor W 120 of a trained CNN.
  • the convolution operator Wino-Conv2D 135 can be implemented based on steps 7 - 23 of the example Winograd algorithm 200 , which replaces the original convolution operator Conv2D 130 in the original computational graph 100 .
  • the modified computational graph 105 can be used for prediction.
  • because the new computational graph 105 moves the calculation of the tensor U 125 out of the runtime path of the original computational graph 100 , the computational load of the modified computational graph 105 is reduced, which improves the computational efficiency of performing the convolution operation of a CNN.
  • Simulation results have shown that the computation based on the modified computational graph 105 with the pre-computed U 125 and the convolution operator Wino-Conv2D 135 can improve the computation speed as much as 30% compared to the computational graph 100 with the original weight tensor W 120 and the convolution operator Conv2D 130 .
  • FIG. 3 is a flowchart of an example method 300 for fast computation of a CNN, according to an implementation.
  • various steps of method 300 can be run in parallel, in combination, in loops, or in any order.
  • the description that follows generally describes method 300 in the context of the other figures in this description.
  • the method 300 may be performed, for example, by any suitable system, environment, software, and hardware, or a combination of systems, environments, software, and hardware, as appropriate.
  • the method 300 can be performed by a data processing apparatus that is configured to execute machine learning algorithms using CNNs.
  • the data processing apparatus can include or be implemented by one or more of, for example, general-purpose CPUs or hardware accelerators such as GPUs, FPGAs, and even custom ASIC processors.
  • a trained CNN is obtained.
  • Obtaining a trained CNN can include, for example, computing the trained CNN through a training phase or process of a machine learning process (e.g., based on training data or sample inputs), retrieving the trained CNN from a memory or another data store, or receiving the trained CNN from another source (e.g., another data processing apparatus that performs the training process).
  • the trained CNN can receive and analyze input data (e.g., an input image), and predict output data (e.g., respective probabilities of categories or classifications of the input image).
  • the trained CNN has a known network architecture (e.g., an ordered sequence of different layers) defined by known parameters of each layer in the trained CNN.
  • Each layer in the trained CNN can operate on the input data to the trained CNN (either directly or by virtue of operating on an output of a previous layer in the trained CNN).
  • a layer that operates on data in the trained CNN prior to another layer is referred to as being a prior, previous, or upstream layer relative to the another layer.
  • a layer that operates on data in the trained CNN following or after another layer's operation on data is referred to as a later or downstream layer relative to the another layer.
  • an output of a prior layer serves as an input to the adjacent later layer.
  • the input data to the trained CNN go through each layer of the trained CNN in order, from the beginning of the CNN (e.g., an input layer that receives the input data, or the initial layer that operates directly on the input data) to the end of the CNN (e.g., an output layer that outputs a result of the CNN).
  • the trained CNN includes one or more convolutional layers.
  • Each of the one or more convolutional layers includes a respective one or more filters (or kernels) with known filter parameters (e.g., the number of filters, filter sizes, parameter values of the filter, and connection weights).
  • Different convolutional layers may include the same or different number of filters.
  • Each filter is defined by known filter parameters or weights.
  • a filter can be represented by a matrix, such as the filter ⁇ o,c ⁇ R r ⁇ r as described with respect to FIG. 2 .
  • the values of the filter matrix and connection weights are learned and known during the training process, and these values will not change when using the trained CNN, for example, for prediction based on input data.
  • the filters and/or the connection weights of a convolutional layer can be collectively represented by a tensor.
  • each convolutional layer of the trained CNN can be represented by the filter matrices and a weight tensor (e.g., the tensor W as described with respect to FIG. 2 ) that includes parameters or weights of the filters or kernels of the CNN. From 310 , method 300 proceeds to 320 .
  • a reusable factor for each of the one or more convolutional layers can be pre-computed based on the known filter parameters of the trained CNN. For example, for each of the one or more convolutional layers, a reusable factor that only depends on known, unchanged parameters can be identified and computed independently of any input data to the CNN.
  • the tensor U as described with respect to the example algorithm 200 in FIG. 2 , is an example of the reusable factor for each of the one or more convolutional layers of the trained CNN.
  • the tensor U can be computed according to steps 1 - 6 of the example algorithm 200 because all the parameters needed for such a computation (e.g., W and W o,c ) are known given the trained CNN and remain unchanged during the prediction phase based on the trained CNN.
  • pre-computing a reusable factor for each of the one or more convolutional layers based on the known filter parameters of the trained CNN includes computing the reusable factor for each of the one or more convolutional layers based on the known filter parameters of the trained CNN offline regardless of the input data to the trained CNN.
  • the reusable factor only needs to be computed once, for example, by offline processing and can be retrieved and reused during runtime computation. From 320 , method 300 proceeds to 330 .
  • the pre-computed reusable factor for each of the one or more convolutional layers can be saved, for example, in a data store for ease of later retrieval when using the trained CNN, for example, for prediction.
  • the saved pre-computed reusable factor can be shared, published or otherwise transmitted to other data processing devices. From 330 , method 300 proceeds to 340 .
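  • For example, a minimal sketch of saving and later retrieving the pre-computed factors with NumPy; the file name and key layout are hypothetical, and `factors_per_layer` is assumed to be a list of pre-computed U tensors, one per convolutional layer.

```python
import numpy as np

# One-time offline step: save the reusable factor U of each convolutional layer.
np.savez("reusable_factors.npz",
         **{f"layer_{k}_U": U for k, U in enumerate(factors_per_layer)})

# At prediction time, retrieve the factors instead of recomputing them.
loaded = np.load("reusable_factors.npz")
U_layer0 = loaded["layer_0_U"]
```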
  • at 340 , input data to the trained CNN are received. The input data can include, for example, image data of one or more images (e.g., a photo including a face of a person), or other types of input data.
  • Each image can include one or more channels.
  • an image from a standard digital camera can be regarded as having three channels: red, green, and blue.
  • a grayscale image can be regarded as having a single channel.
  • the input data can be represented by a tensor that can include scalars, vectors, matrices, and high-dimensional tensors.
  • a grayscale image can be represented by a 2d matrix of pixel values.
  • the value of each pixel in the matrix can range, for example, from 0 to 255, with zero indicating black and 255 indicating white.
  • the value of each pixel is normalized by dividing it by 255.
  • Each channel of a color image can include a 2d matrix having pixel values, for example, in the range 0 to 255 or in the range 0 to 1.0 after normalization.
  • the 2d-matrices stacked over each other can form a tensor that represents the color image.
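  • A minimal sketch of this representation, using a hypothetical 255×255 RGB image with illustrative values:

```python
import numpy as np

# Three 2d channel matrices of pixel values in [0, 255].
red   = np.random.randint(0, 256, (255, 255))
green = np.random.randint(0, 256, (255, 255))
blue  = np.random.randint(0, 256, (255, 255))

# Stack the channels into a [C_in, H, Wi] tensor and normalize each
# pixel value from the range [0, 255] to the range [0.0, 1.0].
X = np.stack([red, green, blue]).astype(np.float32) / 255.0
```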
  • the image data can include a tensor that includes multiple images.
  • the input tensor X as described with respect to the example algorithm 200 in FIG. 2 is an example of the input data.
  • the input data can be divided into sub-blocks or channels, for example, according to the techniques as described with respect to the example algorithm 200 .
  • the convolution operation can be performed on each of the sub-blocks or channels of the data, for example, according to the Winograd algorithm or another method.
  • the input data to CNN can go through each layer of the trained CNN according to the known network architecture of the trained CNN. From 340 , method 300 proceeds to 350 .
  • an output of the each of the one or more convolutional layers is computed based on the pre-computed reusable factor and the input data.
  • the output of the Layer k includes a result of a convolution operation performed between an input to the Layer k and the filters of the Layer k.
  • the input to the Layer k can include the input data to the trained CNN, an output of a previous layer (e.g., Layer k ⁇ 1) of the trained CNN, or a combination of them.
  • the output of the Layer k can be computed using a Winograd convolutional operator (e.g., the Wino-Conv2D 135 as described with respect to FIG. 1 ).
  • the output of the Layer k can be computed according to the Winograd minimal filtering algorithm (e.g., the example algorithm 200 described with respect to FIG. 2 ).
  • the Winograd convolutional operator uses the pre-computed reusable factor (e.g., the tensor U 125 ) and operates on the input data either directly or indirectly (by virtue of operating on the output of a previous layer). From 350 , method 300 proceeds to 360 .
  • output data of the trained convolutional network is determined based on the output of the each of the one or more convolutional layers.
  • determining output data of the trained convolutional network includes generating, predicting, or otherwise computing the output data of the trained CNN.
  • the output data can include a prediction, classification, or other features or attributes derived from the input data by the operations of the trained CNN.
  • the output data can include a vector of probabilities of possible categories (e.g., a person, a dog, a car, or a tree) of the input image.
  • the output data can include an identification of an object in the input image (e.g., for face recognition).
  • the output data can include an identification or categorization of vehicular damages based on input photos of a vehicle.
  • the output data can include an enhanced or filtered version of the input data (e.g., a sharpened or blurred version of an input photo).
  • the output data of the trained convolutional network are determined based on the output of the each of the one or more convolutional layers in that they are determined based on one or more of the outputs of those layers (e.g., the output of the Layer k, for k ∈ [1, L], where L is the total number of layers in the trained CNN).
  • the output data of the trained convolutional network is determined based on the output of the last layer, Layer L, of the trained CNN, where the output of the Layer L is determined based on the output of the previous layer, Layer L ⁇ 1, and so on, until tracing back to the input data to the trained CNN.
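  • A minimal sketch of this layer-by-layer determination; each element of `layers` is assumed to be a callable wrapping that layer's operation (e.g., a Winograd convolution with its pre-computed factor), and the names are illustrative.

```python
def predict(layers, X):
    """Trace the input through the layers in order: the output of Layer k
    serves as the input to Layer k+1, so the final output data depend on
    every layer's output back to the input data X."""
    out = X
    for layer in layers:   # layers 1..L of the trained CNN
        out = layer(out)
    return out
```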
  • the output data of the trained convolutional network can be determined based on the output of the each of the one or more convolutional layers in another manner. From 360 , method 300 proceeds to 370 .
  • the output data of the trained convolutional network can be output, for example, via a user interface (e.g., a graphical user interface).
  • the output data of the trained convolutional network can be represented in a table, a graph, a text, or another format and displayed to a user via a screen or another user interface.
  • the output data of the trained convolutional network can be saved, transmitted, or otherwise output to another device (e.g., a storage device or another data processing apparatus for further processing).
  • method 300 goes back to 340 to determine whether any further input data (referred to as second input data) to the trained CNN are received.
  • the second input data to the trained CNN can include additional or updated input data (e.g., another set of images) to the trained CNN for analyzing and prediction. Accordingly, method 300 can proceed to 350 for computing a second output of the each of the one or more convolutional layers based on the pre-computed reusable factor and the second input data; and then to 360 for determining second output data of the trained convolutional network based on the second output of the each of the one or more convolutional layers.
  • FIG. 4 is a block diagram of an example computer system 400 used to provide computational functionalities associated with described algorithms, methods, functions, processes, flows, and procedures, as described in the instant disclosure, according to an implementation.
  • the illustrated computer 402 is intended to encompass any computing device such as a server, desktop computer, laptop/notebook computer, wireless data port, smart phone, personal digital assistant (PDA), tablet computing device, one or more processors within these devices, or any other suitable processing device, including physical or virtual instances (or both) of the computing device.
  • the computer 402 may include a computer that includes an input device, such as a keypad, keyboard, touch screen, or other device that can accept user information, and an output device that conveys information associated with the operation of the computer 402 , including digital data, visual, or audio information (or a combination of information), or a graphical-type user interface (UI) (or GUI).
  • the computer 402 can serve in a role as a client, network component, a server, a database or other persistency, or any other component (or a combination of roles) of a computer system for performing the subject matter described in the instant disclosure.
  • the illustrated computer 402 is communicably coupled with a network 430 .
  • one or more components of the computer 402 may be configured to operate within environments, including cloud-computing-based, local, global, or other environment (or a combination of environments).
  • the computer 402 is an electronic computing device operable to receive, transmit, process, store, or manage data and information associated with the described subject matter. According to some implementations, the computer 402 may also include or be communicably coupled with an application server, e-mail server, web server, caching server, streaming data server, or other server (or a combination of servers).
  • the computer 402 can receive requests over network 430 from a client application (for example, executing on another computer 402 ) and respond to the received requests by processing the received requests using an appropriate software application(s).
  • requests may also be sent to the computer 402 from internal users (for example, from a command console or by other appropriate access method), external or third-parties, other automated applications, as well as any other appropriate entities, individuals, systems, or computers.
  • Each of the components of the computer 402 can communicate using a system bus 403 .
  • any or all of the components of the computer 402 may interface with each other or the interface 404 (or a combination of both), over the system bus 403 using an application programming interface (API) 412 or a service layer 413 (or a combination of the API 412 and service layer 413 ).
  • the API 412 may include specifications for routines, data structures, and object classes.
  • the API 412 may be either computer-language independent or dependent and refer to a complete interface, a single function, or even a set of APIs.
  • the service layer 413 provides software services to the computer 402 or other components (whether or not illustrated) that are communicably coupled to the computer 402 .
  • the functionality of the computer 402 may be accessible for all service consumers using this service layer.
  • Software services, such as those provided by the service layer 413 provide reusable, defined functionalities through a defined interface.
  • the interface may be software written in JAVA, C++, or other suitable language providing data in extensible markup language (XML) format or other suitable format.
  • alternative implementations may illustrate the API 412 or the service layer 413 as stand-alone components in relation to other components of the computer 402 or other components (whether or not illustrated) that are communicably coupled to the computer 402 .
  • any or all parts of the API 412 or the service layer 413 may be implemented as child or sub-modules of another software module, enterprise application, or hardware module without departing from the scope of this disclosure.
  • the computer 402 includes an interface 404 . Although illustrated as a single interface 404 in FIG. 4 , two or more interfaces 404 may be used according to particular needs, desires, or particular implementations of the computer 402 .
  • the interface 404 is used by the computer 402 for communicating with other systems that are connected to the network 430 (whether illustrated or not) in a distributed environment.
  • the interface 404 includes logic encoded in software or hardware (or a combination of software and hardware) and is operable to communicate with the network 430 . More specifically, the interface 404 may include software supporting one or more communication protocols associated with communications such that the network 430 or interface's hardware is operable to communicate physical signals within and outside of the illustrated computer 402 .
  • the computer 402 includes a processor 405 . Although illustrated as a single processor 405 in FIG. 4 , two or more processors may be used according to particular needs, desires, or particular implementations of the computer 402 . Generally, the processor 405 executes instructions and manipulates data to perform the operations of the computer 402 and any algorithms, methods, functions, processes, flows, and procedures as described in the instant disclosure.
  • the computer 402 also includes a database 406 that can hold data for the computer 402 or other components (or a combination of both) that can be connected to the network 430 (whether illustrated or not).
  • database 406 can be an in-memory, conventional, or other type of database storing data consistent with this disclosure.
  • database 406 can be a combination of two or more different database types (for example, a hybrid in-memory and conventional database) according to particular needs, desires, or particular implementations of the computer 402 and the described functionality.
  • two or more databases can be used according to particular needs, desires, or particular implementations of the computer 402 and the described functionality.
  • database 406 is illustrated as an integral component of the computer 402 , in alternative implementations, database 406 can be external to the computer 402 . As illustrated, the database 406 holds one or more trained CNNs 416 , pre-computed reusable factors 418 of each convolutional layer of the one or more trained CNNs 416 , and Winograd algorithm 426 , for fast computation of a CNN.
  • the computer 402 also includes a memory 407 that can hold data for the computer 402 or other components (or a combination of both) that can be connected to the network 430 (whether illustrated or not).
  • Memory 407 can store any data consistent with this disclosure.
  • memory 407 can be a combination of two or more different types of memory (for example, a combination of semiconductor and magnetic storage) according to particular needs, desires, or particular implementations of the computer 402 and the described functionality. Although illustrated as a single memory 407 in FIG. 4 , two or more memories 407 (of the same or combination of types) can be used according to particular needs, desires, or particular implementations of the computer 402 and the described functionality. While memory 407 is illustrated as an integral component of the computer 402 , in alternative implementations, memory 407 can be external to the computer 402 .
  • the application 408 is an algorithmic software engine providing functionality according to particular needs, desires, or particular implementations of the computer 402 , particularly with respect to functionality described in this disclosure.
  • application 408 can serve as one or more components, modules, or applications.
  • the application 408 may be implemented as multiple applications 408 on the computer 402 .
  • the application 408 can be external to the computer 402 .
  • the computer 402 can also include a power supply 414 .
  • the power supply 414 can include a rechargeable or non-rechargeable battery that can be configured to be either user- or non-user-replaceable.
  • the power supply 414 can include power-conversion or management circuits (including recharging, standby, or other power management functionality).
  • the power supply 414 can include a power plug to allow the computer 402 to be plugged into a wall socket or another power source to, for example, power the computer 402 or recharge a rechargeable battery.
  • there may be any number of computers 402 associated with, or external to, a computer system containing computer 402 , each computer 402 communicating over network 430 .
  • the terms "client," "user," and other appropriate terminology may be used interchangeably, as appropriate, without departing from the scope of this disclosure.
  • this disclosure contemplates that many users may use one computer 402 , or that one user may use multiple computers 402 .
  • Described implementations of the subject matter can include one or more features, alone or in combination.
  • a computer-implemented method including: obtaining, by a data processing apparatus, a trained convolutional neural network including one or more convolutional layers, each of the one or more convolutional layers including a number of filters with known filter parameters; pre-computing, by the data processing apparatus, a reusable factor for each of the one or more convolutional layers based on the known filter parameters of the trained convolutional neural network; receiving, by the data processing apparatus, input data to the trained convolutional neural network; computing, by the data processing apparatus, an output of the each of the one or more convolutional layers using a Winograd convolutional operator based on the pre-computed reusable factor and the input data; and determining, by the data processing apparatus, output data of the trained convolutional network based on the output of the each of the one or more convolutional layers.
  • a non-transitory, computer-readable medium storing one or more instructions executable by a computer system to perform operations including: obtaining a trained convolutional neural network including one or more convolutional layers, each of the one or more convolutional layers including a number of filters with known filter parameters; pre-computing a reusable factor for each of the one or more convolutional layers based on the known filter parameters of the trained convolutional neural network; receiving input data to the trained convolutional neural network; computing an output of the each of the one or more convolutional layers using a Winograd convolutional operator based on the pre-computed reusable factor and the input data; and determining output data of the trained convolutional network based on the output of the each of the one or more convolutional layers.
  • a computer-implemented system including: one or more computers; and one or more computer memory devices interoperably coupled with the one or more computers and having tangible, non-transitory, machine-readable media storing instructions, that when executed by the one or more computers, perform operations including: obtaining a trained convolutional neural network including one or more convolutional layers, each of the one or more convolutional layers including a number of filters with known filter parameters; pre-computing a reusable factor for each of the one or more convolutional layers based on the known filter parameters of the trained convolutional neural network; receiving input data to the trained convolutional neural network; computing an output of the each of the one or more convolutional layers using a Winograd convolutional operator based on the pre-computed reusable factor and the input data; and determining output data of the trained convolutional network based on the output of the each of the one or more convolutional layers.
  • a first feature combinable with any of the following features, wherein pre-computing a reusable factor for each of the one or more convolutional layers based on the known filter parameters of the trained convolutional neural network includes computing the reusable factor for each of the one or more convolutional layers based on the known filter parameters of the trained convolutional neural network offline regardless of the input data to the trained convolutional neural network.
  • a second feature combinable with any of the previous or following features, further including saving the pre-computed reusable factor for each of the one or more convolutional layers.
  • a third feature combinable with any of the previous or following features, wherein computing an output of the each of the one or more convolutional layers based on the pre-computed reusable factor and the input data includes computing an output of the each of the one or more convolutional layers according to a Winograd minimal filtering algorithm with the pre-computed reusable factor and the input data.
  • a fourth feature combinable with any of the previous or following features, wherein the input data to the trained convolutional neural network includes one or more images.
  • a fifth feature, combinable with any of the previous or following features, the method or the operations further including: receiving, by the data processing apparatus, second input data to the trained convolutional neural network; computing, by the data processing apparatus, a second output of the each of the one or more convolutional layers based on the pre-computed reusable factor and the second input data; and predicting, by the data processing apparatus, second output data of the trained convolutional network based on the second output of the each of the one or more convolutional layers.
  • a sixth feature combinable with any of the previous or following features, the method or the operations further including outputting the output data of the trained convolutional network via a user interface.
  • Implementations of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
  • Software implementations of the described subject matter can be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible, non-transitory, computer-readable computer-storage medium for execution by, or to control the operation of, data processing apparatus.
  • the program instructions can be encoded in/on an artificially generated propagated signal, for example, a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
  • the computer-storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of computer-storage mediums.
  • Configuring one or more computers means that the one or more computers have installed hardware, firmware, or software (or combinations of hardware, firmware, and software) so that when the software is executed by the one or more computers, particular computing operations are performed.
  • real-time means that an action and a response are temporally proximate such that an individual perceives the action and the response occurring substantially simultaneously.
  • time difference for a response to display (or for an initiation of a display) of data following the individual's action to access the data may be less than 1 ms, less than 1 sec., or less than 5 secs.
  • the term "data processing apparatus" refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including, by way of example, a programmable processor, a computer, or multiple processors or computers.
  • the apparatus can also be, or further include, special purpose logic circuitry, for example, a central processing unit (CPU), an FPGA (field programmable gate array), or an ASIC (application-specific integrated circuit).
  • the data processing apparatus or special purpose logic circuitry may be hardware- or software-based (or a combination of both hardware- and software-based).
  • the apparatus can optionally include code that creates an execution environment for computer programs, for example, code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of execution environments.
  • the present disclosure contemplates the use of data processing apparatuses with or without conventional operating systems, for example LINUX, UNIX, WINDOWS, MAC OS, ANDROID, IOS, or any other suitable conventional operating system.
  • a computer program which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • a computer program may, but need not, correspond to a file in a file system.
  • a program can be stored in a portion of a file that holds other programs or data, for example, one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, for example, files that store one or more modules, sub-programs, or portions of code.
  • a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
  • While portions of the programs illustrated in the various figures are shown as individual modules that implement the various features and functionality through various objects, methods, or other processes, the programs may instead include a number of sub-modules, third-party services, components, libraries, and such, as appropriate. Conversely, the features and functionality of various components can be combined into single components, as appropriate. Thresholds used to make computational determinations can be statically, dynamically, or both statically and dynamically determined.
  • the methods, processes, or logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output.
  • the methods, processes, or logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, for example, a CPU, an FPGA, or an ASIC.
  • Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors, both, or any other kind of CPU.
  • generally, a CPU will receive instructions and data from a memory and write data to a memory.
  • the essential elements of a computer are a CPU, for performing or executing instructions, and one or more memory devices for storing instructions and data.
  • a computer will also include, or be operatively coupled to receive data from or transfer data to (or both), one or more mass storage devices for storing data, for example, magnetic disks, magneto-optical disks, or optical disks.
  • a computer need not have such devices.
  • a computer can be embedded in another device, for example, a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device, for example, a universal serial bus (USB) flash drive, to name just a few.
  • Computer-readable media suitable for storing computer program instructions and data include all forms of permanent/non-permanent or volatile/non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, for example, random access memory (RAM), read-only memory (ROM), phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and flash memory devices; magnetic devices, for example, tape, cartridges, cassettes, internal/removable disks; magneto-optical disks; and optical memory devices, for example, digital video disc (DVD), CD-ROM, DVD+/-R, DVD-RAM, DVD-ROM, HD-DVD, and BLURAY, and other optical memory technologies.
  • the memory may store various objects or data, including caches, classes, frameworks, applications, modules, backup data, jobs, web pages, web page templates, data structures, database tables, repositories storing dynamic information, and any other appropriate information including any parameters, variables, algorithms, instructions, rules, constraints, or references thereto. Additionally, the memory may include any other appropriate data, such as logs, policies, security or access data, reporting files, as well as others.
  • the processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
  • implementations of the subject matter described in this specification can be implemented on a computer having a display device, for example, a CRT (cathode ray tube), LCD (liquid crystal display), LED (Light Emitting Diode), or plasma monitor, for displaying information to the user and a keyboard and a pointing device, for example, a mouse, trackball, or trackpad by which the user can provide input to the computer.
  • Input may also be provided to the computer using a touchscreen, such as a tablet computer surface with pressure sensitivity, a multi-touch screen using capacitive or electric sensing, or other type of touchscreen.
  • a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.
  • The term graphical user interface (GUI) may be used in the singular or the plural to describe one or more graphical user interfaces and each of the displays of a particular graphical user interface. Therefore, a GUI may represent any graphical user interface, including but not limited to, a web browser, a touch screen, or a command line interface (CLI) that processes information and efficiently presents the information results to the user.
  • a GUI may include one or more user interface (UI) elements, some or all associated with a web browser, such as interactive fields, pull-down lists, and buttons. These and other UI elements may be related to or represent the functions of the web browser.
  • Implementations of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, for example, as a data server, or that includes a middleware component, for example, an application server, or that includes a front-end component, for example, a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components.
  • the components of the system can be interconnected by any form or medium of wireline or wireless digital data communication (or a combination of data communication), for example, a communication network.
  • Examples of communication networks include a local area network (LAN), a radio access network (RAN), a metropolitan area network (MAN), a wide area network (WAN), Worldwide Interoperability for Microwave Access (WIMAX), a wireless local area network (WLAN) using, for example, 802.11 a/b/g/n or 802.20 (or a combination of 802.11x and 802.20 or other protocols consistent with this disclosure), all or a portion of the Internet, or any other communication system or systems at one or more locations (or a combination of communication networks).
  • the network may communicate with, for example, Internet Protocol (IP) packets, Frame Relay frames, Asynchronous Transfer Mode (ATM) cells, voice, video, data, or other suitable information (or a combination of communication types) between network addresses.
  • the computing system can include clients and servers.
  • a client and a server are generally remote from each other and typically interact through a communication network.
  • the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • any claimed implementation is considered to be applicable to at least a computer-implemented method; a non-transitory, computer-readable medium storing computer-readable instructions to perform the computer-implemented method; and a computer system including a computer memory interoperably coupled with a hardware processor configured to perform the computer-implemented method or the instructions stored on the non-transitory, computer-readable medium.

Abstract

A computer-implemented method includes obtaining a trained convolutional neural network comprising one or more convolutional layers, each of the one or more convolutional layers comprising a plurality of filters with known filter parameters; pre-computing a reusable factor for each of the one or more convolutional layers based on the known filter parameters of the trained convolutional neural network; receiving input data to the trained convolutional neural network; computing an output of the each of the one or more convolutional layers using a Winograd convolutional operator based on the pre-computed reusable factor and the input data; and determining output data of the trained convolutional network based on the output of the each of the one or more convolutional layers.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application is a continuation of PCT Application No. PCT/CN2018/111655, filed on Oct. 24, 2018, which is hereby incorporated by reference in its entirety.
  • BACKGROUND
  • A convolutional neural network (CNN or ConvNet) is a type of artificial neural network in machine learning. It is commonly used in analyzing visual imagery, for example, image recognition and classification. For example, in a training phase of a machine learning process, a CNN can be trained or learned based on training data. Then, in a prediction phase of a machine learning process, the trained CNN serves as a model that receives input data and outputs predictions or decisions based on processing and analyzing the input data.
  • SUMMARY
  • The present disclosure describes fast computation of a convolutional neural network (CNN).
  • In an implementation, a computer-implemented method includes obtaining a trained convolutional neural network including one or more convolutional layers, each of the one or more convolutional layers including a number of filters with known filter parameters; pre-computing a reusable factor for each of the one or more convolutional layers based on the known filter parameters of the trained convolutional neural network; receiving input data to the trained convolutional neural network; computing an output of the each of the one or more convolutional layers using a Winograd convolutional operator based on the pre-computed reusable factor and the input data; and determining output data of the trained convolutional network based on the output of the each of the one or more convolutional layers.
  • The previously described implementation is implementable using a computer-implemented method; a non-transitory, computer-readable medium storing computer-readable instructions to perform the computer-implemented method; and a computer-implemented system including a computer memory interoperably coupled with a hardware processor configured to perform the computer-implemented method/the instructions stored on the non-transitory, computer-readable medium.
  • The subject matter described in this specification can be implemented in particular implementations, so as to realize one or more of the following advantages. First, the described subject matter improves a CNN model by performing equivalent transformation or conversion of a computational graph to streamline the network structure of the CNN, and thus allows optimization of the implementation of the CNN in both a device-independent and a device-dependent manner. Example device-independent optimizations include redundant operation removal (for example, removing identity operators in a deep learning framework, which only pass data from the input tensor to the output tensor without changing the data content) and layer/operation merging to reduce computational complexity (for example, merging a convolutional layer with a batch normalization layer). Example device-dependent optimizations include improving hardware efficiency by merging operations (for example, merging a convolutional layer with the bias addition operation that immediately follows it) to better exploit hardware computing capability, and flexibly deploying operations on different underlying architectures to maximize CNN throughput. Second, the described subject matter reduces the computational load of data prediction based on a trained CNN and improves prediction speed, and thus reduces network latency and improves throughput of the CNN. Third, the described subject matter requires less computational power. Fourth, the described subject matter can combine multiple operators into a new operator to take advantage of underlying hardware accelerators (such as GPUs, FPGAs, or ASIC chips). For example, in a CNN, a convolutional layer is typically followed by a bias addition operation or a batch normalization layer. In the case of a bias addition operation, if the convolution and bias addition are treated as two separate operations, the convolution is computed on GPUs and the bias values are then added to the results of the preceding convolution on GPUs. In this way, the computing capability of GPUs is not fully exploited because of the small amount of computation in the bias addition operation. Instead, in the described subject matter, the convolution and bias addition operations can be combined into a single operation in some implementations. In this single operation, the corresponding bias value is added to the result of the convolution directly in the same GPU kernel launch, leading to better speed. Similarly, in the case of batch normalization, the combination of a convolutional layer and a batch normalization layer can be equivalently transformed offline into a combination of a convolutional layer and a bias addition, and the aforementioned device-dependent optimizations can then be applied to the convolutional layers and bias addition operations to further exploit GPU power and improve computation speed (see the sketch following this paragraph). These examples apply equally to processors or accelerators other than GPUs. Other advantages will be apparent to those of ordinary skill in the art.
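  • For illustration only, the following is a minimal NumPy sketch of the offline batch-normalization folding described above, assuming the standard per-output-channel batch normalization formulation; the function name and shapes are illustrative and not part of the claimed subject matter:

```python
import numpy as np

def fold_batch_norm(W, gamma, beta, mean, var, eps=1e-5):
    """Fold a per-channel batch normalization into the preceding
    convolution, producing scaled weights and an equivalent bias.

    W: filter tensor of shape [C_out, C_in, R, S].
    gamma, beta, mean, var: BN parameters, one value per output channel.
    Returns (W_folded, bias) such that, per output channel,
    BN(conv(x, W)) == conv(x, W_folded) + bias.
    """
    scale = gamma / np.sqrt(var + eps)         # [C_out]
    W_folded = W * scale[:, None, None, None]  # scale each output filter
    bias = beta - mean * scale                 # equivalent bias term
    return W_folded, bias
```

The folded layer can then be merged with the bias addition in a single kernel launch, as described above.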
  • The details of one or more implementations of the subject matter of this specification are set forth in the Detailed Description, the claims, and the accompanying drawings. Other features, aspects, and advantages of the subject matter will become apparent from the Detailed Description, the claims, and the accompanying drawings.
  • DESCRIPTION OF DRAWINGS
  • FIG. 1 is a block diagram illustrating an example equivalent transformation of two computational graphs of a convolutional neural network (CNN), according to an implementation of the present disclosure.
  • FIG. 2 is a screenshot illustrating example pseudocode of fast computation of a CNN, according to an implementation of the present disclosure.
  • FIG. 3 is a flow chart illustrating an example method for fast computation of a CNN, according to an implementation of the present disclosure.
  • FIG. 4 is a block diagram illustrating an example computer system used to provide computational functionalities associated with described algorithms, methods, functions, processes, flows, and procedures as described in the instant disclosure, according to an implementation of the present disclosure.
  • Like reference numbers and designations in the various drawings indicate like elements.
  • DETAILED DESCRIPTION
  • The following detailed description describes fast computation of a convolutional neural network (CNN), and is presented to enable any person skilled in the art to make and use the disclosed subject matter in the context of one or more particular implementations. Various modifications, alterations, and permutations of the disclosed implementations can be made and will be readily apparent to those of ordinary skill in the art, and the general principles defined may be applied to other implementations and applications without departing from the scope of the disclosure. In some instances, details unnecessary to obtain an understanding of the described subject matter may be omitted so as to not obscure one or more described implementations with unnecessary detail, inasmuch as such details are within the skill of one of ordinary skill in the art. The present disclosure is not intended to be limited to the described or illustrated implementations, but to be accorded the widest scope consistent with the described principles and features.
  • A convolutional neural network (CNN or ConvNet) is one of the most representative network structures and technological innovations for deep learning. It has achieved great success in the field of imagery and is also widely used to solve problems in other fields. A CNN can include one or more of an input layer, a convolutional layer, an activation function, a batch normalization, a pooling layer, a fully connected layer, and an output layer. Among them, the convolutional layer typically involves the most computational load and may consume the longest computation time of the entire CNN. A convolutional layer can include one or more filters (also referred to as kernels). The convolutional layer can receive input data, perform a convolution operation of the input data with each of one or more filters of the convolutional layer, and generate output data of the convolutional layer. In some instances, a CNN can include tens of convolutional layers.
  • The described techniques can help accelerate the convolution operation, which is the core operation of the convolutional layer. In turn, the described techniques can improve the computational efficiency and reduce the computational load of a CNN.
  • The described techniques can have a variety of applications. For example, the described techniques can be applied in face recognition in areas such as unmanned supermarkets, unmanned banks, security protection, and smart cities. For example, deep convolutional neural networks have been used in face recognition. The described techniques can help deal with tasks of face identification, especially among a large population. The described techniques can improve response time of face recognition based on a trained CNN model. The described techniques can reduce the processing time of each request of face recognition. In turn, system throughput can be increased and operation costs can be reduced without increasing computing resources.
  • As another example, the described techniques can be used in auto insurance. The described techniques can automatically identify a surface damage of a vehicle based on deep convolutional neural network image technology. For example, after a car accident, the car surface damage can be photographed, and then uploaded to an insurance company's claim server, which can perform automatic damage identification and compensation quota valuation. The described techniques can be used by the insurance claim server to perform automatic damage identification based on deep convolutional neural network technology. The described techniques can allow the insurance company to provide faster and higher throughput without increasing equipment budget.
  • Aside from the above-discussed examples, the described techniques can benefit many other applications that make use of CNN technology.
  • In some implementations, the described techniques can be implemented using a computational graph. Computational graphs can be used to represent machine learning algorithms, especially in the field of deep learning. A computational graph can be a directed graph that includes multiple nodes, where the nodes correspond to operations or variables. Variables can feed their value into operations, and operations can feed their output into other operations. This way, the node in the computational graph can define a function of the variables. The values that are fed into the nodes (i.e., input) and come out of the nodes (i.e., output) can be represented by tensors. A tensor can be regarded as a multi-dimensional array. A tensor can encompass scalars, vectors, matrices, and tensors of a higher rank or dimensions. Using tensors to represent input and output data of the variable and operator can help achieve automatic differentiation, dynamic memory management, and other functionalities.
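  • For illustration only, a toy computational graph along these lines can be sketched in a few lines of Python (the Node class is hypothetical and greatly simplified relative to a real deep learning framework):

```python
import numpy as np

class Node:
    """One node of a toy computational graph: an operation whose
    inputs are other nodes or constant tensors."""
    def __init__(self, op, inputs):
        self.op = op          # a callable over NumPy tensors
        self.inputs = inputs  # upstream Node objects or tensors

    def evaluate(self):
        args = [x.evaluate() if isinstance(x, Node) else x
                for x in self.inputs]
        return self.op(*args)

# z = (x * w) + b expressed as a two-node directed graph
x, w, b = np.array(3.0), np.array(2.0), np.array(1.0)
mul = Node(np.multiply, [x, w])
add = Node(np.add, [mul, b])
print(add.evaluate())  # 7.0
```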
  • For an operator, the computational graph only needs to define the function of the operator. It is not necessary to specify the specific implementation of each operator. Therefore, the computational graph provides the flexibility such that the operator can be executed or otherwise implemented on one or more of a CPU or a hardware accelerator, such as a GPU, FPGA, or AI chip.
  • The storage and access to the data variables can be either local or remote. Computational graphs can be used for model training, model prediction or inference, or other phases of a machine learning process.
  • FIG. 1 is a block diagram illustrating an example equivalent transformation 101 of two computational graphs 100 and 105 of a CNN, according to an implementation of the present disclosure. The two computational graphs 100 and 105 can represent the same CNN. For simplicity, the two computational graphs 100 and 105 illustrate operations or functions of a single convolutional layer of the CNN. A CNN can include multiple layers and can be represented by a computational graph accordingly.
  • As illustrated, tensor X 110 is an input and tensor Z 170 is an output of the convolutional layer of the CNN represented by the computational graph 100, respectively. The computational graph 100 includes two nodes 130 and 150. Each node corresponds to a variable or an operator, which can be regarded as a fine-grained basic operation of a neural network. The node 130 represents a convolution operator, Conv2D. The Conv2D 130 can represent the convolution operation of the input data tensor X 110 with a tensor W 120, where the tensor W 120 represents weights or parameters of filters of the convolutional layer of the CNN. After the convolution operation, the Conv2D 130 can output tensor Y 140. The output tensor Y 140 and a bias 160 can serve as the input to the node 150, which represents the operation of adding the bias 160 to the output data Y 140, resulting in the output Z 170 of the convolutional layer of the CNN.
  • Similarly, tensor X′ 115 is an input and tensor Z′ 175 is an output of the convolutional layer of the CNN represented by the computational graph 105, respectively. The computational graph 105 includes two nodes 135 and 155. The node 135 represents a convolution operator, Wino-Conv2D. The Wino-Conv2D 135 can represent a modified convolution operation of the input data tensor X′ 115 with a tensor U 125, where the tensor U 125 represents modified weights or parameters of filters of the convolutional layer of the CNN. After the convolution operation, the Wino-Conv2D 135 can output tensor Y′ 145. The output tensor Y′ 145 and a bias 165 can serve as the input to the node 155, which represents the operation of adding the bias 165 to the output data Y′ 145, resulting in the output Z′ 175 of the convolutional layer of the CNN.
  • In some implementations, the computational graph 105 is an equivalent transformation of the computational graph 100. Given the same input (i.e., X 110=X′ 115), the two computational graphs 100 and 105 can generate the same output (i.e., Z 170=Z′ 175).
  • In some implementations, compared to the computational graph 100, the computational graph 105 can represent an improved or optimized convolution operation for fast computation of the CNN. For example, after a training phase of a machine learning process, parameters of the nodes (e.g., the values of the filter or kernel parameters in the tensor W 120 or tensor U 125) of the computational graph have been trained and are known. In the prediction phase of a machine learning process, the values of these parameters remain unchanged. In other words, no matter how the input tensor X′ 115 changes, the value of the tensor U 125 will not change. As such, the tensor U 125 can be computed after the training, before performing prediction based on any input data X′ 115. By computing the tensor U 125 in advance and reusing the pre-computed tensor U 125 for any input data X′ 115 to the convolutional layer of the CNN, the computational load of the CNN can be reduced, especially for prediction of multiple inputs to a CNN.
  • As such, in the computational graph 105, the new convolution operator Wino-Conv2D 135 can replace the original convolution operator Conv2D 130 in the original computational graph 100. The new convolution operator Wino-Conv2D 135 uses the tensor U 125 instead of the original tensor W 120 as the filter or kernel parameter. The computational graph 105 can achieve improved computational efficiency while preserving the equivalence with the original computational graph 100.
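  • For illustration only, the equivalent transformation 101 can be viewed as an offline graph-rewriting pass; the sketch below assumes a hypothetical representation in which each node is a dict, and takes the filter-transform routine (steps 1-6 of FIG. 2, described next) as a parameter:

```python
def rewrite_conv_nodes(nodes, precompute_U):
    """Replace every Conv2D node with a Wino-Conv2D node that carries
    the pre-computed tensor U instead of the original weight tensor W.

    nodes: list of dicts, e.g. {"op": "Conv2D", "W": <weight tensor>}.
    precompute_U: callable computing U from W (steps 1-6 of FIG. 2).
    """
    for node in nodes:
        if node["op"] == "Conv2D":
            node["op"] = "Wino-Conv2D"
            node["U"] = precompute_U(node.pop("W"))  # done once, offline
    return nodes
```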
  • FIG. 2 is a screenshot illustrating example pseudocode of an example algorithm 200 for fast computation of a CNN, according to an implementation of the present disclosure. The example algorithm 200 is based on the Winograd minimal filtering algorithm, a fast algorithm for CNNs.
  • The convolution operation of a convolutional layer of the CNN can be defined as a specified correlation between an input (e.g., represented by an input tensor X) and a filter or kernel of the convolutional layer (represented by a filter or kernel tensor W), resulting in an output (e.g., represented by an output tensor Y). For simplicity, assume the convolution operation uses a stride of 1. For a given convolutional layer, the input tensor X can have a size of [N; Cin; H; Wi], where N represents a batch size of input data to be processed (e.g., a batch operation of N images to be convolved); Cin represents the number of channels (e.g., an image from a standard digital camera has three channels: red, green, and blue, where each channel can include a 2d matrix having pixel values (e.g., in the range of 0 to 255)); and H and Wi represent the height and width of each channel (e.g., 255×255), respectively. The filter or kernel tensor W can have a size of [Cout; Cin; R; S], where Cout represents the number of output channels of the convolution operation; Cin represents the number of input channels of the convolution operation; and R and S represent the height and width of each filter or kernel of the convolutional layer, respectively. Typically, R and S are set to the same value.
  • The output tensor Y can have a size of [N; Cout; H; Wi], with element

    $$Y_{i,o,x,y} = \sum_{c=1}^{C_{in}} \sum_{u=1}^{R} \sum_{v=1}^{S} X_{i,c,y+u,x+v}\, W_{o,c,u,v} \qquad (1)$$
  • where i is an integer from the range of [0, N) and o is an integer from the range of [0, Cout).
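  • For illustration only, Equation (1) can be evaluated directly as in the NumPy sketch below. The sketch uses zero-based indexing and, for simplicity, computes only the valid (unpadded) output region, so its output is slightly smaller than the [N; Cout; H; Wi] size stated above, which assumes padding:

```python
import numpy as np

def conv2d_by_definition(X, W):
    """Direct evaluation of Equation (1) with stride 1 and no padding.
    X: input tensor [N, C_in, H, Wi].
    W: filter tensor [C_out, C_in, R, S].
    Returns Y of size [N, C_out, H - R + 1, Wi - S + 1].
    """
    N, C_in, H, Wi = X.shape
    C_out, _, R, S = W.shape
    Y = np.zeros((N, C_out, H - R + 1, Wi - S + 1))
    for i in range(N):
        for o in range(C_out):
            for y in range(H - R + 1):
                for x in range(Wi - S + 1):
                    # Sum over c, u, v as in Equation (1)
                    Y[i, o, y, x] = np.sum(X[i, :, y:y + R, x:x + S] * W[o])
    return Y
```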
  • There are several ways to implement a convolution operation of a convolutional layer, such as calculation by definition (e.g., as given by Equation (1)), conversion to matrix multiplication, or using Winograd fast algorithm. For example, when the size of each convolution kernel or filter is 1×1 (that is, R and S are equal to 1 at the same time), the above convolution operation can be converted to matrix multiplication. In some implementations, for example, in cases where R>1 or S>1, the convolution operation can be implemented based on Winograd fast algorithm.
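  • For illustration only, the 1×1 case mentioned above reduces to a single matrix multiplication over the channel axis, as in this NumPy sketch (illustrative, not the patent's implementation):

```python
def conv1x1_as_matmul(X, W):
    """1x1 convolution as matrix multiplication (NumPy arrays assumed).
    X: [N, C_in, H, Wi]; W: [C_out, C_in, 1, 1] -> Y: [N, C_out, H, Wi].
    """
    N, C_in, H, Wi = X.shape
    C_out = W.shape[0]
    Xmat = X.transpose(1, 0, 2, 3).reshape(C_in, -1)  # [C_in, N*H*Wi]
    Ymat = W.reshape(C_out, C_in) @ Xmat              # [C_out, N*H*Wi]
    return Ymat.reshape(C_out, N, H, Wi).transpose(1, 0, 2, 3)
```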
  • According to the Winograd fast algorithm, for convolution of an input channel with a convolution filter or kernel with a size of r×s and an output size of m×n, a minimal filtering algorithm F (m×n; r×s) can be found that only requires (m+r−1) (n+s−1) multiplications to complete the convolution operation. For example, a given input channel of size H×Wi can be divided into overlapping sub-blocks, each sub-block having a size of m×n. A convolution calculation can be performed on each sub-block separately to generate the corresponding output channel (recall the above assumption of the stride being 1). As described above, the width R and height S of the convolution filter or kernel in the convolution layer are generally set to the same value. Typically, for sub-block partitioning, m and n are set to the same values as well.
  • For simplicity, the example algorithm 200 considers the case where m=n and r=s. That is, a minimal filtering algorithm F(m×m; r×r) is considered. The steps 1-23 of the example algorithm 200 show an example implementation of the Winograd algorithm for a convolution operation of a single convolutional layer of a CNN. The CNN is characterized by $\omega_{o,c} \in \mathbb{R}^{r \times r}$, the filter or kernel between the cth input channel and the oth output channel, and a tensor W including parameters or weights of the filters or kernels of the CNN. Tensor X represents the input data to the CNN.
  • $P = N \lceil H/m \rceil \lceil W_i/m \rceil$ represents the number of input sub-blocks (tiles) across the batch. $\alpha = m + r - 1$ represents the size of each input sub-block (adjacent sub-blocks have $r-1$ overlapping elements). $\chi_{c,p} \in \mathbb{R}^{\alpha \times \alpha}$ represents the pth sub-block of the cth input channel. $Y_{o,p} \in \mathbb{R}^{\alpha \times \alpha}$ represents the pth sub-block of the oth output channel. $B^T$ and $A^T$ are transform matrices corresponding to the input X, given by:
  • $$A^T = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{bmatrix} \qquad (2)$$

    $$B^T = \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix} \qquad (3)$$
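  • For illustration only, the sketch below applies the transforms of Equations (2) and (3) to a single 4×4 input sub-block and a single 3×3 filter (the case F(2×2; 3×3), i.e., m=2, r=3, α=4), and checks the 2×2 output tile against the direct convolution of Equation (1). The filter-transform matrix G is the standard choice from the Winograd literature and is an assumption here, since it is computed in steps 1-6 of the algorithm 200 but not reproduced in the text above:

```python
import numpy as np

A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)        # Equation (2)
B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]], dtype=float)         # Equation (3)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])                      # assumed filter transform

def winograd_tile(d, g):
    """Convolve one 4x4 input sub-block d with one 3x3 filter g,
    producing a 2x2 output tile with 16 elementwise multiplications."""
    U = G @ g @ G.T               # filter transform (input-independent)
    V = B_T @ d @ B_T.T           # input transform
    return A_T @ (U * V) @ A_T.T  # output (inverse) transform

# Sanity check against the direct correlation of Equation (1).
rng = np.random.default_rng(0)
d, g = rng.standard_normal((4, 4)), rng.standard_normal((3, 3))
direct = np.array([[np.sum(d[y:y + 3, x:x + 3] * g) for x in range(2)]
                   for y in range(2)])
assert np.allclose(winograd_tile(d, g), direct)
```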
  • In existing techniques for computation of a CNN based on the Winograd algorithm, the example algorithm 200 (including steps 1-23) is executed for each input channel (e.g., a channel of an image) for each convolutional layer of the CNN. For example, if a CNN includes 5 convolutional layers, for each input channel, the example algorithm 200 is executed 5 times. If 10 channels are input for prediction, the example algorithm 200 will be executed 50 times.
  • Unlike the existing techniques, the described techniques for fast computation of a CNN can reduce the computation load, especially given multiple input channels. The described techniques for fast computation of a CNN can reuse factors that are common and unchanged during the prediction phase of a machine learning process. The described techniques for fast computation of a CNN identify such factors and pre-compute them.
  • As such, those reusable factors only need to be computed once, regardless of the values of the input data. More specifically, in the example algorithm 200, the described fast computation algorithm can pre-compute the tensor U according to steps 1-6, because all the parameters needed for such a computation (e.g., W and $\omega_{o,c}$) are known given a trained CNN and remain unchanged during the prediction phase based on the trained CNN. For example, the tensor U can be extracted from the tensor W by offline processing. Accordingly, the steps 1-6 (collectively denoted as 210) of the example algorithm 200 only need to be executed once for each convolutional layer of the trained CNN. For example, if 10 channels are input for prediction, the steps 1-6 of the example algorithm 200 need to be executed only once, reducing the amount of computation in real-time model prediction and thereby reducing runtime, while the remaining steps 7-23 (collectively denoted as 220) are executed 10 times for each convolutional layer of the trained CNN. A sketch of this split follows.
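  • For illustration only, the split between the one-time steps 1-6 and the per-input steps 7-23 can be sketched as follows for F(2×2; 3×3); the function names, the einsum formulation, and the even-size assumption on H and Wi are illustrative choices, not the patent's implementation:

```python
import numpy as np

A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)
B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
                [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0], [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5], [0.0, 0.0, 1.0]])

def precompute_U(W):
    """Steps 1-6, run once per layer offline: U[o, c] = G w_{o,c} G^T.
    W: [C_out, C_in, 3, 3] -> U: [C_out, C_in, 4, 4]."""
    return np.einsum('au,ocuv,bv->ocab', G, W, G)

def wino_conv2d(X, U, m=2):
    """Steps 7-23, run per input: F(2x2; 3x3) convolution of one image
    X [C_in, H, Wi] with pre-computed U. Assumes H and Wi are even and
    returns the valid (H-2) x (Wi-2) output region."""
    C_out = U.shape[0]
    H, Wi = X.shape[1:]
    Y = np.zeros((C_out, H - 2, Wi - 2))
    for ty in range(0, H - 2, m):            # overlapping 4x4 sub-blocks
        for tx in range(0, Wi - 2, m):
            V = np.einsum('ab,cbv,uv->cau',
                          B_T, X[:, ty:ty + 4, tx:tx + 4], B_T)
            # Accumulate over input channels in the transform domain,
            # then apply the output transform once per sub-block.
            M = np.einsum('ocab,cab->oab', U, V)
            Y[:, ty:ty + m, tx:tx + m] = np.einsum('ma,oab,nb->omn',
                                                   A_T, M, A_T)
    return Y

# Offline, once per layer; then reused for every prediction request.
W = np.random.default_rng(1).standard_normal((4, 3, 3, 3))
U = precompute_U(W)
X1 = np.random.default_rng(2).standard_normal((3, 8, 8))
X2 = np.random.default_rng(3).standard_normal((3, 8, 8))
Y1, Y2 = wino_conv2d(X1, U), wino_conv2d(X2, U)   # U computed only once
```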
  • With the pre-computed tensor U based on the example Winograd algorithm 200, the computational graph of the trained CNN can be modified, for example, according to the equivalent transformation 101, to the computational graph 105, as shown in FIG. 1. For instance, the pre-computed tensor U based on the example Winograd algorithm 200 can be an example of the tensor U 125, which replaces the original weight tensor W 120 of a trained CNN. The convolution operator Wino-Conv2D 135 can be implemented based on steps 7-23 of the example Winograd algorithm 200, and replaces the original convolution operator Conv2D 130 in the original computational graph 100. The modified computational graph 105 can be used for prediction. Since the new computational graph 105 relieves the original computational graph 100 of the runtime calculation of the tensor U 125, the computational load of the modified computational graph 105 is reduced, which improves the computational efficiency of performing the convolution operation of a CNN. Simulation results have shown that the computation based on the modified computational graph 105, with the pre-computed U 125 and the convolution operator Wino-Conv2D 135, can improve the computation speed by as much as 30% compared to the computational graph 100 with the original weight tensor W 120 and the convolution operator Conv2D 130.
  • Note that FIG. 2 shows an example Winograd algorithm for the case of m=n and r=s. The described techniques for fast computation of a CNN can be adapted and applied to other cases including generic CNNs without specific requirements of m=n and r=s.
  • FIG. 3 is a flowchart of an example method 300 for fast computation of a CNN, according to an implementation. In some implementations, various steps of method 300 can be run in parallel, in combination, in loops, or in any order. For clarity of presentation, the description that follows generally describes method 300 in the context of the other figures in this description. However, it will be understood that method 300 may be performed, for example, by any suitable system, environment, software, and hardware, or a combination of systems, environments, software, and hardware, as appropriate. For example, the method 300 can be performed by a data processing apparatus that is configured to execute machine learning algorithms using CNNs. The data processing apparatus can include or be implemented by one or more of, for example, general-purpose CPUs or hardware accelerators such as GPUs, FPGAs, and even custom ASIC processors.
  • At 310, a trained CNN is obtained. Obtaining a trained CNN can include, for example, computing the trained CNN through a training phase or process of a machine learning process (e.g., based on training data or sample inputs), retrieving the trained CNN from a memory or another data store, or receiving the trained CNN from another source (e.g., another data processing apparatus that performs the training process). The trained CNN can receive and analyze input data (e.g., an input image), and predict output data (e.g., respective probabilities of categories or classifications of the input image).
  • The trained CNN has a known network architecture (e.g., an ordered sequence of different layers) defined by known parameters of each layer in the trained CNN. Each layer in the trained CNN can operate on the input data to the trained CNN (either directly or by virtue of operating on an output of a previous layer in the trained CNN). A layer that operates on data in the trained CNN prior to another layer is referred to as a prior, previous, or upstream layer relative to the other layer. A layer that operates on data in the trained CNN following or after another layer's operation on data is referred to as a later or downstream layer relative to the other layer. Typically, an output of a prior layer serves as an input to the adjacent later layer. As such, the input data to the trained CNN go through each layer of the trained CNN in an order from the beginning of the CNN (e.g., an input layer that receives the input data or the initial layer that operates directly on the input data) to the ending of the CNN (e.g., an output layer that outputs a result of the CNN).
  • Particularly, the trained CNN includes one or more convolutional layers. Each of the one or more convolutional layers includes a respective one or more filters (or kernels) with known filter parameters (e.g., the number of filters, filter sizes, parameter values of the filter, and connection weights). Different convolutional layers may include the same or different number of filters. Each filter is defined by known filter parameters or weights. In some implementations, a filter can be represented by a matrix, such as the filter ωo,c∈Rr×r as described with respect to FIG. 2. The values of the filter matrix and connection weights are learned and known during the training process, and these values will not change when using the trained CNN, for example, for prediction based on input data. In some implementations, the filters and/or the connection weights of a convolutional layer can be collectively represented by a tensor. For example, each convolutional layer of the trained CNN can be represented by the filter matrices and a weight tensor (e.g., the tensor W as described with respect to FIG. 2) that includes parameters or weights of the filters or kernels of the CNN. From 310, method 300 proceeds to 320.
  • At 320, a reusable factor for each of the one or more convolutional layers can be pre-computed based on the known filter parameters of the trained CNN. For example, for each of the one or more convolutional layers, a reusable factor that only depends on known, unchanged parameters can be identified and computed independently of any input data to the CNN. For example, the tensor U, as described with respect to the example algorithm 200 in FIG. 2, is an example of the reusable factor for each of the one or more convolutional layers of the trained CNN. The tensor U can be computed according to steps 1-6 of the example algorithm 200 because all the parameters needed for such a computation (e.g., W and ωo,c) are known given the trained CNN and remain unchanged during the prediction phase based on the trained CNN.
  • In some implementations, pre-computing a reusable factor for each of the one or more convolutional layers based on the known filter parameters of the trained CNN includes computing the reusable factor for each of the one or more convolutional layers based on the known filter parameters of the trained CNN offline regardless of the input data to the trained CNN. As a result, for each convolutional layer of the trained CNN, the reusable factor only needs to be computed once, for example, by offline processing and can be retrieved and reused during runtime computation. From 320, method 300 proceeds to 330.
  • At 330, the pre-computed reusable factor for each of the one or more convolutional layers can be saved, for example, in a data store for ease of later retrieval when using the trained CNN, for example, for prediction. In some implementations, the saved pre-computed reusable factor can be shared, published or otherwise transmitted to other data processing devices. From 330, method 300 proceeds to 340.
  • At 340, a determination is made as to whether input data to the trained CNN are received. In response to determining that input data to the trained CNN are received, method 300 proceeds to 350. Otherwise, if it is determined that no input data to the trained CNN are received, method 300 proceeds to 380, where method 300 stops.
  • The input data can include, for example, image data of one or more images (e.g., a photo including a face of a person), or other types of input data. Each image can include one or more channels. For example, an image from a standard digital camera can be regarded as having three channels: red, green, and blue. On the other hand, a grayscale image can be regarded as having a single channel.
  • In some implementations, the input data can be represented by a tensor that can include scalars, vectors, matrices, and high-dimensional tensors. As an example, a grayscale image can be represented by a 2d matrix of pixel values. The value of each pixel in the matrix can range, for example, from 0 to 255, with zero indicating black and 255 indicating white. In some implementations, the value of each pixel is normalized by dividing it by 255. Each channel of a color image can include a 2d matrix having pixel values, for example, in the range 0 to 255 or in the range 0 to 1.0 after normalization. The 2d-matrices stacked over each other (one for each color) can form a tensor that represents the color image. In some implementations, the image data can include a tensor that includes multiple images.
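  • For illustration only, a few NumPy lines show how channels and images stack into the input tensors described above (values and shapes are arbitrary):

```python
import numpy as np

# A 2x2 grayscale image: one channel, normalized from [0, 255] to [0, 1].
gray = np.array([[0, 128], [192, 255]], dtype=np.float32) / 255.0

# A color image: three HxW channel matrices stacked into a [3, H, W] tensor.
r = np.full((2, 2), 255, dtype=np.float32)
g = np.zeros((2, 2), dtype=np.float32)
b = np.zeros((2, 2), dtype=np.float32)
color = np.stack([r, g, b]) / 255.0       # shape (3, 2, 2): a red image

# A batch of N such images forms a 4-d input tensor of size [N, C, H, W].
X = np.stack([color, color])              # shape (2, 3, 2, 2)
```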
  • For example, the input tensor X as described with respect to the example algorithm 200 in FIG. 2 is an example of the input data. The input data can be divided into sub-blocks or channels, for example, according to the techniques as described with respect to the example algorithm 200. The convolution operation can be performed on each of the sub-blocks or channels of the data, for example, according to the Winograd algorithm or another method.
  • In some implementations, the input data to the CNN can go through each layer of the trained CNN according to the known network architecture of the trained CNN. From 340, method 300 proceeds to 350.
  • At 350, an output of the each of the one or more convolutional layers (say, Layer k) is computed based on the pre-computed reusable factor and the input data. The output of the Layer k includes a result of a convolution operation performed between an input to the Layer k and the filters of the Layer k. The input to the Layer k can include the input data to the trained CNN, an output of a previous layer (e.g., Layer k−1) of the trained CNN, or a combination of them. In some implementations, the output of the Layer k can be computed using a Winograd convolutional operator (e.g., the Wino-Conv2D 135 as described with respect to FIG. 1) based on the pre-computed reusable factor (e.g., the tensor U 125) and the input data (either directly or indirectly by the virtue of operating on the output of a previous layer). In some implementations, the output of the Layer k can be computed according to the Winograd minimal filtering algorithm (e.g., the example algorithm 200 described with respect to FIG. 2). From 350, method 300 proceeds to 360.
  • At 360, output data of the trained convolutional network is determined based on the output of the each of the one or more convolutional layers. In some implementations, determining output data of the trained convolutional network includes generating, predicting, or otherwise computing the output data of the trained CNN. The output data can include a prediction, classification, or other features or attributes derived from the input data by the operations of the trained CNN. For example, the output data can include a vector of probabilities of possible categories (e.g., a person, a dog, a car, or a tree) of the input image. As another example, the output data can include an identification of an object in the input image (e.g., for face recognition). As yet another example, the output data can include an identification or categorization of vehicular damages based on input photos of a vehicle. In some implementations, the output data can include an enhanced or filtered version of the input data (e.g., a sharpened or blurred version of an input photo).
  • In some implementations, the output data of the trained convolutional network are determined based on the output of the each of the one or more convolutional layers in that the output data of the trained convolutional network are determined based on one or more of the outputs of the each of the one or more convolutional layers (e.g., the output of the Layer k, for k∈[1, L], where L is the total number of layers in the trained CNN). Typically, the output data of the trained convolutional network is determined based on the output of the last layer, Layer L, of the trained CNN, where the output of the Layer L is determined based on the output of the previous layer, Layer L−1, and so on, until tracing back to the input data to the trained CNN. In some implementations, the output data of the trained convolutional network can be determined based on the output of the each of the one or more convolutional layers in another manner. From 360, method 300 proceeds to 370.
  • At 370, the output data of the trained convolutional network can be output, for example, via a user interface (e.g., a graphical user interface). For example, the output data of the trained convolutional network can be represented in a table, a graph, a text, or another format and displayed to a user via a screen or another user interface. In some implementations, the output data of the trained convolutional network can be saved, transmitted, or otherwise output to another device (e.g., a storage device or another data processing apparatus for further processing).
  • From 370, method 300 goes back to 340 to determine if any input data (e.g., referred to as second input data) to the trained CNN are received. In some implementations, the second input data to the trained CNN can include additional or updated input data (e.g., another set of images) to the trained CNN for analyzing and prediction. Accordingly, method 300 can proceed to 350 for computing a second output of the each of the one or more convolutional layers based on the pre-computed reusable factor and the second input data; and then to 360 for determining second output data of the trained convolutional network based on the second output of the each of the one or more convolutional layers.
  • FIG. 4 is a block diagram of an example computer system 400 used to provide computational functionalities associated with described algorithms, methods, functions, processes, flows, and procedures, as described in the instant disclosure, according to an implementation. The illustrated computer 402 is intended to encompass any computing device such as a server, desktop computer, laptop/notebook computer, wireless data port, smart phone, personal digital assistant (PDA), tablet computing device, one or more processors within these devices, or any other suitable processing device, including physical or virtual instances (or both) of the computing device. Additionally, the computer 402 may include a computer that includes an input device, such as a keypad, keyboard, touch screen, or other device that can accept user information, and an output device that conveys information associated with the operation of the computer 402, including digital data, visual, or audio information (or a combination of information), or a graphical-type user interface (UI) (or GUI).
  • The computer 402 can serve in a role as a client, network component, a server, a database or other persistency, or any other component (or a combination of roles) of a computer system for performing the subject matter described in the instant disclosure. The illustrated computer 402 is communicably coupled with a network 430. In some implementations, one or more components of the computer 402 may be configured to operate within environments, including cloud-computing-based, local, global, or other environment (or a combination of environments).
  • At a high level, the computer 402 is an electronic computing device operable to receive, transmit, process, store, or manage data and information associated with the described subject matter. According to some implementations, the computer 402 may also include or be communicably coupled with an application server, e-mail server, web server, caching server, streaming data server, or other server (or a combination of servers).
  • The computer 402 can receive requests over network 430 from a client application (for example, executing on another computer 402) and respond to the received requests by processing the received requests using an appropriate software application(s). In addition, requests may also be sent to the computer 402 from internal users (for example, from a command console or by other appropriate access method), external or third-parties, other automated applications, as well as any other appropriate entities, individuals, systems, or computers.
  • Each of the components of the computer 402 can communicate using a system bus 403. In some implementations, any or all of the components of the computer 402, hardware or software (or a combination of both hardware and software), may interface with each other or the interface 404 (or a combination of both), over the system bus 403 using an application programming interface (API) 412 or a service layer 413 (or a combination of the API 412 and service layer 413). The API 412 may include specifications for routines, data structures, and object classes. The API 412 may be either computer-language independent or dependent and refer to a complete interface, a single function, or even a set of APIs. The service layer 413 provides software services to the computer 402 or other components (whether or not illustrated) that are communicably coupled to the computer 402. The functionality of the computer 402 may be accessible for all service consumers using this service layer. Software services, such as those provided by the service layer 413, provide reusable, defined functionalities through a defined interface. For example, the interface may be software written in JAVA, C++, or other suitable language providing data in extensible markup language (XML) format or other suitable format. While illustrated as an integrated component of the computer 402, alternative implementations may illustrate the API 412 or the service layer 413 as stand-alone components in relation to other components of the computer 402 or other components (whether or not illustrated) that are communicably coupled to the computer 402. Moreover, any or all parts of the API 412 or the service layer 413 may be implemented as child or sub-modules of another software module, enterprise application, or hardware module without departing from the scope of this disclosure.
  • The computer 402 includes an interface 404. Although illustrated as a single interface 404 in FIG. 4, two or more interfaces 404 may be used according to particular needs, desires, or particular implementations of the computer 402. The interface 404 is used by the computer 402 for communicating with other systems that are connected to the network 430 (whether illustrated or not) in a distributed environment. Generally, the interface 404 includes logic encoded in software or hardware (or a combination of software and hardware) and is operable to communicate with the network 430. More specifically, the interface 404 may include software supporting one or more communication protocols associated with communications such that the network 430 or interface's hardware is operable to communicate physical signals within and outside of the illustrated computer 402.
  • The computer 402 includes a processor 405. Although illustrated as a single processor 405 in FIG. 4, two or more processors may be used according to particular needs, desires, or particular implementations of the computer 402. Generally, the processor 405 executes instructions and manipulates data to perform the operations of the computer 402 and any algorithms, methods, functions, processes, flows, and procedures as described in the instant disclosure.
  • The computer 402 also includes a database 406 that can hold data for the computer 402 or other components (or a combination of both) that can be connected to the network 430 (whether illustrated or not). For example, database 406 can be an in-memory, conventional, or other type of database storing data consistent with this disclosure. In some implementations, database 406 can be a combination of two or more different database types (for example, a hybrid in-memory and conventional database) according to particular needs, desires, or particular implementations of the computer 402 and the described functionality. Although illustrated as a single database 406 in FIG. 4, two or more databases (of the same or combination of types) can be used according to particular needs, desires, or particular implementations of the computer 402 and the described functionality. While database 406 is illustrated as an integral component of the computer 402, in alternative implementations, database 406 can be external to the computer 402. As illustrated, the database 406 holds one or more trained CNNs 416, pre-computed reusable factors 418 of each convolutional layer of the one or more trained CNNs 416, and Winograd algorithm 426, for fast computation of a CNN.
  • The computer 402 also includes a memory 407 that can hold data for the computer 402 or other components (or a combination of both) that can be connected to the network 430 (whether illustrated or not). Memory 407 can store any data consistent with this disclosure. In some implementations, memory 407 can be a combination of two or more different types of memory (for example, a combination of semiconductor and magnetic storage) according to particular needs, desires, or particular implementations of the computer 402 and the described functionality. Although illustrated as a single memory 407 in FIG. 4, two or more memories 407 (of the same or combination of types) can be used according to particular needs, desires, or particular implementations of the computer 402 and the described functionality. While memory 407 is illustrated as an integral component of the computer 402, in alternative implementations, memory 407 can be external to the computer 402.
  • The application 408 is an algorithmic software engine providing functionality according to particular needs, desires, or particular implementations of the computer 402, particularly with respect to functionality described in this disclosure. For example, application 408 can serve as one or more components, modules, or applications. Further, although illustrated as a single application 408, the application 408 may be implemented as multiple applications 408 on the computer 402. In addition, although illustrated as integral to the computer 402, in alternative implementations, the application 408 can be external to the computer 402.
  • The computer 402 can also include a power supply 414. The power supply 414 can include a rechargeable or non-rechargeable battery that can be configured to be either user- or non-user-replaceable. In some implementations, the power supply 414 can include power-conversion or management circuits (including recharging, standby, or other power management functionality). In some implementations, the power supply 414 can include a power plug to allow the computer 402 to be plugged into a wall socket or other power source to, for example, power the computer 402 or recharge a rechargeable battery.
  • There may be any number of computers 402 associated with, or external to, a computer system containing computer 402, each computer 402 communicating over network 430. Further, the terms "client," "user," and other appropriate terminology may be used interchangeably, as appropriate, without departing from the scope of this disclosure. Moreover, this disclosure contemplates that many users may use one computer 402, or that one user may use multiple computers 402.
  • Described implementations of the subject matter can include one or more features, alone or in combination.
  • For example, in a first implementation, a computer-implemented method including: obtaining, by a data processing apparatus, a trained convolutional neural network including one or more convolutional layers, each of the one or more convolutional layers including a number of filters with known filter parameters; pre-computing, by the data processing apparatus, a reusable factor for each of the one or more convolutional layers based on the known filter parameters of the trained convolutional neural network; receiving, by the data processing apparatus, input data to the trained convolutional neural network; computing, by the data processing apparatus, an output of the each of the one or more convolutional layers using a Winograd convolutional operator based on the pre-computed reusable factor and the input data; and determining, by the data processing apparatus, output data of the trained convolutional network based on the output of the each of the one or more convolutional layers.
  • In a second implementation, a non-transitory, computer-readable medium storing one or more instructions executable by a computer system to perform operations including: obtaining a trained convolutional neural network including one or more convolutional layers, each of the one or more convolutional layers including a number of filters with known filter parameters; pre-computing a reusable factor for each of the one or more convolutional layers based on the known filter parameters of the trained convolutional neural network; receiving input data to the trained convolutional neural network; computing an output of the each of the one or more convolutional layers using a Winograd convolutional operator based on the pre-computed reusable factor and the input data; and determining output data of the trained convolutional network based on the output of the each of the one or more convolutional layers.
  • In a third implementation, a computer-implemented system, including: one or more computers; and one or more computer memory devices interoperably coupled with the one or more computers and having tangible, non-transitory, machine-readable media storing instructions, that when executed by the one or more computers, perform operations including: obtaining a trained convolutional neural network including one or more convolutional layers, each of the one or more convolutional layers including a number of filters with known filter parameters; pre-computing a reusable factor for each of the one or more convolutional layers based on the known filter parameters of the trained convolutional neural network; receiving input data to the trained convolutional neural network; computing an output of the each of the one or more convolutional layers using a Winograd convolutional operator based on the pre-computed reusable factor and the input data; and determining output data of the trained convolutional network based on the output of the each of the one or more convolutional layers.
  • The foregoing and other described implementations can each, optionally, include one or more of the following features:
  • A first feature, combinable with any of the following features, wherein pre-computing a reusable factor for each of the one or more convolutional layers based on the known filter parameters of the trained convolutional neural network includes computing the reusable factor for each of the one or more convolutional layers offline, based on the known filter parameters of the trained convolutional neural network and regardless of the input data to the trained convolutional neural network.
  • A second feature, combinable with any of the previous or following features, further including saving the pre-computed reusable factor for each of the one or more convolutional layers.
  • A third feature, combinable with any of the previous or following features, wherein computing an output of each of the one or more convolutional layers based on the pre-computed reusable factor and the input data includes computing the output of each of the one or more convolutional layers according to a Winograd minimal filtering algorithm with the pre-computed reusable factor and the input data (an illustrative sketch of such an algorithm follows this list of features).
  • A fourth feature, combinable with any of the previous or following features, wherein the input data to the trained convolutional neural network includes one or more images.
  • A fifth feature, combinable with any of the previous or following features, the method or the operations further including: receiving, by the data processing apparatus, second input data to the trained convolutional neural network; computing, by the data processing apparatus, a second output of each of the one or more convolutional layers based on the pre-computed reusable factor and the second input data; and predicting, by the data processing apparatus, second output data of the trained convolutional neural network based on the second output of each of the one or more convolutional layers.
  • A sixth feature, combinable with any of the previous or following features, the method or the operations further including outputting the output data of the trained convolutional neural network via a user interface.
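  • For purposes of illustration only, and not by way of limitation, the following minimal sketch (in Python with NumPy) shows one conventional form of the Winograd minimal filtering algorithm, F(2×2, 3×3), consistent with the features described above. In this sketch, the transformed filter U plays the role of the pre-computed reusable factor: it depends only on the known filter parameters of the trained network, never on the input data, so it can be computed offline, saved, and reused for every inference request. The function names are hypothetical and are not part of any claimed implementation.

    import numpy as np

    # Winograd F(2x2, 3x3) transform matrices from the minimal
    # filtering literature.
    B_T = np.array([[1,  0, -1,  0],
                    [0,  1,  1,  0],
                    [0, -1,  1,  0],
                    [0,  1,  0, -1]], dtype=np.float64)
    G = np.array([[1.0,  0.0, 0.0],
                  [0.5,  0.5, 0.5],
                  [0.5, -0.5, 0.5],
                  [0.0,  0.0, 1.0]], dtype=np.float64)
    A_T = np.array([[1, 1,  1,  0],
                    [0, 1, -1, -1]], dtype=np.float64)

    def precompute_reusable_factor(g):
        """Offline step: transform a trained 3x3 filter g into the
        Winograd domain. Depends only on the known filter parameters,
        so it is computed once after training and saved."""
        return G @ g @ G.T            # 4x4 reusable factor U

    def winograd_tile(U, d):
        """Online step: compute a 2x2 output tile from a 4x4 input
        tile d, reusing the pre-computed factor U."""
        V = B_T @ d @ B_T.T           # input transform, per tile
        M = U * V                     # element-wise product
        return A_T @ M @ A_T.T        # inverse transform -> 2x2 tile

  • In this sketch, winograd_tile(precompute_reusable_factor(g), d) reproduces the 2×2 valid convolution of a 4×4 input tile d with a 3×3 filter g, while the per-inference work is reduced to the input transform, an element-wise product, and the inverse transform, because the filter transform has already been paid for offline.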
  • Implementations of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Software implementations of the described subject matter can be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible, non-transitory, computer-readable computer-storage medium for execution by, or to control the operation of, data processing apparatus. Alternatively, or additionally, the program instructions can be encoded in/on an artificially generated propagated signal, for example, a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer-storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of computer-storage mediums. Configuring one or more computers means that the one or more computers have installed hardware, firmware, or software (or combinations of hardware, firmware, and software) so that when the software is executed by the one or more computers, particular computing operations are performed.
  • The term “real-time,” “real time,” “realtime,” “real (fast) time (RFT),” “near(ly) real-time (NRT),” “quasi real-time,” or similar terms (as understood by one of ordinary skill in the art), means that an action and a response are temporally proximate such that an individual perceives the action and the response occurring substantially simultaneously. For example, the time difference for a response to display (or for an initiation of a display) of data following the individual's action to access the data may be less than 1 ms, less than 1 sec., or less than 5 secs. While the requested data need not be displayed (or initiated for display) instantaneously, it is displayed (or initiated for display) without any intentional delay, taking into account processing limitations of a described computing system and time required to, for example, gather, accurately measure, analyze, process, store, or transmit the data.
  • The terms “data processing apparatus,” “computer,” or “electronic computer device” (or equivalent as understood by one of ordinary skill in the art) refer to data processing hardware and encompass all kinds of apparatus, devices, and machines for processing data, including by way of example, a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, for example, a central processing unit (CPU), an FPGA (field programmable gate array), or an ASIC (application-specific integrated circuit). In some implementations, the data processing apparatus or special purpose logic circuitry (or a combination of the data processing apparatus or special purpose logic circuitry) may be hardware- or software-based (or a combination of both hardware- and software-based). The apparatus can optionally include code that creates an execution environment for computer programs, for example, code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of execution environments. The present disclosure contemplates the use of data processing apparatuses with or without conventional operating systems, for example, LINUX, UNIX, WINDOWS, MAC OS, ANDROID, IOS, or any other suitable conventional operating system.
  • A computer program, which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, for example, one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, for example, files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
  • While portions of the programs illustrated in the various figures are shown as individual modules that implement the various features and functionality through various objects, methods, or other processes, the programs may instead include a number of sub-modules, third-party services, components, libraries, and such, as appropriate. Conversely, the features and functionality of various components can be combined into single components, as appropriate. Thresholds used to make computational determinations can be statically, dynamically, or both statically and dynamically determined.
  • The methods, processes, or logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The methods, processes, or logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, for example, a CPU, an FPGA, or an ASIC.
  • Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors, both, or any other kind of CPU. Generally, a CPU will receive instructions and data from, and write data to, a memory. The essential elements of a computer are a CPU, for performing or executing instructions, and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to, receive data from or transfer data to, or both, one or more mass storage devices for storing data, for example, magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, for example, a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device, for example, a universal serial bus (USB) flash drive, to name just a few.
  • Computer-readable media (transitory or non-transitory, as appropriate) suitable for storing computer program instructions and data includes all forms of permanent/non-permanent or volatile/non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, for example, random access memory (RAM), read-only memory (ROM), phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and flash memory devices; magnetic devices, for example, tape, cartridges, cassettes, internal/removable disks; magneto-optical disks; and optical memory devices, for example, digital video disc (DVD), CD-ROM, DVD+/−R, DVD-RAM, DVD-ROM, HD-DVD, and BLURAY, and other optical memory technologies. The memory may store various objects or data, including caches, classes, frameworks, applications, modules, backup data, jobs, web pages, web page templates, data structures, database tables, repositories storing dynamic information, and any other appropriate information including any parameters, variables, algorithms, instructions, rules, constraints, or references thereto. Additionally, the memory may include any other appropriate data, such as logs, policies, security or access data, reporting files, as well as others. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
  • To provide for interaction with a user, implementations of the subject matter described in this specification can be implemented on a computer having a display device, for example, a CRT (cathode ray tube), LCD (liquid crystal display), LED (Light Emitting Diode), or plasma monitor, for displaying information to the user and a keyboard and a pointing device, for example, a mouse, trackball, or trackpad by which the user can provide input to the computer. Input may also be provided to the computer using a touchscreen, such as a tablet computer surface with pressure sensitivity, a multi-touch screen using capacitive or electric sensing, or other type of touchscreen. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, for example, visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.
  • The term “graphical user interface,” or “GUI,” may be used in the singular or the plural to describe one or more graphical user interfaces and each of the displays of a particular graphical user interface. Therefore, a GUI may represent any graphical user interface, including but not limited to, a web browser, a touch screen, or a command line interface (CLI) that processes information and efficiently presents the information results to the user. In general, a GUI may include one or more user interface (UI) elements, some or all associated with a web browser, such as interactive fields, pull-down lists, and buttons. These and other UI elements may be related to or represent the functions of the web browser.
  • Implementations of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, for example, as a data server, or that includes a middleware component, for example, an application server, or that includes a front-end component, for example, a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of wireline or wireless digital data communication (or a combination of data communication), for example, a communication network. Examples of communication networks include a local area network (LAN), a radio access network (RAN), a metropolitan area network (MAN), a wide area network (WAN), Worldwide Interoperability for Microwave Access (WIMAX), a wireless local area network (WLAN) using, for example, 802.11 a/b/g/n or 802.20 (or a combination of 802.11x and 802.20 or other protocols consistent with this disclosure), all or a portion of the Internet, or any other communication system or systems at one or more locations (or a combination of communication networks). The network may communicate with, for example, Internet Protocol (IP) packets, Frame Relay frames, Asynchronous Transfer Mode (ATM) cells, voice, video, data, or other suitable information (or a combination of communication types) between network addresses.
  • The computing system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular implementations of particular inventions. Certain features that are described in this specification in the context of separate implementations can also be implemented, in combination, in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations, separately, or in any suitable sub-combination. Moreover, although previously described features may be described as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can, in some cases, be excised from the combination, and the claimed combination may be directed to a sub-combination or variation of a sub-combination.
  • Particular implementations of the subject matter have been described. Other implementations, alterations, and permutations of the described implementations are within the scope of the following claims as will be apparent to those skilled in the art. While operations are depicted in the drawings or claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed (some operations may be considered optional), to achieve desirable results. In certain circumstances, multitasking or parallel processing (or a combination of multitasking and parallel processing) may be advantageous and performed as deemed appropriate.
  • Moreover, the separation or integration of various system modules and components in the previously described implementations should not be understood as requiring such separation or integration in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
  • Accordingly, the previously described example implementations do not define or constrain this disclosure. Other changes, substitutions, and alterations are also possible without departing from the spirit and scope of this disclosure.
  • Furthermore, any claimed implementation is considered to be applicable to at least a computer-implemented method; a non-transitory, computer-readable medium storing computer-readable instructions to perform the computer-implemented method; and a computer system including a computer memory interoperably coupled with a hardware processor configured to perform the computer-implemented method or the instructions stored on the non-transitory, computer-readable medium.

Claims (20)

1. A computer-implemented method, comprising:
obtaining, by a data processing apparatus, a trained convolutional neural network comprising at least a first convolutional layer and a second convolutional layer, each of the at least the first convolutional layer and the second convolutional layer comprising a corresponding plurality of filters with known filter parameters;
pre-computing, by the data processing apparatus, a first reusable factor for the first convolutional layer based on the known filter parameters regardless of any input data to the trained convolutional neural network;
pre-computing, by the data processing apparatus, a second reusable factor for the second convolutional layer based on the known filter parameters regardless of any input data to the trained convolutional neural network;
identifying, by the data processing apparatus, a first input data to the first convolutional layer of the trained convolutional neural network;
computing, by the data processing apparatus, a first output of the first convolutional layer using a Winograd convolutional operator based on the first reusable factor and the first input data;
using, by the data processing apparatus, the first output as a second input data to the second convolutional layer of the trained convolutional neural network; and
computing, by the data processing apparatus, a second output of the second convolutional layer using the Winograd convolutional operator based on the second reusable factor and the second input data.
2. (canceled)
3. The computer-implemented method of claim 1, further comprising saving the first reusable factor for the first convolutional layer.
4. The computer-implemented method of claim 1, wherein computing a first output of the first convolutional layer using a Winograd convolutional operator based on the first reusable factor and the first input data comprises computing the first output of the first convolutional layer according to a Winograd minimal filtering algorithm with the first reusable factor and the first input data.
5. The computer-implemented method of claim 1, wherein the first input data to the trained convolutional neural network comprises one or more images.
6. The computer-implemented method of claim 1, further comprising:
determining, by the data processing apparatus, output data of the trained convolutional neural network based on the second output of the second convolutional layer.
7. The computer-implemented method of claim 1, further comprising outputting output data of the trained convolutional neural network via a user interface.
8. A non-transitory, computer-readable medium storing one or more instructions executable by a computer system to perform operations comprising:
obtaining a trained convolutional neural network comprising at least a first convolutional layer and a second convolutional layer, each of the at least the first convolutional layer and the second convolutional layer comprising a corresponding plurality of filters with known filter parameters;
pre-computing a first reusable factor for the first convolutional layer based on the known filter parameters regardless of any input data to the trained convolutional neural network;
pre-computing a second reusable factor for the second convolutional layer based on the known filter parameters regardless of any input data to the trained convolutional neural network;
identifying a first input data to the first convolutional layer of the trained convolutional neural network;
computing a first output of the first convolutional layer using a Winograd convolutional operator based on the first reusable factor and the first input data;
using the first output as a second input data to the second convolutional layer of the trained convolutional neural network; and
computing a second output of the second convolutional layer using the Winograd convolutional operator based on the second reusable factor and the second input data.
9. (canceled)
10. The non-transitory, computer-readable medium of claim 8, the operations further comprising saving the first reusable factor for the first convolutional layer.
11. The non-transitory, computer-readable medium of claim 8, wherein computing a first output of the first convolutional layer using a Winograd convolutional operator based on the first reusable factor and the first input data comprises computing the first output of the first convolutional layer according to a Winograd minimal filtering algorithm with the first reusable factor and the first input data.
12. The non-transitory, computer-readable medium of claim 8, wherein the first input data to the trained convolutional neural network comprises one or more images.
13. The non-transitory, computer-readable medium of claim 8, the operations further comprising:
determining output data of the trained convolutional neural network based on the second output of the second convolutional layer.
14. The non-transitory, computer-readable medium of claim 8, the operations further comprising outputting output data of the trained convolutional neural network via a user interface.
15. A computer-implemented system, comprising:
one or more computers; and
one or more computer memory devices interoperably coupled with the one or more computers and having tangible, non-transitory, machine-readable media storing instructions that, when executed by the one or more computers, perform operations comprising:
obtaining a trained convolutional neural network comprising at least a first convolutional layer and a second convolutional layer, each of the at least the first convolutional layer and the second convolutional layer comprising a corresponding plurality of filters with known filter parameters;
pre-computing a first reusable factor for the first convolutional layer based on the known filter parameters regardless of any input data to the trained convolutional neural network;
pre-computing a second reusable factor for the second convolutional layer based on the known filter parameters regardless of any input data to the trained convolutional neural network;
identifying a first input data to the first convolutional layer of the trained convolutional neural network;
computing a first output of the first convolutional layer using a Winograd convolutional operator based on the first reusable factor and the first input data;
using the first output as a second input data to the second convolutional layer of the trained convolutional neural network; and
computing a second output of the second convolutional layer using the Winograd convolutional operator based on the second reusable factor and the second input data.
16. (canceled)
17. The computer-implemented system of claim 15, the operations further comprising saving the first reusable factor for the first convolutional layer.
18. The computer-implemented system of claim 15, wherein computing a first output of the first convolutional layer using a Winograd convolutional operator based on the first reusable factor and the first input data comprises computing the first output of the first convolutional layer according to a Winograd minimal filtering algorithm with the first reusable factor and the first input data.
19. The computer-implemented system of claim 15, the operations further comprising:
determining output data of the trained convolutional neural network based on the second output of the second convolutional layer.
20. The computer-implemented system of claim 15, further comprising a user interface; and the operations further comprising outputting output data of the trained convolutional neural network via the user interface.
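For purposes of illustration only, the two-layer flow recited in claim 1 may be sketched as follows, reusing the hypothetical NumPy helpers precompute_reusable_factor and winograd_tile from the sketch given earlier in this description; the zero-padding of the first output is a toy stand-in for the tile assembly and re-tiling a full implementation would perform between layers:

    def run_two_layers(d, g1, g2):
        # Offline, input-independent pre-computation (once per layer).
        U1 = precompute_reusable_factor(g1)
        U2 = precompute_reusable_factor(g2)
        # Online: the first layer's output is used as the second
        # layer's input, each layer reusing its own factor.
        first_output = winograd_tile(U1, d)        # 2x2 output tile
        second_input = np.pad(first_output, 1)     # toy re-tiling to 4x4
        return winograd_tile(U2, second_input)     # 2x2 output tile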
US16/390,042 2018-10-24 2019-04-22 Fast computation of a convolutional neural network Active US10635951B1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/111655 WO2020082263A1 (en) 2018-10-24 2018-10-24 Fast computation of convolutional neural network

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/111655 Continuation WO2020082263A1 (en) 2018-10-24 2018-10-24 Fast computation of convolutional neural network

Publications (2)

Publication Number Publication Date
US10635951B1 US10635951B1 (en) 2020-04-28
US20200134400A1 true US20200134400A1 (en) 2020-04-30

Family

ID=66850328

Family Applications (1)

Application Number Title Priority Date Filing Date
US16/390,042 Active US10635951B1 (en) 2018-10-24 2019-04-22 Fast computation of a convolutional neural network

Country Status (14)

Country Link
US (1) US10635951B1 (en)
EP (1) EP3662414A4 (en)
JP (1) JP6798021B1 (en)
KR (1) KR102141324B1 (en)
CN (1) CN110537193A (en)
AU (1) AU2018353930B2 (en)
BR (1) BR112019008055B1 (en)
CA (1) CA3040685C (en)
MX (1) MX2019004654A (en)
PH (1) PH12019500889A1 (en)
RU (1) RU2722473C1 (en)
SG (1) SG11201903591QA (en)
WO (1) WO2020082263A1 (en)
ZA (1) ZA201902547B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11494608B2 (en) * 2019-08-14 2022-11-08 Intel Corporation Methods and apparatus to tile walk a tensor for convolution operations
US20210216868A1 (en) * 2020-01-08 2021-07-15 Maxim Integrated Products, Inc. Systems and methods for reducing memory requirements in neural networks
CN111294512A (en) * 2020-02-10 2020-06-16 深圳市铂岩科技有限公司 Image processing method, image processing apparatus, storage medium, and image pickup apparatus
CN111475775B (en) * 2020-04-14 2023-09-15 腾讯科技(深圳)有限公司 Data processing method, text processing method, device and equipment of graphic processor
CN111415000B (en) * 2020-04-29 2024-03-22 Oppo广东移动通信有限公司 Convolutional neural network, and data processing method and device based on convolutional neural network
CN111580828B (en) * 2020-04-30 2021-08-27 腾讯科技(深圳)有限公司 Compiling optimization method and device of machine learning model
CN111553466B (en) * 2020-04-30 2024-03-22 上海商汤智能科技有限公司 Information processing method, device and equipment
US11687336B2 (en) * 2020-05-08 2023-06-27 Black Sesame Technologies Inc. Extensible multi-precision data pipeline for computing non-linear and arithmetic functions in artificial neural networks
CN111767964A (en) * 2020-07-08 2020-10-13 福州大学 Improved DenseNet-based multi-channel feature re-labeling image classification method
JP2022018997A (en) * 2020-07-17 2022-01-27 ソニーセミコンダクタソリューションズ株式会社 Solid state image sensor, imaging apparatus, and information processing system
WO2022064901A1 (en) * 2020-09-28 2022-03-31 富士フイルム株式会社 Trained model transformation method, inference method, trained model transformation device, trained model, and inference device
EP4241206A4 (en) * 2020-12-01 2024-01-03 Huawei Tech Co Ltd Device and method for implementing a tensor-train decomposition operation
JP7420100B2 (en) 2021-03-15 2024-01-23 オムロン株式会社 Processing device, processing method, and program
KR20220162971A (en) * 2021-06-02 2022-12-09 세메스 주식회사 Data processing method and data comparing method
CN117495833A (en) * 2023-11-16 2024-02-02 广州思沛医药科技股份有限公司 Big data-based cerebral apoplexy prediction method, system and storage medium

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
IL272481B2 (en) * 2012-05-04 2024-03-01 Rearden Llc System and methods for coping with doppler effects in distributed-input distributed-output wireless systems
GB2539845B (en) * 2015-02-19 2017-07-12 Magic Pony Tech Ltd Offline training of hierarchical algorithms
US10403269B2 (en) * 2015-03-27 2019-09-03 Google Llc Processing audio waveforms
CN105740894B (en) * 2016-01-28 2020-05-29 北京航空航天大学 Semantic annotation method for hyperspectral remote sensing image
US20170344876A1 (en) * 2016-05-31 2017-11-30 Samsung Electronics Co., Ltd. Efficient sparse parallel winograd-based convolution scheme
CN106407986B (en) * 2016-08-29 2019-07-19 电子科技大学 A kind of identification method of image target of synthetic aperture radar based on depth model
GB2554711B (en) * 2016-10-06 2020-11-25 Imagination Tech Ltd Buffer addressing for a convolutional neural network
US10230725B2 (en) * 2016-10-24 2019-03-12 Sonicwall Inc. Edge protection for internal identity providers
US10482155B2 (en) * 2016-12-30 2019-11-19 Intel Corporation Winograd algorithm on a matrix processing architecture
RU2651147C1 (en) * 2017-03-27 2018-04-18 Акционерное общество "ЭЛВИС-НеоТек" Device of cascade processing of a stream of images by means of convolutional neural network
US10467795B2 (en) * 2017-04-08 2019-11-05 Intel Corporation Sub-graph in frequency domain and dynamic selection of convolution implementation on a GPU
CN107480707B (en) * 2017-07-26 2020-08-07 天津大学 Deep neural network method based on information lossless pooling
US10990648B2 (en) * 2017-08-07 2021-04-27 Intel Corporation System and method for an optimized winograd convolution accelerator
KR102452953B1 (en) * 2017-10-30 2022-10-11 삼성전자주식회사 Method and apparatus for performing convolution operation in neural network
CN107844833A (en) * 2017-11-28 2018-03-27 郑州云海信息技术有限公司 A kind of data processing method of convolutional neural networks, device and medium
US10372787B2 (en) * 2017-12-12 2019-08-06 Facebook, Inc. Hardware accelerator pre-configured with coefficients for matrix-transform operations

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190042923A1 (en) * 2017-08-07 2019-02-07 Intel Corporation System and method for an optimized winograd convolution accelerator
US10990648B2 (en) * 2017-08-07 2021-04-27 Intel Corporation System and method for an optimized winograd convolution accelerator
US20210081765A1 (en) * 2019-09-16 2021-03-18 Qualcomm Incorporated Efficient inferencing with fast pointwise convolution
US11657282B2 (en) * 2019-09-16 2023-05-23 Qualcomm Incorporated Efficient inferencing with fast pointwise convolution
EP4053746A4 (en) * 2019-11-01 2023-01-11 Cambricon Technologies Corporation Limited Winograd convolution operation method, apparatus, and device, and storage medium
EP4053753A4 (en) * 2019-11-01 2023-05-03 Cambricon Technologies Corporation Limited Operation apparatus and related product

Also Published As

Publication number Publication date
JP6798021B1 (en) 2020-12-09
CN110537193A (en) 2019-12-03
KR102141324B1 (en) 2020-08-05
JP2021501377A (en) 2021-01-14
MX2019004654A (en) 2022-05-04
SG11201903591QA (en) 2020-05-28
KR20200049695A (en) 2020-05-08
PH12019500889A1 (en) 2019-06-17
EP3662414A1 (en) 2020-06-10
WO2020082263A1 (en) 2020-04-30
RU2722473C1 (en) 2020-06-01
AU2018353930A1 (en) 2020-05-14
US10635951B1 (en) 2020-04-28
CA3040685A1 (en) 2020-04-28
AU2018353930B2 (en) 2020-10-08
ZA201902547B (en) 2021-02-24
CA3040685C (en) 2020-07-28
BR112019008055B1 (en) 2022-02-01
EP3662414A4 (en) 2020-07-22
BR112019008055A2 (en) 2021-05-18

Similar Documents

Publication Publication Date Title
US10635951B1 (en) Fast computation of a convolutional neural network
US8843423B2 (en) Missing value imputation for predictive models
US9262493B1 (en) Data analytics lifecycle processes
US20190050465A1 (en) Methods and systems for feature engineering
US11132619B1 (en) Filtering in trainable networks
US10909471B2 (en) Resource-efficient machine learning
JP2017527013A (en) Adaptive characterization as a service
WO2022043798A1 (en) Automated query predicate selectivity prediction using machine learning models
US11811429B2 (en) Variational dropout with smoothness regularization for neural network model compression
US11934958B2 (en) Compressing generative adversarial neural networks
US11935271B2 (en) Neural network model compression with selective structured weight unification
US10997502B1 (en) Complexity optimization of trainable networks
US11568183B2 (en) Generating saliency masks for inputs of models using saliency metric
US11042891B2 (en) Optimizing revenue savings for actionable predictions of revenue change
US20210201157A1 (en) Neural network model compression with quantizability regularization
US20230206114A1 (en) Fair selective classification via a variational mutual information upper bound for imposing sufficiency
US20210232891A1 (en) Neural network model compression with structured weight unification
US11861452B1 (en) Quantized softmax layer for neural networks
US11062334B2 (en) Predicting ledger revenue change behavior of clients receiving services
US20220318993A1 (en) Predicting patient responses to a chemical substance
US20240119093A1 (en) Enhanced document ingestion using natural language processing
US20240119343A1 (en) Multi-objective work prioritization for common assets
CN113052309A (en) Method, computer system and storage medium for compressing neural network model
EP4100962A1 (en) Systems and methods for predicting patient responses
CN117135151A (en) Fault detection method of GPU (graphics processing unit) cluster, GPU cluster and electronic equipment

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

AS Assignment

Owner name: ALIBABA GROUP HOLDING LIMITED, CAYMAN ISLANDS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LIU, YONGCHAO;HUANG, QIYIN;PAN, GUOZHEN;AND OTHERS;REEL/FRAME:049929/0390

Effective date: 20190604

STCF Information on status: patent grant

Free format text: PATENTED CASE

CC Certificate of correction
AS Assignment

Owner name: ADVANTAGEOUS NEW TECHNOLOGIES CO., LTD., CAYMAN ISLANDS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:ALIBABA GROUP HOLDING LIMITED;REEL/FRAME:053743/0464

Effective date: 20200826

AS Assignment

Owner name: ADVANCED NEW TECHNOLOGIES CO., LTD., CAYMAN ISLANDS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:ADVANTAGEOUS NEW TECHNOLOGIES CO., LTD.;REEL/FRAME:053754/0625

Effective date: 20200910

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4