WO2017216976A1 - Information processing method and device for neural network - Google Patents

Information processing method and device for neural network

Info

Publication number
WO2017216976A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
layer
activation
processing
pooling
Prior art date
Application number
PCT/JP2016/068741
Other languages
French (fr)
Inventor
Vijay DAULTANI
Original Assignee
Nec Corporation
Priority date
Filing date
Publication date
Application filed by Nec Corporation
Priority to PCT/JP2016/068741
Publication of WO2017216976A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

An information processing device for a neural network, the information processing device including: a neural network reconfiguration unit configured to swap an order of activation processing and pooling processing in a target portion of a neural network in which convolution processing, the activation processing, and the pooling processing occur in order, when the activation processing is a non-decreasing function and the pooling processing is a max function; and a processing unit configured to process input data by using the neural network as reconfigured by the neural network reconfiguration unit.

Description

DESCRIPTION
INFORMATION PROCESSING METHOD AND DEVICE FOR NEURAL NETWORK
TECHNICAL FIELD
The present disclosure relates to the field of convolution neural networks used in, for example, image processing. More specifically, the present disclosure relates to a method and device to arrange an order of layers in convolution neural networks.
BACKGROUND ART
Recently, deep learning has been widely applied to the field of machine learning, particularly through the use of artificial neural networks which have shown promising results in various fields. A convolution neural network (CNN), which is one class of artificial neural networks, has seen significant research contributions in the past few years. CNNs have exhibited exceptional properties which have inspired their use for a multitude of challenging tasks. Image processing, text processing, speech processing, trade markets, etc. are some examples of the many fields where CNNs are being applied.
Machine learning has a long history and such techniques have been applied in many fields for various tasks. Before CNNs were used for these tasks, designers of machine learning systems had to determine which input features should be used to train computers in order to achieve good results. Specific features were chosen based on the designer's experience and intuition. Neural networks used these manually decided features for learning on training data. Careful selection of features required a large amount of time and effort, and had a huge impact on the results of tasks that machine learning was used to solve. Such decisions with regard to choosing features were limited by a designer's capability of wisely choosing the correct set of features. However, the use of CNNs changed this by automatically learning the features and replaced the need for a designer to choose the features.
In general, a CNN can be viewed as a computation graph, which is a thin wrapper around nodes (i.e. layers) connected together in some order. This interconnection of layers, which forms a computation graph or a network, is also known as a model. Different types of inputs (e.g. image, voice, etc.) have different characteristics, and hence a single CNN model which suits every type of input is unlikely. Therefore, new CNN models are often designed either to solve a new problem or to optimize an existing model.
A CNN model includes a number of layers and their interconnections. A typical CNN model includes some common elements such as a convolution layer, an activation layer, a pooling layer, a fully connected layer, a softmax layer and an SVM layer.
Although the above mentioned elements may be common to CNN models, the configuration of the connections of these layers differentiates one CNN model from another.
Artificial neural networks can be thought of as a simplified emulation of the visual cortex system in a human brain. However, current artificial neural networks are designed with specific engineering goals and not to emulate all the functionalities of a brain. Hence, researchers have developed models inspired by very complex human visual cortex systems. This has the advantage of reducing the amount of computation to within the limits of current state-of-the-art hardware. In these abstracted mathematical models, specific tasks from the visual cortex system may be assigned to specific layers in artificial neural networks. Layers in CNN models are arranged in specific patterns. For example, a convolutional layer is usually followed by an activation layer, which is sometimes followed by a pooling layer. Together, the convolution and activation layers model the capability of a single cell in the brain, i.e. where a cell fires (activates) if an excitatory signal (encouraging the cell to transmit the information forward to other neurons) on its dendrites is strong enough, i.e. higher than some threshold. Similarly, in a CNN model, a neuron activates if the output of a convolution operation is stronger than a predetermined threshold.
Since CNNs can have millions of neurons, the computing capability required to perform the computation for convolution neural networks is proportional to the number of neurons in the network. Hence, there is high demand for methods to shrink the output of intermediate layers in order to reduce the amount of computation. In order to perform this shrinking, activation layers are usually followed by a pooling layer which shrinks the output of the activation layers.
Different CNN models can vary from each other in many ways. These differences include the depth of the network (i.e. the number of layers in the network), the size (height, width, and depth) of each layer, the type of activation functions, the usage of pooling layers, and others. Although different from each other, commonalities exist in the structure of CNNs as discussed above. Among all of the patterns that may exist in convolution neural networks, the present invention is concerned with a pattern in which a convolution layer is followed by an activation layer, which is followed by a pooling layer. When such a pattern of a convolution layer, an activation layer, and a pooling layer exists, the respective operations of each layer are also executed in the same order.
As described in NPL1, a general and simple CNN model may have, for example, a configuration where an input of data to be processed is followed by a convolution layer, an activation layer, a pooling layer, and a fully connected layer, which may be the output of the CNN model.
As described in NPL2, different CNN models have different numbers of layers and different configurations for these layers. One example of a well-known CNN model is Alexnet, which is used for image recognition. Each CNN model differs based on design specifications for an intended application; however, the present disclosure is particularly concerned with the presence of three layers, i.e. a convolution layer, an activation layer, and a pooling layer in that order which is included in the example of Alexnet.
When such a pattern of the convolution layer, the activation layer, and the pooling layer exists in a convolution neural network, it can be replaced by a pattern of the convolution layer, the pooling layer, and the activation layer (in this order), as disclosed in PTL 1. In PTL 1, it is shown that such an order of the layers can reduce the number of computations in the network, and it is suggested that such an idea can be applied to any activation layer and pooling layer without regard to the function executed by each layer. However, PTL 1 does not recognize that, for certain functions used in the activation layer and the pooling layer, swapping the activation and pooling layers can produce unintended output or, in some cases, what are known as dead neurons, thus changing the output of parts of, and ultimately the entire, CNN.
Citation List
Patent Literature
PTL1: U.S. Application Publication No. 2015/0309961 A1
Non Patent Literature
NPL1: CS231n Convolutional Neural Networks for Visual Recognition; http://cs231n.github.io/convolutional-networks/
NPL2: ImageNet Classification with Deep Convolutional Neural Networks; https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
DISCLOSURE OF INVENTION
Technical Problem
Different CNN models may vary from each other in a variety of factors. However, in order to emulate the human visual cortex system, they share a common pattern of structure for stacking layers together. Each cell in the human visual cortex system works in two steps: first, it combines all the signals received at its dendrites, and second, it fires (activates) if the result of the first step is more than some threshold.
The convolution operation is analogous to the first step of the process of a cell, and the activation operation is analogous to the second step of the process of the cell. Since the output size of intermediate layers of a CNN model may be very large, the pooling operation is usually performed after the activation layer. The pooling operation emulates sampling of the output of nearby cells. Performing the pooling operation also introduces non-linearity, which addresses a known problem of overfitting.
Because of the resemblance of a neural network to the human visual cortex system, a convolution operation followed by an activation operation, and then by a pooling operation, is very natural. Since the two steps of a cell, i.e. summing the strengths of the incoming signals on the dendrites and firing, are logically mapped onto the convolution operation and the activation operation of a convolution neural network, this order (convolution followed by activation) tends to be a common configuration among neural network designs.
Fig. 3 shows a simplified form of a state of the art CNN model used for image recognition. It is evident from Fig. 3 that a convolution layer, followed by an activation layer, further followed by a pooling layer exists in actual models used for solving problems in real life.
Inherent in the existing order of these 3 layers (i.e. convolution, activation, pooling), which we find in most state-of-the-art CNN networks, is an opportunity for optimization which is not easily recognizable. In such a case where these 3 layers exist in order, it is possible to swap the activation layer and the pooling layer, thereby reducing the number of operations required for processing and decreasing the computing costs (i.e., increasing the speed of processing) as mentioned above. However, the prior art (for example PTL1) does not recognize a serious problem in that, for certain functions used in the activation layer and the pooling layer, swapping the activation and pooling layers can cause degradation in the integrity of the output data of the CNN.
Also, in some scenarios, swapping of the activation and pooling layers can change the output to 0, which can introduce what are known as dead neurons in the network. Dead neurons can affect the CNN model in both the training phase and the testing phase and change the results from the expected or intended results.
Solution to Problem
One object of the present disclosure is to provide a class of functions for both the activation layer and the pooling layer, which when used for the respective layers will ensure that the output of the overall network will never change, except between the swapped activation layer and pooling layer. Further, the present invention has an object of providing a method and a device for optimizing a CNN to reduce computing costs while at the same time maintaining the integrity of the output thereof.
In order to achieve the aforementioned objects, the present invention provides a device and a method which can optimize the processing operations of a CNN while maintaining the integrity of the output thereof.
Therefore, a first aspect of the present invention provides an information processing device for a neural network, the information processing device including a neural network reconfiguration unit configured to swap an order of activation processing and pooling processing in a target portion of a target neural network in which convolution processing, the activation processing, and the pooling processing occur in order, when the activation processing is a non-decreasing function and the pooling processing is a max function; and a processing unit configured to process input data by using the target neural network as reconfigured by the neural network reconfiguration unit.
A second aspect of the present invention, in accordance with the first aspect, further includes a neural network analyzation unit configured to analyze the target neural network by identifying a target portion to be reconfigured by the neural network reconfiguration unit.
A third aspect of the present invention provides a computer-implemented information processing method for a neural network, the method including identifying a target portion in which the neural network is configured to perform, in order, convolution processing, activation processing, and pooling processing; when the activation processing of the target portion is a non-decreasing function and the pooling processing of the target portion is a max function, swapping the order of the activation processing and the pooling processing in the target portion of the neural network so as to reconfigure the neural network; and processing input data using the reconfigured neural network.
A fourth aspect of the present invention provides a non-transitory computer readable medium containing program instructions for causing a computer to perform the method of the third aspect.
Advantageous Effects of Invention
The present invention improves a neural network by reducing the number of operations performed by, and the computational costs (in terms of speed and power consumption) of, an information processing device or computer implementing the processes of the neural network, while maintaining the integrity of the output of the neural network.
BRIEF DESCRIPTION OF DRAWINGS
FIG. 1 is a block diagram of a configuration of a computer system by which an information processing device according to exemplary embodiments of the present disclosure may be achieved.
FIG. 2 is a block diagram which represents a general configuration of a CNN model.
FIG. 3 is a block diagram of a schematic configuration for a simplified representation of a state of the art CNN model such as Alexnet.
FIG. 4 is a block diagram of a schematic configuration in which the swapping technique transforms each occurrence of convolution-activation-pooling in an input CNN model to convolution-pooling-activation, after determining that it is safe to do so.
FIG. 5 shows an example of how the swapping technique in accordance with the present invention may reduce the number of operations compared to the conventional technique.
FIG. 6 shows a flow of control used in the present invention to determine whether or not to perform the swapping of the activation and pooling layers.
FIGS. 7A, 7B, and 7C are examples showing how the swapping technique in accordance with the present invention can reduce the number of operations when a max, tanh, or sigmoid function, respectively, is used for the activation operation and a max function is used for the pooling layer, while maintaining the integrity of the output results.
FIGS. 8A, 8B, and 8C are examples showing how the swapping technique in accordance with the present invention can reduce the number of operations when there exists overlapping between consecutive pooling operations and a max, tanh, or sigmoid function, respectively, is used for the activation operation and a max function is used for the pooling layer, while maintaining the integrity of the output results.
FIG. 9 shows an example in which swapping the activation layer and the pooling layer results in a different output than the case in which the layers are not swapped due to an improper selection for the pooling layer function.
FIG. 10 is a block diagram showing an example of a reconfiguration of a CNN model.
EXEMPLARY EMBODIMENTS FOR CARRYING OUT THE INVENTION
It is to be understood that the figures and descriptions of the present invention have been simplified to illustrate elements that are relevant for a clear understanding of the present invention. Hereinafter, embodiments of the present invention will be described with reference to the figures. Fig. 1 is a block diagram of a configuration of a computer 100 (also referred to as a "computer system") by which information processing apparatuses according to below-described exemplary embodiments of the present disclosure may be achieved. The computer system 100 includes a processor 110, a cache subsystem 120, a GPU subsystem 130, a graphics output device 140, a memory bridge 150, an I/O (Input/Output) subsystem 160, input devices 170 (e.g. a mouse 171 and a keyboard 172), a memory subsystem 180, and secondary storage 190. The computer system 100 may include a plurality of graphics output devices 140. The processor 110 includes registers 111. The registers 111 are used to stage data used by execution units included in the processor 110 from the cache subsystem 120. The registers 111 and other parts of the processor 110 are present on the same chip to reduce latency. The cache subsystem 120 may have two or more levels of cache. The processor 110 and at least one level of the cache subsystem may be implemented on the same chip. The number (e.g. level 1, level 2, level 3, etc.) and locations (on chip or off chip of the processor 110) of the levels may vary among systems having different architectures. Therefore, to abstract away the variation in configuration among systems having different architectures, the cache subsystem 120 is shown as a module separate from the processor 110. The input devices 170, such as the mouse 171 and the keyboard 172, are connected to the memory bridge 150 via the I/O subsystem 160.
In order to optimize an existing neural network 200 in accordance with the present invention, sections of the neural network 200 which may be capable of being optimized are identified by searching for occurrences of, in order, a convolution layer 220, an activation layer 230, and a pooling layer 240. Such sections are present in the example neural networks 200 shown in Figs. 2 and 3.
Here, it can be seen that there is one occurrence of a convolution layer 220 followed by an activation layer 230 followed by a pooling layer 240 in Fig. 2, and three occurrences in Fig. 3. It should be noted that while the layers in Figs. 2 and 3 are shown having different shapes and sizes, the same reference numbers are used in accordance with the type of layer. Layers with the same reference number do not necessarily perform the same processing function, as will be described in more detail below.
In the example of Fig. 2, input data 210 is input to the neural network 200, first to a convolution layer 220 where convolution is performed on the input data 210. The convolution processing function is the same function for all convolution layers. Next, the output of the convolution layer 220 is input to the activation layer 230 where activation processing is performed. The activation processing may be (depending on the neural network configuration) one of any number of processing functions commonly used for activation processing, such as a max function, a tanh function, a sigmoid function, or the like. It is possible that multiple activation layers 230 occurring within the same neural network 200 (as in Fig. 3) may use different processing functions as the activation processing of each layer. Next, the output of the activation layer 230 is input to the pooling layer 240 where pooling processing is performed. The pooling processing may be (depending on the neural network configuration) one of any number of processing functions commonly used for pooling processing, such as a max function, an average function, or the like. Again, it is possible that multiple pooling layers 240 occurring within the same neural network may use different processing functions as the pooling processing.
Next, the swapping technique used in the present invention will be described. As previously mentioned, when there is an occurrence of, in order, a convolution layer 220, an activation layer 230, and a pooling layer 240 in a neural network 200, it may be possible to optimize this portion of the neural network 200 if swapping the order of the activation layer 230 and the pooling layer 240 would not cause a degradation of the output of this portion of the neural network. Therefore, the activation layer 230 is examined (analyzed) to confirm that the activation processing performed thereby is a non-decreasing function, and the pooling layer 240 is examined (analyzed) to confirm that the pooling processing performed thereby is a max function. This is because, if the activation layer has a decreasing function or the pooling layer is an averaging function (as in Fig. 9), swapping the order of the activation layer and the pooling layer may change the output of this portion of the neural network or may produce dead neurons, which have the potential to adversely affect the output of the entire neural network.
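The reason this check suffices can be verified directly: for any non-decreasing function f and any pooling window of values x1, ..., xn, it holds that f(max(x1, ..., xn)) = max(f(x1), ..., f(xn)), so max pooling and the activation commute. The following minimal Python sketch (added here for illustration; it is not part of the patent text, and the window values are taken from the FIG. 7A example) checks this identity for the activation functions mentioned above:

```python
import math

# Non-decreasing activation functions named in the description.
relu = lambda x: max(x, 0.0)                    # the "max" activation, i.e. max(value, 0)
sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

window = [-9.0, 6.0, 9.0, 4.0]                  # one pooling window of convolution outputs

for f in (relu, math.tanh, sigmoid):
    pool_then_activate = f(max(window))             # reconfigured order
    activate_then_pool = max(f(v) for v in window)  # original order
    assert math.isclose(pool_then_activate, activate_then_pool)
```

Replacing max pooling with an average in this sketch breaks the identity, which is exactly the FIG. 9 situation discussed below.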
In a case that there is an occurrence of, in order, a convolution layer 220, an activation layer 230, and a pooling layer 240; the activation layer 230 is a non-decreasing function; and the pooling layer 240 is a max function; then swapping of the activation layer 230 and the pooling layer 240 is performed in order to reconfigure this portion of the neural network to become the convolution layer 220 followed by the pooling layer 240 followed by the activation layer 230. This swapping may be performed on as many applicable occurrences of a convolution layer 220 followed by an activation layer 230 followed by a pooling layer 240 in a neural network as necessary to reconfigure the entire neural network. For example, in the case of Fig. 3, reconfiguration may occur up to three times if all the occurrences of the convolution, activation, and pooling layers (in order) meet the above requirements for the swapping technique.
A first embodiment of the present invention is an information processing device for a neural network that determines whether or not it is safe to swap the order of the activation layer and the pooling layer without changing the output.
Fig. 2 is a block diagram of a general configuration of a CNN model. It shows that a simplified CNN model, which may be used, for example, for image recognition, can be comprised of an input 210, followed by a neural network 200, followed by a fully connected layer 250, which may also generate the output of the CNN model. The neural network 200 in this example includes a convolution layer 220, followed by an activation layer 230, followed by a pooling layer 240.
Fig. 3 is a block diagram of a schematic configuration for a simplified representation of an actual CNN model used for image recognition in the real world, according to one embodiment of the invention. The input 210 is an image input to the CNN network. The input 210 is connected to a neural network 200, which is connected to a softmax 290, which is the output of the system. This layer (softmax 290) finds the class of the object in the image. The neural network 200 includes any number of processing units, which are potential target portions for reconfiguration. One processing unit is a combination of a convolution layer 220 and an activation layer 230, and optionally a pooling layer 240. The neural network 200 may further include one or more fully connected layers 250, each of which may be combined with an activation layer 230.
As shown in Fig. 3, patterns of the order of layers, i.e. convolution 220 followed by activation 230 followed by pooling 240, are very common in the network. The input 210 of the network is forwarded to the first processing unit (a target network or a target portion) and taken as an input by the convolution layer 220 in the unit, which performs the task of pattern matching, also known as the operation of convolution. The output of the convolution layer 220 is given as an input to the activation layer 230, which performs the task of activation. The activation layer 230 can perform the operation of activation using any one of various functions, such as max, tanh, or sigmoid.
Although sigmoid was a very common function used for activation in some older CNN architectures, it has fallen out of fashion in most CNN networks nowadays. The output of the activation layer 230 may be given as an input to the pooling layer 240, which performs the task of pooling, or resizing. The pooling layer 240 can use one of various functions, such as max or average, in order to perform the task of resizing. For performing the pooling, the max operation (max function) is most preferable, while the average function can also be used in practice.
The neural network 200 in this embodiment includes three repeated combinations of one fully connected layer 250 followed by one activation layer 230. These layers basically perform the task of a linear classifier, which is used to generate the scores for several classes of the input. The softmax layer calculates the loss or error incurred during training, or the accuracy during the testing phase.
Fig. 4 shows a block diagram of the swapping technique. If a pattern of convolution-activation-pooling is found in the input CNN model, the swapping technique first determines whether it is safe to swap the activation and pooling layers, using the steps explained in Fig. 6. If it is found to be safe, then each such occurrence of convolution-activation-pooling is replaced with convolution-pooling-activation.
FIG. 5 shows an example of how the embodiment may reduce the number of operations compared to the related art. In the related art (NPL 1), the total number of operations performed by the activation layer and the pooling layer together is 4 + 3 = 7, whereas in the embodiment the number of operations is reduced to 3 + 1 = 4.
FIG. 6 shows the flow of control used by the proposed technique to determine whether it is safe to perform the swapping of the activation and pooling layers. In step S610, the swapping technique first analyzes the input CNN model configuration, searching for a pattern in which a convolution layer is followed by an activation layer, which is further followed by a pooling layer. In step S620, if such a pattern is found, the function used for the activation layer is checked in step S630; if no such pattern is found in the input CNN model, the flow reaches the end of the reconfiguration processing. In step S630, if it is found that a non-decreasing (or monotonically increasing) function is used, control reaches step S640, where the function used for the pooling layer is checked; if a non-decreasing function is not used for the activation function, control returns to step S610. In step S640, if it is found that a max function is used for the pooling layer, control reaches step S650; otherwise, control returns to step S610. In step S650, the reconfiguration of the network is found to be safe, and the occurrence of convolution-activation-pooling is replaced with convolution-pooling-activation.
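As a sketch only (the patent defines the control flow of Fig. 6, not any code), steps S610 to S650 might look like the following in Python; the dictionary-based layer encoding and the set of non-decreasing activation names are assumptions introduced purely for illustration:

```python
# Assumption: each layer is a dict such as {"type": "activation", "function": "tanh"};
# this encoding is illustrative and not defined by the patent.
NON_DECREASING = {"max", "relu", "tanh", "sigmoid"}

def reconfigure(layers):
    """Apply the FIG. 6 flow: find convolution-activation-pooling (S610/S620),
    check the activation (S630) and pooling (S640) functions, and swap (S650)."""
    layers = list(layers)
    for i in range(len(layers) - 2):
        conv, act, pool = layers[i], layers[i + 1], layers[i + 2]
        if (conv["type"] == "convolution"
                and act["type"] == "activation"
                and pool["type"] == "pooling"
                and act["function"] in NON_DECREASING  # S630: non-decreasing activation?
                and pool["function"] == "max"):        # S640: max pooling?
            layers[i + 1], layers[i + 2] = pool, act   # S650: safe, swap the two layers
    return layers

model = [{"type": "convolution", "function": "conv"},
         {"type": "activation", "function": "relu"},
         {"type": "pooling", "function": "max"}]
print([layer["type"] for layer in reconfigure(model)])
# ['convolution', 'pooling', 'activation']
```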
FIG. 7A is an example showing how the proposed idea can reduce the number of operations in the case when a max function is used for the activation operation and a max function is used for the pooling layer. FIG. 7A considers a case in which there is no overlap between pooling operations. In the related art (NPL 1), the convolution layer is followed by the activation layer, and the activation layer is followed by the pooling layer, whereas in the embodiment, the convolution layer is followed by the pooling layer, and the pooling layer is followed by the activation layer. The dashed line in FIG. 7A represents the activation operation, i.e. the maximum of a value and 0 is calculated. The solid line represents the pooling operation, i.e. the quaternary maximum of the four values inside the solid-line window is calculated. Therefore, the dashed line over -9 in NPL 1 represents that max(-9,0) is calculated, and the result, i.e. 0, is saved at the appropriate location. Such an activation operation is performed for each value from the convolution layer. The solid line over 0, 0, 0, 0 in NPL 1 represents that max(0,0,0,0) is calculated, and the result, i.e. 0, is saved at the appropriate location. In NPL 1 there is one binary max operation for each element in the activation layer; since there are sixteen elements in the activation layer, there are sixteen binary max operations. Also, in NPL 1 there is one quaternary max operation for each non-overlapped window, and there are four such windows, i.e. (0,6,9,4), (1,0,0,10), (3,0,0,0), (0,0,0,0); one quaternary max operation for a window consists of three binary max operations. For example, max(0,0,0,0) is calculated as m1=max(0,0), m2=max(m1,0), m3=max(m2,0), where m3=0 is the final output. Therefore, a total of 16x1+4x3=28 binary max operations are performed in NPL 1.
In the embodiment, the convolution layer is followed by a pooling layer, and the pooling layer is followed by the activation layer. The solid line over -9, 6, 9, 4 in the embodiment represents that max(-9,6,9,4) is calculated, and the result, i.e. 9, is saved at the appropriate location. The dashed line over -1 represents that max(-1,0) is calculated, and the result, i.e. 0, is saved at the appropriate location. In the embodiment there is one quaternary max operation for each non-overlapped window, and there are four such windows, i.e. (-9,6,9,4), (1,-1,-8,10), (3,-8,-7,-2), (-1,-5,-7,-9); one quaternary max operation for a window consists of three binary max operations. For example, max(-9,6,9,4) is calculated as m1=max(-9,6), m2=max(m1,9), m3=max(m2,4), where m3=9 is the final output. Also, in the embodiment, one binary max operation is performed for each element of the output of the pooling layer, i.e. four binary max operations are performed for the four elements in the pooling layer. Therefore, a total of 4x3+4x1=16 binary max operations are performed in the embodiment, whereas 28 binary max operations were performed in NPL 1.
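These counts can be reproduced mechanically. The Python sketch below (an illustration added here, using the FIG. 7A window values) confirms that both orders produce the same pooled outputs while the reconfigured order performs 16 rather than 28 binary max operations:

```python
relu = lambda x: max(x, 0.0)

# The four non-overlapping 2x2 windows of convolution output from FIG. 7A.
windows = [(-9, 6, 9, 4), (1, -1, -8, 10), (3, -8, -7, -2), (-1, -5, -7, -9)]

# NPL 1 order: activation over all 16 values, then one max pooling per window.
npl1_ops = 16 * 1 + len(windows) * 3            # 28 binary max operations
npl1_out = [max(relu(v) for v in w) for w in windows]

# Reconfigured order: max pooling per window, then activation over 4 values.
swap_ops = len(windows) * 3 + len(windows) * 1  # 16 binary max operations
swap_out = [relu(max(w)) for w in windows]

assert npl1_out == swap_out == [9, 10, 3, 0]
print(npl1_ops, swap_ops)  # 28 16
```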
FIG. 7B is an example showing how the proposed idea can reduce the number of operations in the case when a tanh function is used for the activation operation, a max function is used for the pooling layer, and there is no overlap between successive pooling operations. FIG. 7B is almost identical to FIG. 7A; the only difference is the function used for the activation layer, i.e. FIG. 7A uses the max function for the activation layer while FIG. 7B uses the tanh function, but both figures use the max function in the pooling layer. Therefore, a total of 16x1=16 tanh operations and 4x3=12 binary max operations are performed in NPL 1. On the contrary, a total of 4x3=12 binary max operations and 4x1=4 tanh operations are performed in the embodiment.
FIG. 7C is an example showing how the proposed idea can reduce the number of operations in the case when a sigmoid function is used for the activation operation, a max function is used for the pooling layer, and there is no overlap between successive pooling operations. FIG. 7C is almost identical to FIG. 7A; the only difference is the function used for the activation layer, i.e. FIG. 7A uses the max function for the activation layer while FIG. 7C uses the sigmoid function, but both figures use the max function in the pooling layer. Therefore, a total of 16x1=16 sigmoid operations and 4x3=12 binary max operations are performed in NPL 1. On the contrary, a total of 4x3=12 binary max operations and 4x1=4 sigmoid operations are performed in the embodiment.
It can easily be seen from the examples of FIG. 7A, FIG. 7B, and FIG. 7C that the proposed technique reduces the number of operations performed in the activation layer, without changing the overall output, compared to NPL 1.
FIG. 8A is an example showing how the proposed idea can reduce the number of operations in the case when a max function is used for the activation operation, a max function is used for the pooling layer, and there exists overlapping between successive pooling operations. In NPL 1, the convolution layer is followed by the activation layer, and the activation layer is followed by the pooling layer, whereas in the embodiment, the convolution layer is followed by the pooling layer, and the pooling layer is followed by the activation layer. The dashed line in FIG. 8A represents the activation operation, i.e. the maximum of a value and 0 is calculated. The solid line represents the pooling operation, i.e. the maximum of the four values inside the solid-line window is calculated. Therefore, the dashed line over -9 in NPL 1 represents that max(-9,0) is calculated, and the result, i.e. 0, is saved at the appropriate location. Such an activation operation is performed for each value from the convolution layer. The solid lines over 3,0,0,0 and 0,0,0,0 in NPL 1 represent that max(3,0,0,0) and max(0,0,0,0) are calculated. This figure differs from FIG. 7A in that in FIG. 8A there is overlap between two solid lines, i.e. between pooling operations. In NPL 1 there is one binary max operation for each element in the activation layer; since there are sixteen elements in the activation layer, there are 16 binary max operations. Also, in NPL 1 there is one quaternary max operation for each overlapped window, and there are nine such windows, i.e. (0,6,9,4), (6,1,4,0), (1,0,0,10), (9,4,3,0), (4,0,0,0), (0,10,0,0), (3,0,0,0), (0,0,0,0), (0,0,0,0); one quaternary max operation for a window consists of three binary max operations. For example, max(3,0,0,0) is calculated as m1=max(3,0), m2=max(m1,0), m3=max(m2,0), where m3=3 is the final output. Therefore, a total of 16x1+9x3=43 binary max operations are performed in NPL 1.
In the embodiment, the convolution layer is followed by a pooling layer, and the pooling layer is followed by the activation layer. The solid line over -9, 6, 9, 4 in the embodiment represents that max(-9,6,9,4) is calculated, and the result, i.e. 9, is saved at the appropriate location. The dashed line over -1 represents that max(-1,0) is calculated, and the result, i.e. 0, is saved at the appropriate location. In the embodiment there is one quaternary max operation for each window, and there are nine such windows, i.e. (-9,6,9,4), (6,1,4,-8), (1,-1,-8,10), (9,4,3,-8), (4,-8,-8,-1), (-8,10,-1,-5), (3,-8,-7,-2), (-8,-1,-2,-7), (-1,-5,-7,-9); one quaternary max operation for a window consists of three binary max operations. For example, max(-9,6,9,4) is calculated as m1=max(-9,6), m2=max(m1,9), m3=max(m2,4), where m3=9 is the final output. Also, in the embodiment, one binary max operation is performed for each element of the output of the pooling layer, i.e. nine binary max operations are performed for the nine elements in the pooling layer. Therefore, a total of 9x3+9x1=36 binary max operations are performed in the embodiment, whereas 43 binary max operations were performed in NPL 1.
FIG. 8B is an example showing how the proposed idea can reduce the number of operations in the case when a tanh function is used for the activation operation, a max function is used for the pooling layer, and there is overlap between successive pooling operations. FIG. 8B is almost identical to FIG. 8A; the only difference is the function used for the activation layer, i.e. FIG. 8A uses the max function for the activation layer while FIG. 8B uses the tanh function, but both figures use the max function in the pooling layer. Therefore, a total of 16x1=16 tanh operations and 9x3=27 binary max operations are performed in NPL 1. On the contrary, a total of 9x3=27 binary max operations and 9x1=9 tanh operations are performed in the embodiment.
FIG. 8C is an example showing how the proposed idea can reduce the number of operations in the case when a sigmoid function is used for the activation operation, a max function is used for the pooling layer, and there is overlap between successive pooling operations. FIG. 8C is almost identical to FIG. 8A; the only difference is the function used for the activation layer, i.e. FIG. 8A uses the max function for the activation layer while FIG. 8C uses the sigmoid function, but both figures use the max function in the pooling layer. Therefore, a total of 16x1=16 sigmoid operations and 9x3=27 binary max operations are performed in NPL 1. On the contrary, a total of 9x3=27 binary max operations and 9x1=9 sigmoid operations are performed in the embodiment.
It can easily be seen from the examples of FIG. 8A, FIG. 8B, and FIG. 8C that the proposed technique reduces the number of operations performed in the activation layer without changing the overall output of the three operations, even when there exists overlapping between the pooling operations.
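The same equality can be checked for the overlapping case with a stride-1 window generator. In the sketch below (added for illustration; the 4x4 matrix of convolution outputs is reconstructed from the window values listed above for FIG. 8A), both orders of a sigmoid activation and overlapping max pooling agree on all nine outputs:

```python
import math

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

# 4x4 convolution output reconstructed from the FIG. 8A window values.
conv_out = [
    [-9,  6,  1, -1],
    [ 9,  4, -8, 10],
    [ 3, -8, -1, -5],
    [-7, -2, -7, -9],
]

def pool_windows(m, size=2, stride=1):
    """Yield the 2x2 pooling windows; stride=1 gives the nine overlapping windows."""
    for i in range(0, len(m) - size + 1, stride):
        for j in range(0, len(m[0]) - size + 1, stride):
            yield [m[i + di][j + dj] for di in range(size) for dj in range(size)]

# NPL 1 order (activation, then pooling) versus the embodiment (pooling, then activation).
npl1 = [max(sigmoid(v) for v in w) for w in pool_windows(conv_out)]
embodiment = [sigmoid(max(w)) for w in pool_windows(conv_out)]
assert all(math.isclose(a, b) for a, b in zip(npl1, embodiment))
```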
FIG. 9 shows an example of how PTL 1 can change the output after swapping the activation and pooling layers when an average function is used for pooling. In NPL 1, first, in the activation layer, the max of each element and 0 is determined, for example max(-5,0). There are four such elements, i.e. -5, 5, -5, 5. In NPL 1, in the pooling layer, the average function is then used to calculate the average of the four elements, i.e. avg(0,0,5,5) = 2.5. In contrast, PTL 1, which suggests performing pooling before activation, changes the output: first the average of the four elements is taken, i.e. avg(-5,5,-5,5) = 0, and then the max of that element and 0, i.e. max(0,0) = 0. Hence it can be seen that the output of NPL 1, i.e. 2.5, is different from that of PTL 1, i.e. 0. In the embodiment, such a case will never exist, because it will be found in step S640 that the function used for the pooling layer is not a max function, and hence such a swapping or reconfiguration is not safe in the proposed idea.
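Expressed numerically (a sketch added for illustration, using the values of FIG. 9):

```python
relu = lambda x: max(x, 0.0)
avg = lambda vals: sum(vals) / len(vals)

window = [-5, 5, -5, 5]  # the four convolution outputs of FIG. 9

activation_then_pooling = avg([relu(v) for v in window])  # avg(0, 5, 0, 5) = 2.5
pooling_then_activation = relu(avg(window))               # relu(0) = 0.0

print(activation_then_pooling, pooling_then_activation)   # 2.5 0.0 -- the orders differ
```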
It should be noted that a program capable of implementing functionalities of the information processing method according to the present invention may be recorded in a non-transitory computer readable medium, and the operations of identifying target portions of a neural network to be optimized (i.e., swapping the activation layer and pooling layer of the neural network), and the like, may be performed by causing a computer system to read and execute the program recorded in the computer readable medium. The term "computer system" used herein includes software such as an operating system (OS) and hardware devices such as peripherals. In addition, the "computer system" may also include a world wide web (WWW) system capable of providing a website environment (or a display environment). Further, the term "computer readable media" refers to portable media such as a flexible disk, a magneto-optical (MO) disc, a read-only memory (ROM), and a compact disc (CD) ROM, and a storage device built into the computer system such as a hard disk. Moreover, the "computer readable media" includes media capable of maintaining the program during a certain period of time, such as a volatile memory (random-access memory (RAM)) inside a computer system serving as a server or a client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line. The program may be transmitted from the computer system in which the program is stored in, for example, the storage device, to another computer system through transmission media or by transmission waves in the transmission media. Here, the term "transmission media" for transmitting the program refers to media capable of transmitting information, like a network (communication network) such as the Internet or a communication circuit (communication line) such as a telephone line. Furthermore, the program may also be a program for implementing a part of the aforementioned functionalities, or a discrete file (discrete program) in which the aforementioned functionalities are implemented in combination with a program that has already been recorded in the computer system.
While preferred embodiments of the invention have been described and illustrated above, it should be understood that these are exemplary of the invention and are not to be considered as limiting. Additions, omissions, substitutions, and other modifications can be made without departing from the spirit or scope of the present invention. Accordingly, the invention is not to be considered as being limited by the foregoing description, and is only limited by the scope of the appended claims.
INDUSTRIAL APPLICABILITY
The present invention can be applied to the field of data processing, particularly image processing, text processing, speech processing, and machine learning.
Reference Signs List
110 processor
111 registers
120 cache subsystem
130 GPU subsystem
140 graphics output device(s)
150 memory bridge
160 I/O subsystem
170 mouse/keyboard
180 memory subsystem
181 OS
182 driver
183 application
190 secondary storage
200 neural network
201 reconfigured neural network
210 input
220 convolution layer
230 activation layer
240 pooling layer
250 output / fully connected layer
290 softmax
300 swapping technique

Claims

1. An information processing device for a neural network, the information processing device comprising:
a neural network reconfiguration unit configured to swap an order of activation processing and pooling processing in a target portion of a target neural network in which convolution processing, the activation processing, and the pooling processing occur in order, when the activation processing is a non-decreasing function and the pooling processing is a max function; and
a processing unit configured to process input data by using the neural network as reconfigured by the neural network reconfiguration unit.
2. The information processing device of claim 1, further comprising:
a neural network analyzation unit configured to analyze the target neural network by identifying a target portion to be reconfigured by the neural network reconfiguration unit.
3. A computer-implemented information processing method for a neural network, the method comprising:
identifying a target portion in which the neural network is configured to perform, in order, convolution processing, activation processing, and pooling processing;
when the activation processing of the target portion is a non-decreasing function and the pooling processing of the target portion is a max function, swapping the order of the activation processing and the pooling processing in the target portion of the neural network so as to reconfigure the neural network; and
processing input data using the reconfigured neural network.
4. A non-transitory computer readable medium containing program instructions for causing a computer to perform the method of claim 3.
PCT/JP2016/068741 2016-06-17 2016-06-17 Information processing method and device for neural network WO2017216976A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2016/068741 WO2017216976A1 (en) 2016-06-17 2016-06-17 Information processing method and device for neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2016/068741 WO2017216976A1 (en) 2016-06-17 2016-06-17 Information processing method and device for neural network

Publications (1)

Publication Number Publication Date
WO2017216976A1 (en) 2017-12-21

Family

ID=56507773

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2016/068741 WO2017216976A1 (en) 2016-06-17 2016-06-17 Information processing method and device for neural network

Country Status (1)

Country Link
WO (1) WO2017216976A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020038462A1 (en) * 2018-08-24 2020-02-27 深圳市前海安测信息技术有限公司 Tongue segmentation device and method employing deep learning, and storage medium
CN111868754A (en) * 2018-03-23 2020-10-30 索尼公司 Information processing apparatus, information processing method, and computer program
CN112801266A (en) * 2020-12-24 2021-05-14 武汉旷视金智科技有限公司 Neural network construction method, device, equipment and medium
CN112995333A (en) * 2021-04-02 2021-06-18 深圳市大富网络技术有限公司 Remote file activation method, system and related device
US11892925B2 (en) 2018-10-19 2024-02-06 Samsung Electronics Co., Ltd. Electronic device for reconstructing an artificial intelligence model and a control method thereof

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150309961A1 (en) 2014-04-28 2015-10-29 Denso Corporation Arithmetic processing apparatus
US20150324685A1 (en) * 2014-05-07 2015-11-12 Seagate Technology Llc Adaptive configuration of a neural network device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150309961A1 (en) 2014-04-28 2015-10-29 Denso Corporation Arithmetic processing apparatus
US20150324685A1 (en) * 2014-05-07 2015-11-12 Seagate Technology Llc Adaptive configuration of a neural network device

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111868754A (en) * 2018-03-23 2020-10-30 索尼公司 Information processing apparatus, information processing method, and computer program
WO2020038462A1 (en) * 2018-08-24 2020-02-27 深圳市前海安测信息技术有限公司 Tongue segmentation device and method employing deep learning, and storage medium
US11892925B2 (en) 2018-10-19 2024-02-06 Samsung Electronics Co., Ltd. Electronic device for reconstructing an artificial intelligence model and a control method thereof
CN112801266A (en) * 2020-12-24 2021-05-14 武汉旷视金智科技有限公司 Neural network construction method, device, equipment and medium
CN112801266B (en) * 2020-12-24 2023-10-31 武汉旷视金智科技有限公司 Neural network construction method, device, equipment and medium
CN112995333A (en) * 2021-04-02 2021-06-18 深圳市大富网络技术有限公司 Remote file activation method, system and related device
CN112995333B (en) * 2021-04-02 2023-05-23 深圳市大富网络技术有限公司 Remote file activation method, system and related device

Similar Documents

Publication Publication Date Title
Loni et al. DeepMaker: A multi-objective optimization framework for deep neural networks in embedded systems
WO2017216976A1 (en) Information processing method and device for neural network
US20210256355A1 (en) Evolving graph convolutional networks for dynamic graphs
CN108345827B (en) Method, system and neural network for identifying document direction
CN113408743A (en) Federal model generation method and device, electronic equipment and storage medium
CN112598080A (en) Attention-based width map convolutional neural network model and training method thereof
CN104598579A (en) Automatic question and answer method and system
CN112862092B (en) Training method, device, equipment and medium for heterogeneous graph convolution network
US20180330229A1 (en) Information processing apparatus, method and non-transitory computer-readable storage medium
US20240028898A1 (en) Interpreting convolutional sequence model by learning local and resolution-controllable prototypes
McCarthy et al. Addressing posterior collapse with mutual information for improved variational neural machine translation
Chandak et al. A comparison of word2vec, hmm2vec, and pca2vec for malware classification
JP7063274B2 (en) Information processing equipment, neural network design method and program
US20210133552A1 (en) Neural network learning device, neural network learning method, and recording medium on which neural network learning program is stored
KR20210083624A (en) Method and apparatus for controlling data input and output of neural network
Yoon et al. Learning polymorphic neural ODEs with time-evolving mixture
WO2021177031A1 (en) Quantum computer, quantum computation method, and program
US20220076103A1 (en) Data Processing Processor, Corresponding Method and Computer Program.
CN112651492A (en) Self-connection width graph convolution neural network model and training method thereof
KR20220099749A (en) Malware detection device and method based on hybrid artificial intelligence
WO2022153711A1 (en) Training apparatus, classification apparatus, training method, classification method, and program
Manzan et al. A mathematical discussion concerning the performance of multilayer perceptron-type artificial neural networks through use of orthogonal bipolar vectors
CN114730331A (en) Data processing apparatus and data processing method
US20220108156A1 (en) Hardware architecture for processing data in sparse neural network
WO2022153710A1 (en) Training apparatus, classification apparatus, training method, classification method, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16741710

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16741710

Country of ref document: EP

Kind code of ref document: A1