WO2022153361A1

WO2022153361A1 - Training apparatus, control method, and computer-readable storage medium

Info

Publication number: WO2022153361A1
Application number: PCT/JP2021/000671
Authority: WO
Inventors: Charvi VITTHAL; Florian BEYE; Koichi Nihei; Hayato Itsumi; Yusuke Shinohara; Takanori Iwai
Original assignee: Nec Corporation
Priority date: 2021-01-12
Filing date: 2021-01-12
Publication date: 2022-07-21

Abstract

A training apparatus (2000) acquires a training dataset (10) including an input data (12). The training apparatus (2000) inputs the input data (12) into a target neural network (20) that includes a plurality of feature extraction layers (22). The feature extraction layer (22) generates a feature map (30) that includes a plurality of feature channels (32). The training apparatus (2000) acquires one or more of the feature maps (30) from the target neural network (20). The training apparatus (2000) computes inter-feature redundancy or intra-feature redundancy regarding the acquired feature map (30). The inter-feature redundancy represents redundancy between the feature channels (32) in the same feature map (30). The intra-feature redundancy represents redundancy in the feature channel (32). The training apparatus (200) trains the target neural network (20) using the inter-feature redundancy, the intra-feature redundancy, or both. The inter-feature redundancy is not computed as a correlation between the feature channels (32).

Description

TRAINING APPARATUS, CONTROL METHOD, AND COMPUTER-READABLE STORAGE MEDIUM

　　　　The present disclosure generally relates to training of neural networks.

　　　　Neural networks are widely used for various types of predictions. Those predictions are performed using information which the neural network extracts from an input data. Thus, it is important for the neural network to be able to efficiently extract information from the input data in order to perform an accurate prediction.

　　　　However, redundancy in the neural network hinder efficient extractions of the information from the input data. For example, the redundancy can be manifested in the form of multiple network components learning the same information. Such components simply increase the computational cost of the network, without increasing the accuracy of the prediction.

　　　　Regarding a way of reducing the redundancy in the neural network, NPL 1 discloses a technique to train a neural network so as to reduce a correlation between feature channels in the same feature map, thereby reducing the redundancy between the feature channels in the same feature map. Note that the feature channel refers to features that is extracted from data using a single filter, whereas the feature map refers to a set of feature channels that are extracted from the same data using different filters in a single layer.

　　　　NPL1: Jie Guo, Tingfa, Xu, and Ziyi Schen, "Stochastic Channel Decorrelation Network and its application to Visual Tracking", Computer Research Repository, arXiv:1807.01103, August 20, 2018

　　　　An objective of the present disclosure is to provide a new technique to reduce a redundancy in a neural network.

　　　　The present disclosure provides a training apparatus that comprises at least one processor; and memory storing instructions. The at least one processor is configured to execute the instructions to: acquire a training dataset including an input data; input the input data into a target neural network that includes a plurality of feature extraction layers, the feature extraction layer generating a feature map including a plurality of feature channels; acquire one or more of the feature maps from the target neural network; compute inter-feature redundancy or intra-feature redundancy regarding the acquired feature map, the inter-feature redundancy representing redundancy between the feature channels in a same feature map, the intra-feature redundancy representing redundancy in the feature channel; and train the target neural network using the computed inter-feature redundancy, the computed intra-feature redundancy, or both. The inter-feature redundancy is not computed as a correlation between the feature channels.

　　　　The present disclosure provides a control method performed by a computer. The control method comprises: acquiring a training dataset including an input data; inputting the input data into a target neural network that includes a plurality of feature extraction layers, the feature extraction layer generating a feature map including a plurality of feature channels; acquiring one or more of the feature maps from the target neural network; computing inter-feature redundancy or intra-feature redundancy regarding the acquired feature map, the inter-feature redundancy representing redundancy between the feature channels in a same feature map, the intra-feature redundancy representing redundancy in the feature channel; and training the target neural network using the computed inter-feature redundancy, the computed intra-feature redundancy, or both. The inter-feature redundancy is not computed as a correlation between the feature channels.

　　　　The present disclosure provides a computer-readable storage medium storing a program that causes a computer to perform: acquiring a training dataset including an input data; inputting the input data into a target neural network that includes a plurality of feature extraction layers, the feature extraction layer generating a feature map including a plurality of feature channels; acquiring one or more of the feature maps from the target neural network; computing inter-feature redundancy or intra-feature redundancy regarding the acquired feature map, the inter-feature redundancy representing redundancy between the feature channels in a same feature map, the intra-feature redundancy representing redundancy in the feature channel; and training the target neural network using the computed inter-feature redundancy, the computed intra-feature redundancy, or both. The inter-feature redundancy is not computed as a correlation between the feature channels.

　　　　　　According to the present disclosure, a new technique to reduce a redundancy in a neural network is provided.

Fig. 1 illustrates an overview of a training apparatus 2000 of the 1st example embodiment. Fig. 2 is a block diagram illustrating an example of a functional configuration of the training apparatus. Fig. 3 is a block diagram illustrating an example of the hardware configuration of a computer realizing the training apparatus. Fig. 4 is a flowchart illustrating an example of an overall flow of process performed by the training apparatus. Fig. 5 shows results of the experiment using the Balloon dataset. Fig. 6 shows results of the experiment using the COCO dataset. Fig. 7 shows experimental results regarding 3D object detection on 3D data. Fig. 8 shows experimental results regarding 3D object detection on BEV data.

　　Example embodiments according to the present disclosure will be described hereinafter with reference to the drawings. The same numeral signs are assigned to the same elements throughout the drawings, and redundant explanations are omitted as necessary.

　　　　FIRST EXAMPLE EMBODIMENT
　　　　<Overview>
　　　　Fig. 1 illustrates an overview of a training apparatus 2000 of the 1st example embodiment. Note that the overview illustrated by Fig. 1 shows an example operation of the training apparatus 2000 to make it easy to understand the training apparatus 2000, and does not limit or narrow the scope of possible operations of the training apparatus 2000.

　　　　The training apparatus 2000 is used to train a neural network. The neural network trained by the training apparatus 2000 is described as being "target neural network 20". There are a plurality of layers in the target neural network 20. Specifically, the target neural network 20 includes, at least, a plurality of feature extraction layers 22. The feature extraction layer 22 obtains data from the previous layer, performs predefined computations on the obtained data to generate a feature map 30, and outputs the generated feature map 30 to the next layer. In the case where the first layer of the target neural network 20 is a feature extraction layer 22, this layer 22 obtains the input data 40 instead of data generated by the previous layer since there is no layer previous to this layer 22.

　　　　Note that a feature extraction layer may also be called "a convolution layer" in the case where the target neural network 20 is a convolutional neural network (CNN). However, the type of the target neural network 20 is arbitrary, and not limited to CNN.

　　　　Also note that, although Fig. 1 depicts the target neural network 20 as having only the feature extraction layers 22, the target neural network 20 may further include other types of layers, such as a pooling layer, a ReLU layer, or a fully connected layer.

　　　　The feature extraction layer 22 has multiple filters (kernels), and performs computation on the obtained data using each filter, thereby generating a feature channel 32 for each filter. As a result, a set of feature channels 32 is obtained as the feature map 30. Suppose that the feature extraction layer 22 has three filters F1, F2, and F3, and obtains data D1. In this case, the feature extraction layer 22 performs a computation on the data D1 using the filer F1 to generate a feature channel C1, performs a computation on the data D1 using the filer F2 to generate a feature channel C2, and performs computation on the data D1 using the filer F3 to generate a feature channel C3. As a result, the feature map 30 including the feature channels C1, C2, and C3 is output to the next layer.

　　　　In order to train the target neural network 20, the training apparatus 2000 obtains a training dataset 10 including input data 12 and ground truth data 14. The ground truth data 14 represents an ideal output data that is to be output from the target neural network 20 in response to the corresponding input data 12 being input into the target neural network 20.

　　　　The training apparatus 2000 trains the target neural network 20 while taking redundancy in the features computed by the target neural network 20 into account. Specifically, the training apparatus 2000 inputs the input data 12 to the target neural network 20, and obtains one or more feature maps 30 that are generated in response to the input of that input data 12. Then, the training apparatus 2000 computes a redundancy index that represents the redundancy in the obtained one or more feature maps 30. The training apparatus 2000 updates the target neural network 20. The target neural network 20 is updated by updating trainable parameters of the target neural network 20, such as weights between nodes, using the redundancy index.

　　　　The redundancy index computed by the training apparatus 2000 may represent inter-feature redundancy or intra-feature redundancy. The inter-feature redundancy is one existing among a plurality of the feature channels 32 in a single feature map 30. On the other hand, the intra-feature redundancy is one existing in a single feature channel 32. Note that, the training apparatus 2000 may compute both the redundancy index representing the inter-feature redundancy and the redundancy index representing the intra-feature redundancy. Hereinafter, the redundancy index representing the inter-feature redundancy and the redundancy index representing the intra-feature redundancy are described as being "inter-feature redundancy index" and "intra-feature redundancy index", respectively.

　　　　<Example of Advantageous Effect>
　　According to the training apparatus 2000, the inter-feature redundancy between the feature channels 32, the intra-feature redundancy in the feature channel 32, or both are computed, and the target neural network 20 is trained taking those redundancy into account. By doing so, the redundancy in the feature maps 30 (in other words, the redundancy in the target neural network 20) can be reduced.

　　　　An effect of reducing the redundancy in the feature maps 30 is an increase of the accuracy of the target neural network 20. Specifically, by reducing the redundancy in the feature maps 30, the target neural network 20 can extract more pieces of meaningful information from an input data, and can therefore perform a prediction more accurately.

　　　　Note that reducing the size of the neural network is another way of reducing the redundancy in the neural network. However, the size reduction of the neural network would also reduce an accuracy of the neural network. On the other hand, the training apparatus 2000 reduces the redundancy in the target neural network 20 without reducing the size of the target neural network 20. Therefore, the training apparatus 2000 can effectively improve the accuracy of the target neural network 20.

　　Hereinafter, more detailed explanation of the training apparatus 2000 will be described.

　　　　<Example of Functional Configuration>
　　　　Fig. 2 illustrates an example of a functional configuration of the training apparatus 2000. The training apparatus 2000 includes a training dataset acquisition unit 2020, a feature map acquisition unit 2040, a redundancy computation unit 2060, and an update unit 2080. The training dataset acquisition unit 2020 acquires the training dataset 10. The feature map acquisition unit 2040 inputs the input data 12 in the training dataset 10 into the target neural network 20, and acquires one or more feature maps 30 from the target neural network 20. The redundancy computation unit 2060 computes the inter-feature redundancy index, the intra-feature redundancy, or both using the feature map 30. The update unit 2080 updates the target neural network 20 based on the computed inter-feature redundancy index, the computed intra-feature redundancy index, or both.

　　　　<Example of Hardware Configuration>
　　　　The training apparatus 2000 may be realized by one or more computers. Each of the one or more computers may be a special-purpose computer manufactured for implementing the training apparatus 2000, or may be a general-purpose computer like a personal computer (PC), a server machine, or a mobile device.

　　　　The training apparatus 2000 may be realized by installing an application in the computer. The application is implemented with a program that causes the computer to function as the training apparatus 2000. In other words, the program is an implementation of the functional units of the training apparatus 2000.

　　　　Fig. 3 is a block diagram illustrating an example of the hardware configuration of a computer 1000 realizing the training apparatus 2000. In Fig. 3, the computer 1000 includes a bus 1020, a processor 1040, a memory 1060, a storage device 1080, an input/output interface 1100, and a network interface 1120.

　　　　The bus 1020 is a data transmission channel in order for the processor 1040, the memory 1060, the storage device 1080, and the input/output interface 1100, and the network interface 1120 to mutually transmit and receive data. The processor 1040 is a processer, such as a CPU (Central Processing Unit), GPU (Graphics Processing Unit), or FPGA (Field-Programmable Gate Array). The memory 1060 is a primary memory component, such as a RAM (Random Access Memory) or a ROM (Read Only Memory). The storage device 1080 is a secondary memory component, such as a hard disk, an SSD (Solid State Drive), or a memory card. The input/output interface 1100 is an interface between the computer 1000 and peripheral devices, such as a keyboard, mouse, or display device. The network interface 1120 is an interface between the computer 1000 and a network. The network may be a LAN (Local Area Network) or a WAN (Wide Area Network).

　　　　The storage device 1080 may store the program mentioned above. The CPU 1040 executes the program to realize each functional unit of the training apparatus 2000.

　　　　The hardware configuration of the computer 1000 is not limited to the configuration shown in Fig. 3. For example, as mentioned-above, the training apparatus 2000 may be realized by plural computers. In this case, those computers may be connected with each other through the network.

　　　　The target neural network 20 may be implemented in the computer same as that implements the training apparatus 2000 (i.e. the computer 1000) or in a computer different from that implements the training apparatus 2000. In the latter case, the computer implementing the target neural network 20 may have a hardware configuration similar to or the same as the computer 1000 as depicted Fig. 3.

　　　　<Flow of Process>
　　　　Fig. 4 is a flowchart illustrating an example of a process performed by the training apparatus 2000. The training dataset acquisition unit 2020 acquires the input data 10 (S102). The feature map acquisition unit 2060 inputs the input data 12 into the target neural network 20 (S104). The feature map acquisition unit 2060 acquires the feature map 30 from the target neural network 20 (S106). The redundancy computation unit 2060 computes the redundancy index using the feature map 30 (S108). The update unit 2080 updates the target neural network 20 based on the computed redundancy index (S110).

　　　　Note that the training apparatus 2000 may use a plurality of the training datasets 10 to train the target neural network 20. Thus, the process depicted by Fig. 4 may be repeatedly performed for each of the plurality of the input data 10. However, the training dataset acquisition unit 2020 may acquire a plurality of the training datasets 10 all at once.

　　　　<Acquisition of Training Dataset 10: S102>
　　　　The training dataset acquisition unit 2020 acquires the training dataset 10 (S102). There are various ways to acquire the training dataset 10. For example, the training dataset acquisition unit 2020 acquires the training dataset 10 from a storage device in which the training dataset 10 is stored in advance and to which the training apparatus 2000 has access. In another example, the training dataset acquisition unit 2020 receives the training dataset 10 that is sent from another computer.

　　　　<Acquisition of Feature Map 30: S104, S106>
　　　　The feature map acquisition unit 2040 inputs the input data 12 in the input data set 10 into the target neural network 20 (S104), and acquires one or more feature maps 30 that are generated in response to the input of that input data 12 (S106). For this reason, the feature extraction layers 22 may be configured to output the feature map 30 not only to the next layer but also to the outside of the target neural network 20 at least in a training phase so that the feature map acquisition unit 2040 can obtain that feature map 30. For example, the feature extraction layer 22 is configured to output the feature map 30 into a storage device to which the training apparatus 2000 has access.

　　　　The feature map acquisition unit 2040 may acquire all of the feature maps 30 generated in response to the input of the input data 12, or acquire some of them. In the latter case, the feature map acquisition 2040 may obtain the feature map 30 from each one of a predetermined number of the feature extraction layers 22. For example, the feature map acquisition 2040 chooses the predetermined number of the feature extraction layers 22 in a random manner, and obtains the feature map 30 from each of those feature extraction layers 22. In another example, the feature extraction layers 22 from which the feature map extraction unit 2040 is to obtain the feature map 30 are specified by a user of the training apparatus 20 in advance.

　　　　<Computation of Redundancy in Features: S108>
　　　　The redundancy computation unit 2060 computes the redundancy index that represents redundancy in the features computed by the target neural network 20 (S108). Specifically, as described above, the inter-feature redundancy index, the intra-feature redundancy index, or both, may be computed. Hereinafter, both types of redundancy index are explained in detail.

　　　　<<Inter-Feature Redundancy Index>>
　　　　For each of the obtained feature maps 30, the redundancy computation unit 2060 may compute the inter-feature redundancy index representing the redundancy in one or more pairs of the feature channels 32. Note that the feature channels 32 in the same pair are included in the same feature map 30. Suppose that the feature map acquisition unit 2040 acquires the feature map M1 that includes three feature channels C11, C12 and C13, and the feature map M2 that includes four feature channels C21, C22, C23, and C 24. In this case, for example, the redundancy computation unit 2060 may compute the inter-feature redundancy index for all or some of nine possible pairs of the feature channels: (C11,C12), (C11,C13), (C12,C13), (C21,C22), (C21,C23), (C21,C24), (C22,C23), (C22,C24), and (C23,C24).

　　　　In the case where the inter-feature redundancy index is computed for some of the pairs of the feature channels 32, not all of them, in the feature map 30, the redundancy computation unit 2060 may, for example, generate a predetermined number of pairs of the feature channels 32 in a random manner, and computes the inter-feature redundancy index for each of the generated pairs of the feature channels 32. In another example, for which pairs of the feature channels 32 the inter-feature redundancy index is to be computed may be predetermined: e.g. specified in advance by the user of the training apparatus 2000. In this case, the feature redundancy computation unit 2060 computes the inter-feature redundancy index for each of the predetermined pairs of the feature channels 32.

　　　　When the inter-feature redundancy index is computed for each of the plural pairs of the feature channels 32, the redundancy computation unit 2060 may output a statistical value of the multiple inter-feature redundancy indices, such as a total value, a mean value, a median value, or a maximum value thereof. Suppose that the feature map acquisition unit 2040 acquires the feature map M1 that includes three feature channels C11, C12 and C13, and the feature map M2 that includes four feature channels C21, C22, C23, and C 24 as exemplified above. In this case, for example, the redundancy computation unit 2060 may compute the inter-feature redundancy index for each one of all nine possible pairs of the feature channels 32, and output the statistical value of the computed nine redundancy indices.

　　　　There are various concrete ways of computing the inter-feature redundancy index. For example, the redundancy computation unit 2060 computes a correlation between two feature channels 32 as the inter-feature redundancy index, based on the following equation (1).
　　　　Equation (1)

　　　
　　　　In the equation (1), R_inter(C_X,C_Y) represents the inter-feature redundancy index. Note that "_" is used to represent a subscript: e.g. C with a subscript X is denoted by C_X. C_X and C_Y are the feature channels 32 that are represented as vectors and for which the inter-feature redundancy is to be computed. μ_C_X and μ_C_Y represent the mean values of the elements of C_X and C_Y, respectively. σ_C_X and σ_C_Y represent the standard deviations of the elements of C_X and C_Y, respectively. |C_X| and |C_Y| represent the number of the elements of C_X and C_Y, respectively.

　　　　A high correlation between two feature channels 32 denotes that their corresponding neurons or filters are becoming high or low together: i.e. they are being activated by the same information; or they are learning the same information. This is a redundancy that should be removed from the target neural network 20. Hence, the correlation between two feature channels 32 can be used as a measure of the inter-feature redundancy between them.

　　　　In another example, the redundancy computation unit 2060 computes mutual information between the feature channels 32 as the inter-feature redundancy index. Mutual information is a direct measure of the information overlap between two feature channels 32. The higher the mutual information between the feature maps 32 is, the higher the redundancy between them is. This means that the mutual information between two feature channels can be used as a measure of the inter-feature redundancy between them.

　　　　To make the computation easier, the feature channels 32 can be modelled as a probability distribution, e.g. the Gaussian probability distribution N(μ,σ). In the case where the feature channels 32 are modeled as the Gaussian probability distribution, the inter-feature redundancy index R_inter(C_X,C_Y) representing the mutual information between the feature channels C_X and C_Y is defined by the following equation (2):
　　　　Equation (2)

　　　　　
　　　　In the equation (2), C_X is interpreted as a set of samples of the corresponding Gaussian random variable X. Similarly, C_Y is interpreted as a set of samples of the corresponding Gaussian random variable Y. σ_X and σ_Y represent the standard deviations of X and Y, respectively. cov(X,Y) is the covariance of X and Y.

　　　　In another example, the redundancy computation unit 2060 computes kernel-based distance as the inter-feature redundancy index. Positive definite kernels can be used to quantify the distance between two data. Some examples of positive definite kernels are the Gaussian Kernel or the Laplacian Kernel. For instance, the inter-feature redundancy index R_inter(C_X,C_Y) representing the Gaussian Kernel based distance between the feature channels C_X and C_Y is defined by the following equation (3):
　　　　Equation 3

　　　　
　　　　Note that γ is a predefined hyper-parameter.

　　　　In order to compute the inter-feature redundancy defined by the above-mentioned equations, the redundancy computation unit 2060 may estimate parameters in the equations using the feature maps 32 as shown in the following equations (4) and (5).
　　　　Equation 4

　　　　
　　　　Equation 5

　　　　　
　　　　Note that |C_X| represents the number of the elements of the feature channel C_X.

　　　　Note that the above-mentioned ways of computing the inter-feature redundancy are examples, and other ways may be also applied. In addition, the redundancy computation unit 2060 may compute the inter-feature redundancy of the feature map 30 using a combination of different types of the above-mentioned inter-feature redundancies. For example, the weighted average of two or more types of the inter-feature redundancies can be used.

　　　　<<Intra-Feature Redundancy>>
　　　　The intra-feature redundancy represents the redundancy in a single feature channel 32. Suppose that the feature map acquisition unit 2040 acquires the feature map M1 that includes three feature channels C11, C12 and C13, and the feature map M2 that includes four feature channels C21, C22, C23, and C24 as exemplified above. In this case, for example, the redundancy computation unit 2060 may compute the intra-feature redundancy index for each of C11, C12, C13, C21, C22, C23, and C24. However, the redundancy computation unit 2060 may compute the intra-feature redundancy index for some of the feature channels 32 in the acquired feature maps 30, not all of them. In this case, the redundancy computation 2060 may, for example, choose a predetermined number of the feature channels 32 in a random manner from the feature maps 30. In another example, for which feature channels 32 the inter-feature redundancy index is to be computed may be predetermined: e.g. specified in advance by the user of the training apparatus 2000. In this case, the feature redundancy computation unit 2060 computes the intra-feature redundancy index for each of the predetermined feature channels 32.

　　　　When the intra-feature redundancy index is computed for multiple feature channels 32, the redundancy computation unit 2060 may output a statistical value of the multiple intra-feature redundancy indices, such as a total value, a mean value, a median value, or a maximum value thereof. Suppose that the intra-feature redundancy index computed for the feature channel C is denoted by R_intra(C), and the intra-feature redundancy index is computed for each of the feature channels C1, C2, and C3. In this case, the redundancy computation unit 2060 may compute the statistical value of R_intra(C1), R_Intra(C2), and R_Intra(C3): e.g. an average of these three values.

　　　　There are various concrete ways of computing the intra-feature redundancy index. For example, the redundancy computation unit 2060 may use an entropy of the feature channel 32 to measure the intra-feature redundancy in the feature channel 32. Specifically, it is considered that the higher the entropy of the feature channel 32 is, the lower the intra-feature redundancy in the feature channels 32 is due to the following reason. In the feature channel 32, specific parts could be high based on the specific information present. In other words, the presence of specific information in the data would cause some of the network components (neurons or filters) to give a high activation but some other components to give a low activation to the feature channel. Although a feature channel 32 which is constantly high might also be an important piece of information, a multiplicity of such feature channels might simply be caused by the network layer not learning anything, and hence be redundant.

　　　　Thus, the redundancy computation unit 2060 computes the intra-feature redundancy index for the feature channel 32 based on the entropy of the feature channel 32. Similar to the equation (2), the feature channels 32 can be modelled as a probability distribution. In the case where the feature channel 32 is modeled as the Gaussian probability distribution, the intra-feature redundancy index R_intra(C_X) can be computed, for example, using the following equation (6).
　　　　Equation 6

　　　　In another example, the redundancy computation unit 2060 may use a sparsity of the feature channel 32 to measure the intra-feature redundancy in the feature channel 32, for the same motivation as that in the case of using entropy as the intra-feature redundancy. Specifically, a high value in certain sections of the feature channel 32 represents the presence or absence of specific information. The motivation behind increasing sparsity is to reduce the representation of the same information multiple times in the layer.

　　　　Thus, the redundancy computation unit 2060 may compute the intra-feature redundancy index for the feature channel 32 based on the sparsity of the feature channel 32. For example, the sparsity of the feature channel 32 can be represented by L1 norm thereof. In this case, the intra-feature redundancy index for the feature channel C_X can be computed, for example, using the following equation (7).
　　　　Equation 7

　　　　Note that the above-mentioned ways of computing the intra-feature redundancy are examples, and other ways may be also applied. In addition, the redundancy computation unit 2060 may compute the intra-feature redundancy of the feature channel 32 using a combination of different types of the above-mentioned inter-feature redundancies. For example, the weighted average of the different types of the intra-feature redundancies can be used.

　　　　<Update of Target Neural Network 20: S110>
　　　　The update unit 2080 updates the target neural network 20 using the redundancy index (S110). Basically, a neural network can be updated based on a difference between a prediction obtained from the neural network in response to the insertion of an input data and the ground truth data corresponding to that input data (in other words, the difference between an actual output and an ideal output of the neural network). This difference is called "loss." The objective of the update of the neural network is usually to minimize the loss.

　　　　In order to also take redundancy in the features into consideration, the update unit 2060 computes the loss based not only on the difference between the actual output and the ideal output of the target neural network 20, but also on the redundancy index computed by the redundancy computation unit 2060. Note that, hereinafter, the loss representing the difference between the actual output and the ideal output is described as "task-specific loss", whereas the loss in which both the task-specific loss and the redundancy computed by the redundancy computation unit 2060 are taken into account is described as "overall loss". The update unit 2060 computes an overall loss in response to the input of the input data 12, and updates the target neural network 20 based on the overall loss.

　　　　For instance, the overall loss is computed by the following equation (8).
　　　　Equation 8

　　　　　
　　　　In the equation (8), L_all and L_task represent the overall loss and the task-specific loss, respectively. R_inter and R_intra represent the inter-feature redundancy index and the intra-feature redundancy index, respectively. Coefficients λ1 and λ2 are pre-determined real numbers larger than 0.

　　　　Note that in the case where the inter-feature redundancy index is computed for each of multiple pairs of the feature channels 32, R_inter in the equation (8) represents the statistical value of the multiple inter-feature redundancy indices. Similarly, in the case where the intra-feature redundancy index is computed for each of multiple feature channels 32, R_intra in the equation (8) represents the statistical value of the multiple intra-feature redundancy indices.

　　　　In addition, although the equation (8) includes both the inter-feature redundancy index and the intra-feature redundancy index, the redundancy computation unit 2060 may compute just one of them. In the case where the redundancy computation unit 2060 does not compute the inter-feature redundancy index, the term "λ1*R_inter" is removed from the equation (8). On the other hand, in the case where the redundancy computation unit 2060 does not compute the intra-feature redundancy index, the term "λ2*R_intra" is removed from the equation (8).

　　　　The task-specific loss is computed based on the difference between the actual output obtained from the target neural network 20 in response to the input of the input data 12 and the ground truth data 14 corresponding to that input data 12. For example, the update unit 2080 computes the task-specific loss by evaluating a predetermined loss function for the actual output of the target neural network 20 and the ground truth data 14.

　　　　The update unit 2080 trains the target neural network 20 using the overall loss. Specifically, the update unit 2080 updates trainable parameters, such as weights between nodes, of the target network 20 using the overall loss. Note that there are various well-known techniques to update trainable parameters of a neural network based on a computed loss. The update unit 2080 may use any of those well-known techniques to update the target neural network 20 using the overall loss.

　　　　<Output of Training Apparatus 2000>
　　　　The training apparatus 2000 may output the parameters updated by the update unit 2080 in various manners. For example, the training apparatus 2000 may put the updated parameters into a storage device. In another example, the training apparatus 2000 may send the updated parameters to another computer. Specifically, in the case where the target neural network 20 is implemented in a computer different from that implementing the training apparatus 2000 (i.e. computer 1000), the training apparatus 2000 sends the updated parameters to the computer implementing the target neural network 20 so that the updated parameters are applied to the target neural network 20.

　　　　<Training Additional Parameters for Redundancy Computation>
　　　　As described above, some types of the redundancy index require parameters to compute them, such as the mean value or the standard deviation of the feature channel 32 used in the equation (1). The training apparatus 2000 may also handle those parameters as trainable parameters of the neural network 20. For example, each feature channel 32 is modelled as having specific properties, such as a specific probability distribution. More specifically, the feature channels 32 in the same feature map 30 may be assumed to be parts of a joint multi-variate Gaussian distribution N(μ,Σ). In this case, the mean vector μ and the covariance matrix Σ given in the following equations (9) and (10) are considered as trainable parameters, and therefore can be trained during the training of the target neural network 20.
　　　　Equation 9

　　　　
　　　　In the equation (9), n represents the number of the feature channels in the feature map 30 for which the redundancy is computed. μ_i represents the mean value of the elements of the feature channel 32 with the identifier i, denoted by C_X_i.
　　　　Equation 10

　　　　
　　　　In the equation (10), X_i represents a random variable whose value is randomly selected from the feature channel C_X_i.

　　　　Note that the mean values μ_i and μ_j in the mean vector can be used as μ_C_X_i and μ_C_X_j in the equation (1) to compute the correlation between the feature channels C_X_i and C_X_j. The standard deviations σ_X_i and σ_X_j in the covariance matrix can be used to compute the mutual information between the feature channels C_X_i and C_X_j as described in the equation (2). The covariance cov(X_i,X_j) in the covariance matrix can be used to compute the correlation between the feature channels C_X_i and C_X_j as described in the equation (2).

　　　　In order to train the mean vector and the covariance matrix, the update unit 2080 computes losses for each of them. Hereinafter, the loss regarding the mean vector is denoted by L_mean, whereas the loss regarding the covariance matrix is denoted by L_var. For example, L_mean and L_var are computed using the following equations (11) and (12), respectively:
　　　　Equation 11

　　　　　
　　　　Equation 12

　　　　The update unit 2060 may compute the overall loss of the target neural network 20 taking L_mean and L_var into account. Specifically, for example, the overall loss is computed using the following equation (13):
　　　　Equation 13

　　　　　
　　　　Note that coefficients λ3 and λ4 are pre-determined real numbers larger than 0.

　　　　The update unit 2060 updates the trainable parameters, including the mean vector and the covariance matrix, based on the overall loss computed using the equation (13). Regarding the mean vector and the covariance matrix, for example, the update unit 2060 computes gradients of the overall loss with respect to each parameter in the mean vector and the covariance matrix, and updates each parameter based on the gradient computed with respect to that parameter.

　　　　<Example of Experimental Result>
　　　　Hereinafter, experimental results regarding example implementations of the training apparatus 2000 will be described. The training apparatus 2000 is tested on two different types of tasks: 2D and 3D object detection. The task in object detection is to decide whether an object is present or not. In addition, in the case where an object is detected, it is also determined what type of object the detected one is and where the detected object is located. As the name suggests, 2D object detection detects objects in 2D images and two different datasets are used in this study: Balloon dataset and COCO dataset. The input to a 3D object detector is LiDAR point cloud data and the KITTI dataset is used. All experiments are run on a "Tesla V100-PCIE-16GB" GPU. The results for each of these is discussed in this section.

　　　　<<2D object detection>>
　　　　Two different datasets are used for the 2D objection detection task: COCO dataset and Balloon dataset. The COCO dataset is "a large-scale object detection, segmentation and captioning dataset" containing 80 object classes. The training dataset contains 117,266 images and the test data contains 5,000 images. Balloon dataset is a smaller dataset with only one object class: Balloons. There are 61 images in the training dataset and 13 images in the test dataset.

　　　　The Detectron2 package, written in pytorch, is used in the experiments. The model used is Faster RCNN with a ResNet-50 backbone. The optimizer used is stochastic gradient descent (SGD) with momentum.

　　　　The main evaluation metric used is Average Precision (AP) which is defined as the area under the precision-recall curve. Intersection-over-Union (IoU) threshold is used to decide whether a bounding box detected actually belongs to an object or not. Using these predictions, precision and recall are computed which are then used to compute the AP. The techniques to compute the area under the precision -recall curve have been defined slightly differently for different challenges like PASCAL VOC, COCO. Here, we follow the COCO standard which involves computing the following metrics:
　　　　1. AP (or mean AP (mAP)): mean of the AP at IoU thresholds varying from 0.5 to 0.95 with a step size of 0.05
　　　　2. AP50: AP at IoU threshold = 0.5
　　　　3. AP75: AP at IoU threshold = 0.75
　　　　4. APs: AP for small objects (area < 32²)
　　　　5. APm: AP for medium objects (32²< area < 96²)
　　　　6. APl: AP for large objects (area > 96²)

　　　　Fig. 5 shows results of the experiment using the Balloon dataset. In this experiment, three types of neural networks are compared: Original network, Prior Art network, and Example Embodiment network. The Original network is a neural network that is trained only based on the task-specific loss; the redundancy in the network is not taken into consideration. The Prior Art network is a neural network that is trained in a way disclosed by NPL1. The Example Embodiment network is a neural network that is trained by an example implementation of the training apparatus 2000 in which the inter-feature redundancy is computed as mutual information between the feature channels 32.

　　　　Fig. 6 shows results of the experiment using the COCO dataset. In this experiment, two types of neural networks are compared: Original network and Example Embodiment network. The Example Embodiment network in this experiment is trained by another example implementation of the training apparatus 2000 in which the inter-feature redundancy is computed as a Gaussian kernel based distance between the feature channels 32.

　　　　<<3D object detection>>
　　　　The KITTI dataset is used for the 3D object detection task. Among the multiple classes present, the experiments focus only on detecting cars. In the experiments, 3,712 samples in the training data and 3,769 samples in the testing data are used. The Structure Aware Single Stage Detector (SA-SSD) network, which takes the LiDAR point cloud data as input, is used in the experiments. Average Precision (AP) at Intersection over Union (IoU) threshold of 0.7 and computed as the specified standard (PASCAL standard), is used for evaluation.

　　　　Figs. 7 and 8 show experimental results regarding 3D object detection. Specifically, in the experiments whose results shown in Fig. 7, the 3D object detection is performed on 3D data. On the other hand, in the experiments whose results shown in Fig. 8, the 3D object detection is performed on BEV (Bird's Eye View) data.

　　　　In these experiments, five example implementations of the training apparatus 2000 are compared with the Original network. The network denoted by "Correlation (Direct)" is a neural network that is trained by an example implementation of the training apparatus 2000 in which the inter-feature redundancy is computed as a correlation between the feature channels 32. The network denoted by "Corr + Gaussian Kernel (Joint)" is a neural network that is trained by an example implementation of the training apparatus 2000 in which the inter-feature redundancy is computed as the summation of the correlation of the feature channels 32 and the Gaussian kernel based distance between the feature channels 32, and the parameters of the correlation are trained as explained with the equation (9) and (10). The network denoted by "Corr (Joint)" is a neural network that is trained by an example implementation of the training apparatus 2000 in which the correlation between the feature channels 32 is used as the inter-feature redundancy and the parameters of the correlation are trained as explained with the equation (9) and (10). The network denoted by "Mutual Info (Direct)" is a neural network that is trained by an implementation of the training apparatus 2000 in which the inter-feature redundancy is computed as mutual information between the feature channels 32. The network denoted by "Gaussian Kernel" is a neural network that is trained by an implementation of the training apparatus 2000 in which the inter-feature redundancy is computed as a Gaussian kernel based distance between the feature channels 32 where gamma is set to 1.

　　　　Although the present disclosure is explained above with reference to example embodiments, the present disclosure is not limited to the above-described example embodiments. Various modifications that can be understood by those skilled in the art can be made to the configuration and details of the present disclosure within the scope of the invention.

　　　　The whole or part of the example embodiments disclosed above can be described as, but not limited to, the following supplementary notes.
　　　　<Supplementary notes>
　　(Supplementary Note 1)
　　A training apparatus comprising:
　　at least one processor; and
　　memory storing instructions,
　　wherein the at least one processor is configured to execute the instructions to:
　　　　acquire a training dataset including an input data;
　　　　input the input data into a target neural network that includes a plurality of feature extraction layers, the feature extraction layer generating a feature map including a plurality of feature channels;
　　　　acquire one or more of the feature maps from the target neural network;
　　　　compute inter-feature redundancy or intra-feature redundancy regarding the acquired feature map, the inter-feature redundancy representing redundancy between the feature channels in a same feature map, the intra-feature redundancy representing redundancy in the feature channel; and
　　　　train the target neural network using the computed inter-feature redundancy, the computed intra-feature redundancy, or both,
　　wherein the inter-feature redundancy is not computed as a correlation between the feature channels.
　　(Supplementary Note 2)
　　The training apparatus according to Supplementary Note 1,
　　wherein mutual information between the feature channels or a kernel-based distance between the feature channels is computed as the inter-feature redundancy between the feature channels.
　　(Supplementary Note 3)
　　The training apparatus according to Supplementary Note 1 or 2,
　　wherein an entropy of the feature channel or a sparsity of the feature channel is computed as the intra-feature redundancy of the feature channels.
　　(Supplementary Note 4)
　　The training apparatus according to any one of Supplementary Notes 1 to 3,
　　wherein the training dataset further includes a ground truth data that represents an ideal output of the target neural network in a case where the input data included in that training dataset is input into the target neural network,
　　the training of the target neural network includes:
　　　　computing an overall loss that is a sum of a task-specific loss and the computed redundancy, the task-specific loss representing a difference between an actual output of the target neural network in response to the input of the input data and the ground truth data corresponding to that input data; and
　　　　updating trainable parameters of the target neural network based on the overall loss.
　　(Supplementary Note 5)
　　The training apparatus according to Supplementary Note 4,
　　wherein the feature channels in the same feature map are modelled by a probability distribution that is defined by one or more parameters,
　　the training of the target neural network further includes:
　　　　computing a loss for each of the one or more parameters of the probability distribution;
　　　　adding the loss for each of the one or more parameters of the probability distribution into the overall loss; and
　　　　updating the one or more parameters of the probability distribution based on the overall loss.
　　(Supplementary Note 6)
　　A control method performed by a computer, comprising:
　　acquiring a training dataset including an input data;
　　inputting the input data into a target neural network that includes a plurality of feature extraction layers, the feature extraction layer generating a feature map including a plurality of feature channels;
　　acquiring one or more of the feature maps from the target neural network;
　　computing inter-feature redundancy or intra-feature redundancy regarding the acquired feature map, the inter-feature redundancy representing redundancy between the feature channels in a same feature map, the intra-feature redundancy representing redundancy in the feature channel; and
　　training the target neural network using the computed inter-feature redundancy, the computed intra-feature redundancy, or both,
　　wherein the inter-feature redundancy is not computed as a correlation between the feature channels.
　　(Supplementary Note 7)
　　The control method according to Supplementary Note 6,
　　wherein mutual information between the feature channels or a kernel-based distance between the feature channels is computed as the inter-feature redundancy between the feature channels.
　　(Supplementary Note 8)
　　The control method according to Supplementary Note 6 or 7,
　　wherein an entropy of the feature channel or a sparsity of the feature channel is computed as the intra-feature redundancy of the feature channels.
　　(Supplementary Note 9)
　　The control method according to any one of Supplementary Notes 6 to 8,
　　wherein the training dataset further includes a ground truth data that represents an ideal output of the target neural network in a case where the input data included in that training dataset is input into the target neural network,
　　the training of the target neural network includes:
　　　　computing an overall loss that is a sum of a task-specific loss and the computed redundancy, the task-specific loss representing a difference between an actual output of the target neural network in response to the input of the input data and the ground truth data corresponding to that input data; and
　　　　updating trainable parameters of the target neural network based on the overall loss.
　　(Supplementary Note 10)
　　The control method according to Supplementary Note 9,
　　wherein the feature channels in the same feature map are modelled by a probability distribution that is defined by one or more parameters,
　　the training of the target neural network further includes:
　　　　computing a loss for each of the one or more parameters of the probability distribution;
　　　　adding the loss for each of the one or more parameters of the probability distribution into the overall loss; and
　　　　updating the one or more parameters of the probability distribution based on the overall loss.
　　(Supplementary Note 11)
　　A computer-readable storage medium storing a program that causes a computer to perform:
　　acquiring a training dataset including an input data;
　　inputting the input data into a target neural network that includes a plurality of feature extraction layers, the feature extraction layer generating a feature map including a plurality of feature channels;
　　acquiring one or more of the feature maps from the target neural network;
　　computing inter-feature redundancy or intra-feature redundancy regarding the acquired feature map, the inter-feature redundancy representing redundancy between the feature channels in a same feature map, the intra-feature redundancy representing redundancy in the feature channel; and
　　training the target neural network using the computed inter-feature redundancy, the computed intra-feature redundancy, or both,
　　wherein the inter-feature redundancy is not computed as a correlation between the feature channels.
　　(Supplementary Note 12)
　　The storage medium according to Supplementary Note 11,
　　wherein mutual information between the feature channels or a kernel-based distance between the feature channels is computed as the inter-feature redundancy between the feature channels.
　　(Supplementary Note 13)
　　The storage medium according to

Supplementary Note

11 or 12,
　　wherein an entropy of the feature channel or a sparsity of the feature channel is computed as the intra-feature redundancy of the feature channels.
　　(Supplementary Note 14)
　　The storage medium according to any one of Supplementary Notes 11 to 13,
　　wherein the training dataset further includes a ground truth data that represents an ideal output of the target neural network in a case where the input data included in that training dataset is input into the target neural network,
　　the training of the target neural network includes:
　　　　computing an overall loss that is a sum of a task-specific loss and the computed redundancy, the task-specific loss representing a difference between an actual output of the target neural network in response to the input of the input data and the ground truth data corresponding to that input data; and
　　　　updating trainable parameters of the target neural network based on the overall loss.
　　(Supplementary Note 15)
　　The storage medium according to Supplementary Note 14,
　　wherein the feature channels in the same feature map are modelled by a probability distribution that is defined by one or more parameters,
　　the training of the target neural network further includes:
　　　　computing a loss for each of the one or more parameters of the probability distribution;
　　　　adding the loss for each of the one or more parameters of the probability distribution into the overall loss; and
　　　　updating the one or more parameters of the probability distribution based on the overall loss.

10 training dataset
12 input data
14 ground truth data
20 target neural network
30 feature map
32 feature channel
1000 computer
1020 bus
1040 processor
1060 memory
1080 storage device
1100 input/output interface
1120 network interface
2000 training apparatus
2020 training dataset acquisition unit
2040 feature map acquisition unit
2060 redundancy computation unit
2080 update unit

Claims

　　A training apparatus comprising:
　　at least one processor; and
　　memory storing instructions,
　　wherein the at least one processor is configured to execute the instructions to:
　　　　acquire a training dataset including an input data;
　　　　input the input data into a target neural network that includes a plurality of feature extraction layers, the feature extraction layer generating a feature map including a plurality of feature channels;
　　　　acquire one or more of the feature maps from the target neural network;
　　　　compute inter-feature redundancy or intra-feature redundancy regarding the acquired feature map, the inter-feature redundancy representing redundancy between the feature channels in a same feature map, the intra-feature redundancy representing redundancy in the feature channel; and
　　　　train the target neural network using the computed inter-feature redundancy, the computed intra-feature redundancy, or both,
　　wherein the inter-feature redundancy is not computed as a correlation between the feature channels.
　　The training apparatus according to claim 1,
　　wherein mutual information between the feature channels or a kernel-based distance between the feature channels is computed as the inter-feature redundancy between the feature channels.
　　The training apparatus according to claim 1 or 2,
　　wherein an entropy of the feature channel or a sparsity of the feature channel is computed as the intra-feature redundancy of the feature channels.
　　The training apparatus according to any one of claims 1 to 3,
　　wherein the training dataset further includes a ground truth data that represents an ideal output of the target neural network in a case where the input data included in that training dataset is input into the target neural network,
　　the training of the target neural network includes:
　　　　computing an overall loss that is a sum of a task-specific loss and the computed redundancy, the task-specific loss representing a difference between an actual output of the target neural network in response to the input of the input data and the ground truth data corresponding to that input data; and
　　　　updating trainable parameters of the target neural network based on the overall loss.
　　The training apparatus according to claim 4,
　　wherein the feature channels in the same feature map are modelled by a probability distribution that is defined by one or more parameters,
　　the training of the target neural network further includes:
　　　　computing a loss for each of the one or more parameters of the probability distribution;
　　　　adding the loss for each of the one or more parameters of the probability distribution into the overall loss; and
　　　　updating the one or more parameters of the probability distribution based on the overall loss.
　　A control method performed by a computer, comprising:
　　acquiring a training dataset including an input data;
　　inputting the input data into a target neural network that includes a plurality of feature extraction layers, the feature extraction layer generating a feature map including a plurality of feature channels;
　　acquiring one or more of the feature maps from the target neural network;
　　computing inter-feature redundancy or intra-feature redundancy regarding the acquired feature map, the inter-feature redundancy representing redundancy between the feature channels in a same feature map, the intra-feature redundancy representing redundancy in the feature channel; and
　　training the target neural network using the computed inter-feature redundancy, the computed intra-feature redundancy, or both,
　　wherein the inter-feature redundancy is not computed as a correlation between the feature channels.
　　A computer-readable storage medium storing a program that causes a computer to perform:
　　acquiring a training dataset including an input data;
　　inputting the input data into a target neural network that includes a plurality of feature extraction layers, the feature extraction layer generating a feature map including a plurality of feature channels;
　　acquiring one or more of the feature maps from the target neural network;
　　computing inter-feature redundancy or intra-feature redundancy regarding the acquired feature map, the inter-feature redundancy representing redundancy between the feature channels in a same feature map, the intra-feature redundancy representing redundancy in the feature channel; and
　　training the target neural network using the computed inter-feature redundancy, the computed intra-feature redundancy, or both,
　　wherein the inter-feature redundancy is not computed as a correlation between the feature channels.