CN110462637B - Neural network data processing device and method - Google Patents

Info

Publication number: CN110462637B (granted publication of CN110462637A)
Application number: CN201780088904.6A
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: array, input data, neural network, data values, network layer
Inventor: 亚采克·科涅奇
Original and current assignee: Huawei Technologies Co Ltd
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Abstract

The invention relates to a data processing apparatus (100) comprising a processor (101) for providing a neural network (110), wherein the neural network (110) comprises a neural network layer (120) for generating an array of output data values (121) from an array of input data values (117), based on a plurality of position-dependent kernels (118) and a plurality of sub-arrays of the array of input data values (117). The invention also relates to a corresponding data processing method.

Description

Neural network data processing device and method
Technical Field
The present invention relates generally to the field of neural network based machine learning or deep learning. More particularly, the present invention relates to a neural network data processing apparatus and method, and in particular to data processing in the fields of audio processing, computer vision, image or video processing, classification, detection and/or identification.
Background
Weighted aggregation is a process of combining input data so as to condense information present in a larger spatial region into a single spatial location. It takes additional inputs in the form of aggregation weights that control the effect of each input data value on the result, and it is commonly used in many signal processing applications, such as image processing methods for image quality improvement, depth or disparity estimation, and many others [Kaiming He, Jian Sun, Xiaoou Tang, "Guided image filtering", ECCV 2010].
In the deep learning field, a common approach recently adopted in many application areas is to employ convolutional neural networks. Typically, one particular part of such a convolutional neural network is at least one convolutional layer, which convolves the input data values with a kernel K learned during training and generates one output data value for each convolution kernel at each output position [J. Long, E. Shelhamer, T. Darrell, "Fully convolutional networks for semantic segmentation", CVPR 2015]. For example, for the two-dimensional case used in the image processing field, the convolution with the learned kernel K can be expressed mathematically as follows:
out(x, y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} K(i, j) \cdot in(x - i, y - j) + B
where out(x, y) denotes the array of output data values, in(x - i, y - j) denotes a sub-array of the array of input data values, and K(i, j) denotes the kernel comprising a (2r + 1) × (2r + 1) array of kernel weights or kernel values. B denotes an optional learned bias term that may be added to obtain each output data value. The weights of the kernel K are the same for the entire array in(x, y) of input data values and are typically learned during the neural network training phase; if a first-order approach is employed, training comprises iteratively backpropagating the gradient of the neural network output to the input layers and updating the weights of all network layers according to the partial derivatives computed in this manner.
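For reference, the position-independent convolution above can be sketched directly in NumPy. The function name, the zero padding at the borders, and the loop-based implementation are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def conv2d_learned_kernel(inp, K, B=0.0):
    """Plain 2D convolution with a single learned (2r+1) x (2r+1) kernel K:
    out(x, y) = sum_{i,j=-r..r} K(i, j) * in(x - i, y - j) + B."""
    r = K.shape[0] // 2
    H, W = inp.shape
    padded = np.pad(inp, r)                # zero-pad the border
    out = np.zeros((H, W))
    for x in range(H):
        for y in range(W):
            for i in range(-r, r + 1):
                for j in range(-r, r + 1):
                    # in(x - i, y - j) expressed in padded coordinates
                    out[x, y] += K[i + r, j + r] * padded[x - i + r, y - j + r]
            out[x, y] += B
    return out
```

Note that the same kernel K is applied at every position (x, y); this is exactly the position independence that the position-dependent kernels of the invention remove.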
Disclosure of Invention
The invention aims to provide an improved data processing device and method based on a neural network.
The above and other objects are achieved by the subject matter of the independent claims. Further implementations are apparent from the dependent claims, the description and the drawings.
In general, embodiments of the invention provide a new method for weighted aggregation of neural network data, implemented in a neural network as a new neural network layer. The neural network layer may compute aggregated data using individual aggregation weights determined for each individual spatial location. The aggregation weights may be computed as a function of learned similarity and weight kernels, so that a separate aggregation weight is obtained for each output spatial location. In this way, the aggregation weights can be made more adaptive to the input data by using various complex location-dependent or location-adaptive kernels learned by the neural network.
More particularly, according to a first aspect, the invention relates to a data processing apparatus comprising one or more processors for providing a neural network. For example, the data to be processed using the data processing device may be two-dimensional image or video data or one-dimensional audio data.
The neural network provided by the one or more processors of the data processing apparatus comprises a neural network layer for processing an array of input data values, e.g. a two-dimensional array in (x, y) of input data values, into an array of output data values, e.g. a two-dimensional array out (x, y) of output data values. The neural network layer may be a first layer of the neural network or an intermediate layer.
The input data value array may be a one-dimensional array (i.e., a vector, such as audio or otherwise, such as a time series), a two-dimensional array (i.e., a matrix, such as an image or other temporal or spatial series), or an N-dimensional array (e.g., any type of N-dimensional feature array provided by conventional preprocessing or feature extraction and/or other layers of a neural network).
The array of input data values may have one or more channels, for example, for an RGB image, one R channel, one G channel, and one B channel; or for black/white images, there is only one grayscale or intensity channel. The term "channel" may refer to any "feature," such as a feature obtained from conventional preprocessing or feature extraction, or from other neural networks or neural network layers of the same neural network. For example, the array of input data values may comprise a two-dimensional RGB or grayscale image or video data representing at least a portion of an image, or a one-dimensional audio signal. If the neural network layer is implemented as an intermediate layer of the neural network, the array of input data values may be any type of array of features generated by the first few layers of the neural network by means of feature extraction on the basis of an initial, e.g. original, array of input data values.
The neural network layer is configured to generate the array of output data values from the array of input data values, based on a plurality of position-dependent kernels (i.e., space-variant kernels) and a plurality of different sub-arrays located at different positions of the array of input data values. Each kernel comprises a plurality of kernel values or kernel weights. A respective kernel is applied to a respective sub-array of the array of input data values to generate a single output data value.
As used herein, a "location-dependent kernel" refers to a kernel whose kernel weight depends on the corresponding location of the subarray of input data values, e.g., (x, y) for a two-dimensional array. In other words, for a first kernel, the kernel value employed for a first sub-array of the array of input data values may be different than the kernel value employed for a second sub-array of the array of input data values. In a two-dimensional array, the position may be a spatial position defined, for example, in terms of two spatial coordinates. In a one-dimensional array, the position may be a temporal position defined, for example, according to a time coordinate.
Thus, a more sophisticated data processing device based on neural networks is provided. The data processing apparatus allows the input data to be aggregated in a manner that better reflects bilateral data similarity, i.e. input data values that are spatially closer and more similar to the input data value at the kernel center have a greater impact on the resulting output data value. Furthermore, the data processing apparatus allows the kernel weights to be adjusted to different spatial locations of the array of input data values. This in turn allows, for example, minimizing the impact of certain input data values on the result, such as input data values associated with another part of the scene (determined by semantic segmentation) or with a different object being analyzed.
In another implementation form of the first aspect, the neural network comprises at least one additional network layer configured to generate the plurality of position-dependent kernels based on an original array of raw input data values of the neural network, wherein the original array comprises the array of input data values or another array of input data values associated with the array of input data values.
In another implementation form of the first aspect, the neural network is configured to generate the plurality of location-dependent kernels based on a plurality of learned location-independent kernels and a plurality of location-dependent weights. Typically, the location-independent kernels may be learned during neural network training, and the location-dependent weights or similarity features may be computed by, for example, another preceding network layer of the neural network. This implementation allows minimizing the amount of data transferred to the neural network layer to obtain the kernel values: the kernel values are not transmitted directly, but are computed from a plurality of location-dependent weights and/or similarity features, which substantially reduces the amount of data per element of the output data value array. This may minimize the amount of data that the neural network stores and transfers between different network layers, which is particularly important in a mini-batch-based training process, since the memory of the data processing device (e.g., GPU memory) is currently a major bottleneck.
In another implementation form of the first aspect, the neural network is configured to generate a kernel of the plurality of location-dependent kernels by summing the learned location-independent kernels, each weighted by an associated location-dependent weight (i.e., similarity feature) that is not learned. This implementation ensures that the plurality of location-dependent kernels are realized in a particularly efficient manner, namely as a linear combination of location-independent "base kernels".
In another implementation form of the first aspect, the plurality of location-independent kernels are predetermined or learned, and the neural network comprises at least one additional network layer or "conventional" preprocessing layer configured to generate the plurality of location-dependent weights (i.e., similarity features) based on an original array of raw input data values of the neural network, wherein the original array comprises the array of input data values or another array of input data values associated with it. In one implementation, the at least one additional neural network layer or "conventional" preprocessing layer may generate the plurality of location-dependent weights (i.e., similarity features) using, for example, bilateral filtering, semantic segmentation, instance object detection, or data importance indicators such as a region of interest (ROI).
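As a concrete illustration of how a "conventional" preprocessing layer might compute such similarity features, the following NumPy sketch derives one bilateral-style range feature per smoothing parameter, measuring how close each input value is to its 3 × 3 local mean. The function name, the local-mean window, and the sigma values are hypothetical; the patent does not prescribe a specific feature.

```python
import numpy as np

def similarity_features(img, sigmas=(0.05, 0.2, 0.8)):
    """Hypothetical location-dependent weights F_f(x, y): one bilateral-style
    range feature per sigma, based on the squared deviation of each value
    from its 3x3 local mean. Returns an (N_f, H, W) array."""
    H, W = img.shape
    padded = np.pad(img, 1, mode="edge")
    local_mean = np.zeros((H, W))
    for x in range(H):
        for y in range(W):
            local_mean[x, y] = padded[x:x + 3, y:y + 3].mean()
    diff2 = (img - local_mean) ** 2
    # Gaussian range term, as in bilateral filtering
    return np.stack([np.exp(-diff2 / (2.0 * s ** 2)) for s in sigmas])
```

On smooth regions the features approach 1 for every sigma, while near edges the small-sigma features drop quickly, so a downstream layer can weight its base kernels accordingly.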
In another implementation form of the first aspect, the array of input data values and the array of output data values are two-dimensional arrays, and the neural network is configured to generate the plurality of position-dependent kernels w_L(x, y, i, j) based on the following equation:

w_L(x, y, i, j) = \sum_{f=1}^{N_f} F_f(x, y) \cdot K_f(i, j)

where F_f(x, y) denotes the N_f location-dependent weights (i.e., similarity features) and K_f(i, j) denotes the plurality of location-independent kernels.
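The linear combination above maps directly onto a single einsum; the array shapes assumed here (N_f feature maps of size H × W and N_f base kernels of size (2r+1) × (2r+1)) are the sketch's own conventions.

```python
import numpy as np

def position_dependent_kernels(F, K):
    """w_L(x, y, i, j) = sum_{f=1}^{N_f} F_f(x, y) * K_f(i, j).
    F: (N_f, H, W) similarity features; K: (N_f, 2r+1, 2r+1) base kernels.
    Returns the position-dependent kernels as an (H, W, 2r+1, 2r+1) array."""
    return np.einsum('fxy,fij->xyij', F, K)
```

Note that only N_f scalars per position (the values F_f(x, y)) need to be stored or transmitted rather than (2r+1)² kernel values per position, which is the data-reduction argument made in the implementation form above.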
In another implementation form of the first aspect, the neural network layer is a convolutional network layer or an aggregation network layer.
In another implementation form of the first aspect, the input data value array and the output data value array are two-dimensional arrays, wherein the input data value array in(x, y, c_i) has C_i different channels, and the neural network layer is a convolutional network layer for generating the array of output data values out(x, y, c_o) based on the following equation:

out(x, y, c_o) = \frac{1}{W_L(x, y, c_o)} \sum_{c_i=1}^{C_i} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x, y, c_o, c_i, i, j) \cdot in(x - i, y - j, c_i)

W_L(x, y, c_o) = \sum_{c_i=1}^{C_i} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x, y, c_o, c_i, i, j)

where r denotes the size of each kernel of the plurality of position-dependent kernels w_L(x, y, c_o, c_i, i, j) and W_L(x, y, c_o) denotes a normalization factor. In one implementation, the normalization factor W_L(x, y, c_o) may be set to 1.
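A direct, loop-based NumPy sketch of this convolutional layer follows. The argument layout of `wL` (indexed `wL[x, y, co, ci, i+r, j+r]`), the zero padding, and the function name are assumptions made for illustration.

```python
import numpy as np

def pd_conv_layer(inp, wL, normalize=False):
    """Sketch of a convolutional layer with position-dependent kernels.
    inp: (H, W, C_i); wL: (H, W, C_o, C_i, 2r+1, 2r+1).
    out(x,y,c_o) = sum_{c_i} sum_{i,j=-r..r} wL(x,y,c_o,c_i,i,j) * in(x-i,y-j,c_i),
    optionally divided by W_L(x, y, c_o), the sum of the kernel values."""
    H, W, Ci = inp.shape
    Co, r = wL.shape[2], wL.shape[4] // 2
    padded = np.pad(inp, ((r, r), (r, r), (0, 0)))
    out = np.zeros((H, W, Co))
    for x in range(H):
        for y in range(W):
            # Flip the window so that patch[i+r, j+r, c] == in(x-i, y-j, c)
            patch = padded[x:x + 2*r + 1, y:y + 2*r + 1, :][::-1, ::-1, :]
            for co in range(Co):
                acc = np.einsum('cij,ijc->', wL[x, y, co], patch)
                out[x, y, co] = acc / wL[x, y, co].sum() if normalize else acc
    return out
```

As the text notes, the normalization is usually unnecessary for a convolutional layer, hence `normalize=False` by default in this sketch.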
In another implementation form of the first aspect, the input data value array and the output data value array are two-dimensional arrays, wherein the input data value array in(x, y) has only a single channel, and the neural network layer is an aggregation network layer for generating the output data value array out(x, y) based on the following equation:

out(x, y) = \frac{1}{W_L(x, y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x, y, i, j) \cdot in(x - i, y - j)

W_L(x, y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x, y, i, j)

where r denotes the size of each kernel of the plurality of position-dependent kernels w_L(x, y, i, j) and W_L(x, y) denotes a normalization factor. In one implementation, the normalization factor W_L(x, y) may be set to 1.
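The single-channel weighted aggregation layer, including its normalization factor, can be sketched analogously; the kernel layout and zero padding are again illustrative assumptions.

```python
import numpy as np

def pd_aggregation_layer(inp, wL):
    """Sketch of a weighted aggregation layer. inp: (H, W);
    wL: (H, W, 2r+1, 2r+1), indexed wL[x, y, i+r, j+r].
    out(x,y) = (1 / W_L(x,y)) * sum_{i,j=-r..r} wL(x,y,i,j) * in(x-i, y-j),
    with W_L(x,y) = sum_{i,j} wL(x,y,i,j)."""
    H, W = inp.shape
    r = wL.shape[2] // 2
    padded = np.pad(inp, r)
    out = np.zeros((H, W))
    for x in range(H):
        for y in range(W):
            # Flip the window so that patch[i+r, j+r] == in(x-i, y-j)
            patch = padded[x:x + 2*r + 1, y:y + 2*r + 1][::-1, ::-1]
            norm = wL[x, y].sum()
            out[x, y] = (wL[x, y] * patch).sum() / norm
    return out
```

Because of the division by W_L(x, y), the layer computes a position-adaptive weighted average, which preserves the mean or DC component as discussed later in the detailed description.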
In another implementation form of the first aspect, the neural network layer is a correlation network layer configured to generate the array of output data values from the array of input data values and another array of input data values, either (a) by correlating the array of input data values with the other array of input data values and applying a position-dependent kernel of the plurality of position-dependent kernels, or (b) by correlating the array of input data values with the other array of input data values and applying both a position-dependent kernel of the plurality of position-dependent kernels associated with the array of input data values and another position-dependent kernel of a plurality of other position-dependent kernels associated with the other array of input data values.
In another implementation form of the first aspect, the input data value array in1(x, y), the other input data value array in2(x, y) and the plurality of position-dependent kernels w_{L1}(x, y, i, j) are all two-dimensional arrays, wherein the correlation network layer is configured to generate the output data value array out(x, y) based on the following equation:

out(x, y) = \frac{1}{W_L(x, y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_{L1}(x, y, i, j) \cdot in1(x - i, y - j) \cdot in2(x - i, y - j)

W_L(x, y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_{L1}(x, y, i, j)

where r denotes the size of each kernel of the plurality of position-dependent kernels w_{L1}(x, y, i, j) and W_L(x, y) denotes a normalization factor. In one implementation, the normalization factor W_L(x, y) may be set to 1.
In another implementation form of the first aspect, the input data value array in1(x, y), the other input data value array in2(x, y), the plurality of position-dependent kernels w_{L1}(x, y, i, j) and the plurality of other position-dependent kernels w_{L2}(x, y, i, j) are all two-dimensional arrays, wherein the correlation network layer is configured to generate the output data value array out(x, y) based on the following equation:

out(x, y) = \frac{1}{W_{L12}(x, y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_{L1}(x, y, i, j) \cdot in1(x - i, y - j) \cdot w_{L2}(x, y, i, j) \cdot in2(x - i, y - j)

W_{L12}(x, y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_{L1}(x, y, i, j) \cdot w_{L2}(x, y, i, j)

where r denotes the size of each kernel of the plurality of position-dependent kernels w_{L1}(x, y, i, j) and of the plurality of other position-dependent kernels w_{L2}(x, y, i, j), and W_{L12}(x, y) denotes a normalization factor. In one implementation, the normalization factor may be set to 1.
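Under the assumption that the two-kernel correlation takes a product form over co-located sub-arrays of in1 and in2, each weighted by its own position-dependent kernel (a reconstruction from the surrounding text, since the original equations were image placeholders), the layer can be sketched as follows; all names and the zero padding are illustrative.

```python
import numpy as np

def pd_correlation_layer(in1, in2, wL1, wL2):
    """Sketch of a correlation layer with two sets of position-dependent
    kernels. in1, in2: (H, W); wL1, wL2: (H, W, 2r+1, 2r+1).
    out(x,y) = (1 / W_L12(x,y)) * sum_{i,j=-r..r}
               wL1(x,y,i,j)*in1(x-i,y-j) * wL2(x,y,i,j)*in2(x-i,y-j),
    with W_L12(x,y) = sum_{i,j} wL1(x,y,i,j) * wL2(x,y,i,j)."""
    H, W = in1.shape
    r = wL1.shape[2] // 2
    p1, p2 = np.pad(in1, r), np.pad(in2, r)
    out = np.zeros((H, W))
    for x in range(H):
        for y in range(W):
            # Co-located, flipped windows of the two input arrays
            a = p1[x:x + 2*r + 1, y:y + 2*r + 1][::-1, ::-1]
            b = p2[x:x + 2*r + 1, y:y + 2*r + 1][::-1, ::-1]
            norm = (wL1[x, y] * wL2[x, y]).sum()
            out[x, y] = (wL1[x, y] * a * wL2[x, y] * b).sum() / norm
    return out
```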
In another implementation form of the first aspect, the neural network layer is configured to generate a respective output data value of the array of output data values by determining, in the respective sub-array of input data values of the plurality of sub-arrays of input data values, the input data value associated with the maximum or minimum kernel value of the position-dependent kernel, and using the respective determined input data value as the respective output data value.
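This max/min selection variant behaves like a pooling operation whose selected position is steered by the position-dependent kernel. A minimal sketch, assuming the same kernel layout and padding conventions as in the sketches above:

```python
import numpy as np

def pd_selection_layer(inp, wL, use_max=True):
    """For each output position, select the input value at the sub-array
    position where the position-dependent kernel attains its maximum
    (or minimum) value. inp: (H, W); wL: (H, W, 2r+1, 2r+1)."""
    H, W = inp.shape
    r = wL.shape[2] // 2
    padded = np.pad(inp, r)
    out = np.zeros((H, W))
    for x in range(H):
        for y in range(W):
            patch = padded[x:x + 2*r + 1, y:y + 2*r + 1][::-1, ::-1]
            k = wL[x, y]
            idx = np.unravel_index(k.argmax() if use_max else k.argmin(),
                                   k.shape)
            out[x, y] = patch[idx]   # the input value at the selected offset
    return out
```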
According to a second aspect, the invention relates to a corresponding data processing method comprising: generating an array of output data values from an array of input data values using a neural network layer of a neural network, based on a plurality of position-dependent kernels and a plurality of sub-arrays of the array of input data values.
In another implementation form of the second aspect, the method includes: a further step of generating a position dependent kernel of a plurality of position dependent kernels with an additional neural network layer of the neural network based on an original array of original input data values of the neural network, wherein the original array of original input data values of the neural network comprises the array of input data values or another array of input data values associated with the array of input data values.
In another implementation of the second aspect, the step of generating a location-dependent kernel of the plurality of location-dependent kernels comprises: generating the location-dependent kernel based on a plurality of location-independent kernels and a plurality of location-dependent weights.
In another implementation of the second aspect, the step of generating a kernel of the plurality of location-dependent kernels comprises: summing the location-independent kernels, each weighted by its associated location-dependent weight.
In another implementation of the second aspect, the plurality of location-independent kernels are predetermined or learned, and the step of generating the plurality of location-dependent weights comprises: generating the plurality of position-dependent weights with an additional neural network layer or a preprocessing layer of the neural network based on an original array of raw input data values of the neural network, wherein the original array comprises the array of input data values or another array of input data values associated with the array of input data values.
In another implementation form of the second aspect, the input data value array and the output data value array are two-dimensional arrays, and the step of generating a kernel of the plurality of position-dependent kernels w_L(x, y, i, j) is based on the following equation:

w_L(x, y, i, j) = \sum_{f=1}^{N_f} F_f(x, y) \cdot K_f(i, j)

where F_f(x, y) denotes the N_f location-dependent weights (i.e., similarity features) and K_f(i, j) denotes the plurality of location-independent kernels.
In another implementation form of the second aspect, the neural network layer is a convolutional network layer or an aggregation network layer.
In another implementation of the second aspect, the input data value array and the output data value array are two-dimensional arrays, and the neural network layer is a convolutional network layer, wherein the step of generating the output data value array is based on the following equation:

out(x, y, c_o) = \frac{1}{W_L(x, y, c_o)} \sum_{c_i=1}^{C_i} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x, y, c_o, c_i, i, j) \cdot in(x - i, y - j, c_i)

W_L(x, y, c_o) = \sum_{c_i=1}^{C_i} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x, y, c_o, c_i, i, j)

or, wherein the neural network layer is an aggregation network layer, and the step of generating the output data value array is based on the following equation:

out(x, y) = \frac{1}{W_L(x, y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x, y, i, j) \cdot in(x - i, y - j)

W_L(x, y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x, y, i, j)

In one implementation, the normalization factor W_L(x, y, c_o) or W_L(x, y) may be set to 1.
In another implementation of the second aspect, the neural network layer is a correlation network layer, and the step of generating the output data value array comprises: (a) generating the array of output data values from the array of input data values and another array of input data values by correlating the two arrays and applying a position-dependent kernel of the plurality of position-dependent kernels; or (b) generating the array of output data values from the array of input data values and the other array of input data values by correlating the two arrays and applying both a position-dependent kernel of the plurality of position-dependent kernels associated with the array of input data values and another position-dependent kernel of a plurality of other position-dependent kernels associated with the other array of input data values.
In another implementation of the second aspect, the input data value array, the other input data value array and the kernels are two-dimensional arrays, and the step of generating the output data value array with the correlation network layer is based on the following equation:

out(x, y) = \frac{1}{W_L(x, y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_{L1}(x, y, i, j) \cdot in1(x - i, y - j) \cdot in2(x - i, y - j)

W_L(x, y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_{L1}(x, y, i, j)

or

out(x, y) = \frac{1}{W_{L12}(x, y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_{L1}(x, y, i, j) \cdot in1(x - i, y - j) \cdot w_{L2}(x, y, i, j) \cdot in2(x - i, y - j)

W_{L12}(x, y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_{L1}(x, y, i, j) \cdot w_{L2}(x, y, i, j)

In any of the foregoing implementations, the normalization factor W_L or W_{L12} may be set to 1.
In another implementation of the second aspect, the step of generating an output data value of the array of output data values using the neural network layer comprises: determining, in the respective sub-array of input data values of the plurality of sub-arrays of input data values, the input data value associated with the maximum or minimum kernel value of the position-dependent kernel, and using the determined input data value as the output data value.
According to a third aspect, the invention relates to a computer program comprising: program code, wherein the method according to the second aspect is performed when the program code is executed on a processor or a computer.
The present invention may be implemented in hardware and/or software or any combination thereof.
Drawings
Embodiments of the invention will be described in conjunction with the following drawings, in which:
FIG. 1 is a schematic diagram of a data processing apparatus based on a neural network according to an embodiment;
FIG. 2 is a diagram illustrating a neural network provided by a data processing apparatus according to an embodiment;
FIG. 3 is a schematic diagram illustrating the data reduction or aggregation concept implemented in a data processing apparatus according to an embodiment;
FIG. 4 is a diagram illustrating various aspects of a neural network provided by a data processing apparatus according to an embodiment;
FIG. 5 is a diagram illustrating various aspects of a neural network provided by a data processing apparatus according to an embodiment;
FIG. 6 is a diagram illustrating various processing steps of a data processing apparatus according to an embodiment;
FIG. 7 is a diagram illustrating a neural network provided by a data processing apparatus according to an embodiment;
FIG. 8 is a diagram illustrating various aspects of a neural network provided by a data processing apparatus according to an embodiment;
FIG. 9 is a diagram illustrating various processing steps of a data processing apparatus according to an embodiment;
fig. 10 is a flowchart illustrating a neural network data processing method according to an embodiment.
In the figures, identical or at least functionally equivalent features are provided with the same reference signs.
Detailed Description
Reference is now made to the accompanying drawings, which form a part hereof, and in which is shown by way of illustration specific aspects in which the invention may be practiced. It is to be understood that other aspects may be utilized and structural or logical changes may be made without departing from the scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims.
It will be appreciated that statements made in connection with a described method also hold true for a corresponding device or system configured to perform the method, and vice versa. For example, if a specific method step is described, the corresponding apparatus may comprise a unit for performing the described method step, even if such a unit is not elaborated or illustrated in the figures. Further, it is to be understood that features of the various exemplary aspects described herein may be combined with each other, unless specifically noted otherwise.
Fig. 1 shows a schematic diagram of a data processing apparatus 100 according to an embodiment, where the data processing apparatus 100 is configured to process data based on a neural network. To this end, the data processing apparatus 100 shown in fig. 1 comprises a processor 101. In an embodiment, the data processing apparatus 100 may be implemented as a distributed data processing apparatus 100, the distributed data processing apparatus 100 comprising a plurality of processors 101 shown in fig. 1.
The processor 101 of the data processing apparatus 100 is configured to provide a neural network 110. The neural network 110 comprises a neural network layer 120 configured to generate an array of output data values from an array of input data values, based on a plurality of sub-arrays of the array of input data values and a plurality of position-dependent kernels comprising a plurality of kernel values or kernel weights, as will be described in more detail below. As shown in fig. 1, the data processing apparatus 100 further comprises a memory 103 for storing and/or retrieving input data values, output data values and/or kernel values.
Each location-dependent kernel includes a plurality of kernel values or kernel weights. For a respective position or element of the array of input data values, a respective kernel is applied to a respective sub-array of the array of input data values to generate a single output data value. As used herein, a "location-dependent kernel" refers to a kernel whose kernel weights depend on the position of the sub-array of input data values to which the kernel is applied. In other words, the kernel values of a first kernel applied to a first sub-array of the array of input data values may differ from the kernel values of a second kernel applied to a second, different sub-array of the same array of input data values.
In a two-dimensional array, the position may be a spatial position defined, for example, in terms of two spatial coordinates (x, y). In a one-dimensional array, the position may be a temporal position defined, for example, according to a time coordinate (t).
The input data value array may be a one-dimensional array (i.e., a vector, such as audio or otherwise, such as a time series), a two-dimensional array (i.e., a matrix, such as an image or other temporal or spatial series), or an N-dimensional array (e.g., any type of N-dimensional feature array provided by conventional preprocessing or feature extraction and/or other layers of the neural network 110). The array of input data values may have one or more channels, for example, for an RGB image, one R channel, one G channel, and one B channel; or for black/white images, there is only one grayscale or intensity channel. The term "channel" may refer to any "feature," such as a feature obtained from conventional preprocessing or feature extraction, or from other neural networks or neural network layers of the neural network 110. For example, the array of input data values may comprise a two-dimensional RGB or grayscale image or video data representing at least a portion of an image, or a one-dimensional audio signal. If the neural network layer 120 is implemented as an intermediate layer of the neural network 110, the input data value array may be any type of feature array generated by the first few layers of the neural network by means of feature extraction on the basis of the initial raw array of input data values, as will be described in more detail below.
The neural network layer 120 may be implemented as an aggregation layer 120, which processes each channel of the array of input data values individually, e.g., generating one (scalar) R output value for a sub-array of the array of R input values. The location-dependent kernels may be channel-specific or shared by all channels. The neural network layer 120 may also be implemented as a convolutional layer, which "mixes" all channels of the array of input data values. For example, if the array of input data values is an RGB image, i.e. a multi-channel array, only one (scalar) output value is generated for the three channels (R, G and B) of the multi-channel array, based on three corresponding sub-arrays of the three input arrays (R, G and B). Again, the location-dependent kernels may be channel-specific or shared by all channels; in the case of a convolutional layer 120, the position-dependent kernels are typically multi-channel kernels. Furthermore, the neural network layer may be implemented as a correlation layer, which combines aggregation or convolution (of the input image with a weighting kernel) with an additional image, i.e. it correlates two identically located (co-located) sub-arrays of the two images and additionally applies the position-dependent kernel to the correlation result. In this case, too, the location-dependent kernels may be channel-specific or shared by all channels.
Fig. 2 is a schematic diagram illustrating elements of the neural network 110 provided by the data processing apparatus 100 according to an embodiment. In the embodiment illustrated in fig. 2, the neural network layer 120 is implemented as a weighted aggregation layer 120. In another embodiment, the neural network layer 120 may be implemented as a convolutional network layer 120 (also referred to as convolutional network layer 120), or may be implemented as a correlation network layer 120, as will be described in more detail below. As shown in fig. 2, in the present embodiment, the polymerization layer 120 is used to: a two-dimensional array of output data values out (x, y)121 is generated based on a corresponding subarray of the two-dimensional array of input data values in (x, y)117 and a plurality of position-dependent kernels 118 comprising a plurality of kernel values or kernel weights.
In one embodiment, the weighted aggregation layer 120 of the neural network 110 shown in fig. 2 is configured to generate the output data value array out(x, y) 121 from a plurality of sub-arrays of the two-dimensional input data value array in(x, y) 117 and the plurality of position-dependent kernels 118 comprising the kernel values w_L(x, y, i, j), using the following equation:

out(x, y) = \frac{1}{W_L(x, y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x, y, i, j) \cdot in(x - i, y - j)

where r denotes the size of each of the plurality of position-dependent kernels 118 (in this example, each kernel and each sub-array of the input value array have (2r + 1) × (2r + 1) kernel values and input values, respectively), and the output data values may be normalized using the following normalization factor:

W_L(x, y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x, y, i, j)
In other embodiments, the normalization factor may be omitted, i.e., set to 1. For example, if the neural network layer 120 is implemented as the convolutional network layer, the normalization factor may be omitted. For weighted aggregation, the normalization factor allows for maintaining an average or DC component. This is advantageous when the weighted aggregation layer 120 is used to aggregate stereo matching costs for stereo images, since normalization is helpful in ensuring that the output values of different sub-arrays of the array of input data values are comparable. In the case of the convolutional network layer 120, this is generally not necessary.
It will be appreciated that the above equations for a two-dimensional input array and square kernels are readily extended to the case of input value arrays 117 having one dimension or more than two dimensions and/or rectangular (e.g., non-square, with different horizontal and vertical sizes) kernels.
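The weighted aggregation described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the patented implementation: the function name, the edge padding, and the patch/kernel index orientation (the equation uses in(x−i, y−j), which amounts to a fixed flip convention) are assumptions made here for the example.

```python
import numpy as np

def weighted_aggregation(inp, kernels, normalize=True):
    """Weighted aggregation with position-dependent kernels.

    inp     : (H, W) array of input data values.
    kernels : (H, W, 2r+1, 2r+1) array; kernels[y, x] is the kernel
              applied at spatial position (x, y).
    Returns an (H, W) array of output data values.
    """
    H, W = inp.shape
    k = kernels.shape[2]
    r = k // 2
    padded = np.pad(inp, r, mode="edge")           # replicate borders
    out = np.empty((H, W))
    for y in range(H):
        for x in range(W):
            patch = padded[y:y + k, x:x + k]       # (2r+1, 2r+1) sub-array
            w = kernels[y, x]
            acc = np.sum(patch * w)
            if normalize:
                acc /= np.sum(w)                   # normalization factor W_L(x, y)
            out[y, x] = acc
    return out
```

With all-ones kernels and normalization enabled, a constant input is reproduced unchanged, illustrating how the normalization factor preserves the average (DC) component.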
In an embodiment in which the neural network layer 120 is implemented as a convolutional network layer, the array of input data values in(x, y, c_i) 117 has a plurality of channels c_i, for example in the case of RGB image data. The neural network layer 120 may then be configured to generate an array of output data values out(x, y, c_o) 121 having one or more channels, based on the plurality of two-dimensional arrays in(x, y, c_i) 117 of input data values in the different channels and the plurality of position-dependent kernels 118 comprising the kernel values w_L(x, y, c_o, c_i, i, j), using the following equation:

\[ \mathrm{out}(x,y,c_o) = \frac{1}{W_L(x,y,c_o)} \sum_{c_i} \sum_{i=-r}^{r} \sum_{j=-r}^{r} \mathrm{in}(x-i,\,y-j,\,c_i)\cdot w_L(x,y,c_o,c_i,i,j) \]

where c_i indexes the channels of the array of input data values in(x, y, c_i) 117, and the output data values may be normalized using the following normalization factor:

\[ W_L(x,y,c_o) = \sum_{c_i} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x,y,c_o,c_i,i,j) \]
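The multi-channel ("mixing") variant of the same computation can be sketched as follows. Array layouts, names, and the index orientation are assumptions of this example, not taken from the embodiment:

```python
import numpy as np

def positionwise_conv(inp, kernels, normalize=False):
    """Convolution-style mixing of all input channels with
    position-dependent multi-channel kernels.

    inp     : (H, W, Ci) multi-channel input array (e.g. RGB: Ci = 3).
    kernels : (H, W, Co, Ci, 2r+1, 2r+1); one multi-channel kernel per
              output position and output channel.
    Returns an (H, W, Co) output array.
    """
    H, W, Ci = inp.shape
    Co, k = kernels.shape[2], kernels.shape[4]
    r = k // 2
    padded = np.pad(inp, ((r, r), (r, r), (0, 0)), mode="edge")
    out = np.zeros((H, W, Co))
    for y in range(H):
        for x in range(W):
            # (k, k, Ci) patch -> (Ci, k, k) to match the kernel layout
            patch = padded[y:y + k, x:x + k].transpose(2, 0, 1)
            for co in range(Co):
                w = kernels[y, x, co]              # (Ci, k, k)
                acc = np.sum(patch * w)
                if normalize:
                    acc /= np.sum(w)               # W_L(x, y, c_o)
                out[y, x, co] = acc
    return out
```

Note that one scalar output per output channel is produced from all input channels together, in contrast to the per-channel aggregation layer.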
in one embodiment, the neural network layer 120 is configured to generate an array of output data values 121 that is smaller than the array of input data values 117. In other words, in an embodiment, the neural network 110 is configured to perform a decrement (i.e., resolution-reducing) operation based on the plurality of position-dependent kernels 118. Fig. 3 illustrates a decrement operation provided by the neural network 110 of the data processing apparatus 100 according to an embodiment. Using a decrement operation allows the receptive field to be increased through a cascade of smaller filters, rather than through a single layer containing kernels covering the same receptive field. This enables the neural network 110 to analyze the data better, by discovering more complex relationships in the data and by adding more non-linearity to the processing chain, since each convolutional layer can be separated by a non-linear element such as a sigmoid function or a rectified linear unit (ReLU).
In the decrementing operation illustrated in fig. 3, the neural network layer 120 may combine the input data values to generate a reduced resolution array of output data values. This may be achieved by convolving the array 117 of input data values with a position dependent kernel 118 with a step size S greater than 1. The step size S specifies the spacing between adjacent input spatial locations for which the convolution is calculated. If the step S is equal to 1, a convolution calculation is performed for each spatial position. If the step size S is greater than 1, the neural network layer 120 is configured to perform convolution calculations for every S-th spatial position of the array of input data values 117, thus reducing the output resolution (i.e., reducing the output data value array 121 by a factor S for each spatial dimension). The horizontal step size and the vertical step size may be the same or different.
In the exemplary embodiment illustrated in fig. 3, neural network layer 120 combines input data value array 117 according to a spatial region of size (2r +1) x (2r +1) to generate a corresponding output data value of output data value array 121. In this way, input data values 117 may be aggregated to package information presented in a larger spatial region into a single spatial location.
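The strided decrement operation can be sketched as follows. The ceil-division output size, edge padding, and the convention of one kernel per output position are assumptions of this example:

```python
import numpy as np

def strided_reduction(inp, kernels, stride):
    """Resolution-reducing (decrement) operation: the position-dependent
    convolution is evaluated only at every S-th spatial position.

    inp     : (H, W) input array.
    kernels : (Ho, Wo, 2r+1, 2r+1) kernels, one per OUTPUT position,
              where Ho = ceil(H / stride), Wo = ceil(W / stride).
    Returns a (Ho, Wo) output array, reduced by `stride` per dimension.
    """
    H, W = inp.shape
    k = kernels.shape[2]
    r = k // 2
    padded = np.pad(inp, r, mode="edge")
    Ho = -(-H // stride)                           # ceil division
    Wo = -(-W // stride)
    out = np.empty((Ho, Wo))
    for yo in range(Ho):
        for xo in range(Wo):
            y, x = yo * stride, xo * stride        # every S-th input position
            patch = padded[y:y + k, x:x + k]
            out[yo, xo] = np.sum(patch * kernels[yo, xo])
    return out
```

With stride S = 2, a 4 × 4 input yields a 2 × 2 output, i.e. each spatial dimension is reduced by the factor S.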
In the exemplary embodiment shown in fig. 2, the neural network 110 includes one or more leading layers 115 before the neural network layer 120 and one or more subsequent layers 125 after the neural network layer 120. In an embodiment, the neural network layer 120 may be the first and/or last data processing layer of the neural network 110, i.e., in an embodiment, there may be no preceding layer 115 and/or subsequent layer 125.
In an embodiment, the one or more preceding layers 115 may be other neural network layers and/or "traditional" pre-processing layers, such as feature extraction layers. Similarly, in an embodiment, the one or more subsequent layers 125 may be other neural network layers, such as deconvolution layers, and/or "traditional" post-processing layers.
As shown in the embodiment illustrated in fig. 2, one or more of the preceding layers 115 may be used to provide, i.e. generate, the plurality of position-dependent kernels 118 (see the lower signal path in fig. 2 from the guidance data 113 through the preceding layer 115 to the position-dependent kernels 118). In one embodiment, one or more of the preceding layers 115 may generate the plurality of position-dependent kernels 118 based on an original array 111 of original input data values, such as an original (e.g., 2D) image. As shown in fig. 2, in one embodiment, the original array 111 of original input data values may be the input data array 111 that is the original input to the neural network 110. In another embodiment, the one or more preceding layers 115 may be used to generate the plurality of position-dependent kernels 118 based on the original input data 111 of the neural network 110, while the original input data 111 of the neural network 110 is provided to the neural network layer 120 as the array of input data values 117 (in this embodiment, no preceding layer 115 is present in the upper signal path of fig. 2 from the original input array 111 to the input array 117). In other words, according to one embodiment, the original array 111 may form the input array 117.
As shown in fig. 2, in another embodiment, one or more of the preceding layers 115 of the neural network 110 are used to generate the plurality of position-dependent kernels 118 based on a guidance data array 113. A more detailed view of the processing steps of the neural network 110 of the data processing apparatus 100 provided by such an embodiment is shown in fig. 4 for the exemplary case of two-dimensional input and output arrays. One or more of the preceding layers 115 of the neural network 110 generate the plurality of position-dependent kernels w_L(x, y) 118 based on the guidance data array g(x, y) 113. As described in the context of fig. 2, the neural network layer 120 is configured to generate the two-dimensional array of output data values out(x, y) 121 based on the two-dimensional array of input data values in(x, y) 117 and the plurality of position-dependent kernels w_L(x, y) 118, which in turn are based on the guidance data array g(x, y) 113.
In one embodiment, one or more of the preceding layers 115 of the neural network 110 are neural network layers configured to learn the plurality of position-dependent kernels w_L(x, y) 118 based on the guidance data array g(x, y) 113. In another embodiment, one or more of the preceding layers 115 of the neural network 110 are pre-processing layers configured to generate the plurality of position-dependent kernels w_L(x, y) 118 based on the guidance data array 113 using one or more pre-processing schemes, such as feature extraction.
In one embodiment, one or more of the preceding layers 115 of the neural network 110 are used to generate the plurality of position-dependent kernels w_L(x, y) 118 based on the guidance data array g(x, y) 113, adopting an approach similar to bilateral filtering, as shown in fig. 5. Bilateral filtering is known in the field of image processing for performing a weighted aggregation of data while reducing the impact of certain input values on the aggregated result and amplifying the impact of other input values [M. Elad, "On the origin of the bilateral filter and ways to improve it", IEEE Transactions on Image Processing, vol. 11, no. 10, pp. 1141-1151, 2002]. As shown in fig. 5, the weights 518 for aggregating the input data value array 517 are derived from the guidance image data g 513, which provides additional information to control the aggregation process. In one embodiment, the guidance image data array 513 may be equal to the input data value array. Based on the weights 518, the output data value array 521 is generated using the layer 520. The bilateral filter weights 518 take into account the spatial distance of a kernel position from the kernel center and, in addition, the similarity of the data value at that position to the data value at the kernel center, mathematically described by the following equation:

\[ \mathrm{out}(x,y) = \frac{1}{W(x,y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} \mathrm{in}(x-i,\,y-j)\cdot w(x,y,i,j) \]

where the normalization factor is given by:

\[ W(x,y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w(x,y,i,j) \]

In one embodiment, the bilateral filter weights 518 are defined by the following equation:

\[ w(x,y,i,j) = \exp\!\left(-\frac{i^2 + j^2}{2\sigma_s^2} - \frac{d\big(g(x,y),\,g(x-i,y-j)\big)^2}{2\sigma_r^2}\right) \]

where d(·,·) denotes a distance function and σ_s, σ_r control the spatial and range sensitivity, respectively.
Fig. 6 shows a schematic diagram highlighting the main processing stages 601 of a data processing apparatus 100 provided by an embodiment, for example the data processing apparatus 100 providing the neural network 110 shown in fig. 2. As previously described, in a first processing step 603, the neural network 110 may generate a plurality of position-dependent kernels w based on the steering data array g (x, y)113L(x, y) 118. In a second processing step 605, the neural network 110 may be based on the array of input data values in (x, y)117 and the plurality of position-dependent kernels wL(x, y, i, j)118 generates output data value array out (x, y) 121.
Fig. 7 shows a schematic diagram of the neural network 110 provided by the data processing apparatus 100 according to another embodiment. The main difference from the embodiment shown in fig. 2 is that in the embodiment shown in fig. 7, the neural network 110 generates the plurality of position-dependent kernels 118 based on a plurality of position-independent kernels 119b (shown in fig. 8) and a plurality of position-dependent weights F_f(x, y) 119a (also referred to as similarity features 119a), as will be described in more detail below. In an embodiment, the similarity features 119a obtained from the guidance data 113 may represent deep knowledge about the input data 111, including, for example, semantic segmentation, instance object detection, or data importance indicators such as regions of interest (ROI), all either learned by the neural network 110 or provided as additional input to the neural network 110. In one embodiment, the neural network 110 of fig. 7 is configured to generate the plurality of position-dependent kernels 118 by adding the position-independent kernels 119b weighted with their associated position-dependent weights F_f(x, y) 119a.
In one embodiment, the position-independent kernels 119b may be predetermined or learned by the neural network 110. As shown in fig. 7, in this embodiment too, the neural network 110 may comprise one or more preceding layers 115 before the neural network layer 120, implemented as additional neural network layers or pre-processing layers. In one embodiment, one or more of the preceding layers 115 are configured to generate the plurality of position-dependent weights F_f(x, y) 119a based on an original array of original input data values or the guidance data 113. The original array of original input data values of the neural network 110 may comprise the array of input data values 117 to be processed by the neural network layer 120, or another array of input data values associated with the array of input data values 117, such as the initial or original input data array 111.
In the exemplary embodiment shown in fig. 7, the array of input data values in(x, y) 117 and the array of output data values out(x, y) 121 are both two-dimensional arrays, and the neural network layer 120 is configured to generate the respective kernels of the plurality of position-dependent kernels w_L(x, y, i, j) 118 based on the following equation:

\[ w_L(x,y,i,j) = \sum_{f=1}^{N_f} F_f(x,y)\cdot K_f(i,j) \]

where F_f(x, y) denotes the set of N_f position-dependent weights (or similarity features) 119a and K_f(i, j) denotes the plurality of position-independent kernels 119b, as also shown in fig. 8.
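The construction of position-dependent kernels from position-independent basis kernels K_f and similarity features F_f is a single weighted sum per position. A minimal sketch, with array layouts assumed for illustration:

```python
import numpy as np

def kernels_from_basis(F, K):
    """Build position-dependent kernels as weighted sums of
    position-independent basis kernels:

        w_L(x, y, i, j) = sum_f F_f(x, y) * K_f(i, j)

    F : (Nf, H, W) position-dependent weights (similarity features).
    K : (Nf, k, k) position-independent kernels.
    Returns (H, W, k, k) position-dependent kernels.
    """
    # contract over the feature index f
    return np.einsum("fhw,fij->hwij", F, K)
```

Each spatial position thus obtains its own kernel, yet only N_f small basis kernels and N_f weight maps need to be stored or learned.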
Fig. 9 shows a schematic diagram highlighting the main processing stages 901 of a data processing apparatus 100 provided by an embodiment, for example the data processing apparatus 100 providing the neural network 110 shown in figs. 7 and 8. As previously described, in a first processing step 903, the neural network 110 may generate the plurality of position-dependent weights or similarity features F_f(x, y) 119a based on the guidance data array g(x, y) 113. In a second processing step 905, the neural network 110 may generate the plurality of position-dependent kernels w_L(x, y, i, j) 118 based on the plurality of position-dependent weights or similarity features F_f(x, y) 119a and the plurality of position-independent kernels K_f(i, j) 119b. In a further step (not shown in fig. 9, but similar to processing step 605 shown in fig. 6), the neural network layer 120 may generate the array of output data values out(x, y) 121 based on the array of input data values in(x, y) 117 and the plurality of position-dependent kernels w_L(x, y, i, j) 118.
As indicated previously, in one embodiment, the neural network layer 120 of the neural network 110 may be implemented as a correlation network layer 120 configured to generate the array of output data values 121 from the array of input data values 117 and a further array of input data values, by correlating the array of input data values 117 with the further array of input data values and applying a respective kernel of the plurality of position-dependent kernels 118 to the respective sub-arrays of the array of input data values 117 and of the further array of input data values. In the case where the array of input data values 117, the further array of input data values and the plurality of position-dependent kernels 118 are respective two-dimensional arrays (as in the embodiments illustrated in figs. 2 and 7), the correlation network layer 120 may be configured to generate the array of output data values out(x, y) 121 based on the following equation:

\[ \mathrm{out}(x,y) = \frac{1}{W_L(x,y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} \mathrm{in1}(x-i,\,y-j)\cdot \mathrm{in2}(x-i,\,y-j)\cdot w_{L1}(x,y,i,j) \]

where in1(x−i, y−j) denotes the array of input data values 117, in2(x−i, y−j) denotes the further array of input data values, w_{L1}(x, y, i, j) denotes the plurality of position-dependent kernels 118, and r denotes the size of each of the plurality of position-dependent kernels 118 (in this example, each kernel has (2r+1) × (2r+1) kernel values). The output data values 121 may be normalized using the following normalization factor:

\[ W_L(x,y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_{L1}(x,y,i,j) \]
In other embodiments, the normalization factor may be omitted, i.e., set to 1. It will be appreciated that the above equations for a two-dimensional input array and square kernels are readily extended to the case of input value arrays 117 having one dimension or more than two dimensions and/or non-square rectangular kernels (i.e., with different horizontal and vertical sizes).
In another embodiment, the correlation network layer 120 is configured to generate the array of output data values 121 from the array of input data values 117 and the further array of input data values, by correlating the two arrays and applying, to the respective sub-arrays of the array of input data values 117 and of the further array of input data values, a respective kernel of the plurality of position-dependent kernels 118 associated with the array of input data values 117 and a respective further kernel of a plurality of further position-dependent kernels associated with the further array of input data values. In the case where the array of input data values 117 and the further array of input data values are respective two-dimensional arrays (as in the embodiments illustrated in figs. 2 and 7), the correlation network layer 120 may be configured to generate the array of output data values out(x, y) 121 based on the following equation:

\[ \mathrm{out}(x,y) = \frac{1}{W_{L12}(x,y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} \mathrm{in1}(x-i,\,y-j)\cdot w_{L1}(x,y,i,j)\cdot \mathrm{in2}(x-i,\,y-j)\cdot w_{L2}(x,y,i,j) \]

where in1(x−i, y−j) denotes the array of input data values 117, in2(x−i, y−j) denotes the further array of input data values, w_{L1}(x, y, i, j) denotes the plurality of position-dependent kernels 118, and w_{L2}(x, y, i, j) denotes the plurality of further position-dependent kernels associated with the further array of input data values. The output data values 121 may be normalized using the following normalization factor:

\[ W_{L12}(x,y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_{L1}(x,y,i,j)\cdot w_{L2}(x,y,i,j) \]
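Both correlation variants can be sketched in one function: passing only the first kernel set gives the single-kernel variant, passing both gives the two-kernel variant. Names, edge padding, and index orientation are assumptions of this example:

```python
import numpy as np

def correlation_layer(in1, in2, kernels1, kernels2=None, normalize=True):
    """Correlation of two input arrays with position-dependent kernels.

    in1, in2          : (H, W) input arrays (e.g. two co-located images).
    kernels1, kernels2: (H, W, 2r+1, 2r+1) position-dependent kernels;
                        kernels2 is optional (single-kernel variant).
    """
    H, W = in1.shape
    k = kernels1.shape[2]
    r = k // 2
    p1 = np.pad(in1, r, mode="edge")
    p2 = np.pad(in2, r, mode="edge")
    out = np.empty((H, W))
    for y in range(H):
        for x in range(W):
            # correlate co-located sub-arrays of the two inputs
            patch = p1[y:y + k, x:x + k] * p2[y:y + k, x:x + k]
            w = kernels1[y, x]
            if kernels2 is not None:
                w = w * kernels2[y, x]             # combined kernel weight
            acc = np.sum(patch * w)
            if normalize:
                acc /= np.sum(w)                   # normalization factor
            out[y, x] = acc
    return out
```

With normalization enabled and constant inputs, the output equals the product of the two input values, which is the expected correlation of two constant signals.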
in another embodiment, the neural network layer 120 is configured to: the input data value array 117 is processed using a maximum or minimum pooling scheme based on a plurality of location dependent kernels 118. More specifically, in such embodiments, the neural network layer 120 is to: a corresponding output data value of the array of output data values 121 is generated by determining a corresponding input data value of a corresponding sub-array of the plurality of sub-arrays of the array of input data values 117 associated with a maximum or minimum core value of a corresponding position dependent core of the plurality of position dependent cores 118 and using the corresponding determined input data value as the corresponding output data value.
In another embodiment, the neural network 110 performs a weighted aggregation of stereo matching costs using the neural network layer 120 provided according to one of the foregoing embodiments, in order to obtain a depth map from a stereo image pair. Cost aggregation is a common method for reducing noise and improving depth estimation results. Without additional weighting, object boundaries at depth discontinuities are typically subject to excessive smoothing. It is therefore desirable to preserve these boundaries by taking into account additional knowledge about the boundaries of objects in the scene. Thus, it is advantageous that the neural network layer 120 may use object features, e.g. obtained by semantic segmentation, as the guidance data 113 to determine object boundaries in the scene and to guide the aggregation of the input stereo matching costs, generating the aggregated stereo matching costs as output.
Fig. 10 shows a flowchart of a data processing method 1000 based on the neural network 110 according to an embodiment. The data processing method 1000 may be performed using the data processing apparatus 100 shown in fig. 1 and its various embodiments. The data processing method 1000 includes: step 1001 of generating an array of output data values 121 from the array of input data values 117 using the neural network layer 120 of the neural network 110 based on the plurality of position correlation kernels 118 and the plurality of subarrays of the array of input data values 117. Embodiments of the data processing method may be implemented and/or performed using one or more processors as described above.
Returning to the various embodiments described above, a first kernel is considered different from a second kernel if a kernel value at at least one position (or element) of the kernel value array of the first kernel differs from the kernel value at the same position (or element) of the kernel value array of the second kernel. Typically, a kernel applied to a sub-array of an input value array has the same size (number of elements, positions, or values per dimension) and dimensionality (number of dimensions N, with N ≥ 1) as that sub-array. Typically, the size and dimensionality of the different sub-arrays of the input value array are the same. Accordingly, the size and dimensionality of the different kernels are typically the same.
A first sub-array of an input value array is considered different from a second sub-array of the input value array if the first sub-array includes at least one element of the input value array that the second sub-array does not include. Typically, different sub-arrays of the array of input values differ by at least one column or row of elements of the array of input values. The different sub-arrays may or may not overlap, as shown in fig. 3.
Some additional details regarding the various aspects and embodiments (aggregation network layer, convolutional network layer, correlation network layer, and normalization) are provided below.
Aggregation
The proposed embodiments of guided aggregation can be applied to guided feature map reduction. Aggregation can be performed in a controlled manner by grouping input values that are features of a feature map, forming input data sub-arrays of the input data array, and using position-dependent kernels derived from the guidance data to generate a representative output feature value for each entire sub-array. In this way, when reducing the resolution of the input feature map, object boundaries and other detailed information that are typically lost during downscaling can be better preserved. In such cases, the guidance data represents information about object or region boundaries, obtained, for example, by color-based segmentation, by semantic segmentation using a preceding neural network layer, or from an edge map of the texture image corresponding to the processed feature map.
Convolution
The proposed embodiments of guided convolution can be applied to switchable feature extraction. The input values, which are features of a feature map, are convolved with adaptive feature extraction filters formed from the input guidance data in the form of position-dependent kernels. Thus, each selected region of the input feature map can be processed with a feature extraction filter generating only the features applicable to that region. Here, the guidance data in the form of similarity features represents information about objects or regions, obtained, for example, by color-based segmentation, by semantic segmentation using a preceding neural network layer, from an edge map of the texture image corresponding to the processed feature map, or from a region-of-interest (ROI) binary map.
Correlation
The proposed embodiments of guided correlation can be applied to the guided correlation of input feature maps. By using position-dependent kernels derived from the guidance data, input values that are features of two or more feature maps are correlated in a controlled manner, ensuring that certain features within a region of interest are amplified or attenuated. In this way, features corresponding to certain other objects or regions in the feature map can be excluded, or weighted so as to affect the computed result less, while certain features specific to the selected region can be amplified. Here, the guidance data represents information about objects or regions, obtained, for example, by color-based segmentation, by semantic segmentation using a preceding neural network layer, from an edge map of the texture image corresponding to the processed feature map, or from a region-of-interest (ROI) binary map.
Normalization
In general, normalization is advantageous if the output values obtained for different spatial positions are to be compared with each other directly, without any intermediate steps; maintaining the mean (DC) component then makes the values comparable. If no such comparison is performed, normalization is not required and only increases complexity. Furthermore, normalization may be omitted to simplify the computation when only an approximate result is needed.
While a particular feature or aspect of the invention may have been disclosed with respect to only one of several implementations or embodiments, such feature or aspect may be combined with one or more other features or aspects of the other implementations or embodiments as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms "include", "have", "with", or other variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprise". Also, the terms "exemplary" and "e.g." are merely meant as examples, rather than as the best or optimal. The terms "coupled" and "connected", along with their derivatives, may be used; it will be understood that these terms may indicate that two elements co-operate or interact with each other, regardless of whether they are in direct physical or electrical contact or not in direct contact with each other.
Although specific aspects have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that a variety of alternate and/or equivalent implementations may be substituted for the specific aspects shown and described without departing from the scope of the present invention. This application is intended to cover any adaptations or variations of the specific aspects discussed herein.
Although the elements in the claims below are recited in a particular sequence with corresponding labeling, unless the claim recitations otherwise imply a particular sequence for implementing some or all of the elements, the elements are not necessarily limited to being implemented in that particular sequence.
Many alternatives, modifications, and variations will be apparent to those skilled in the art in light of the foregoing teachings. Of course, those skilled in the art will readily recognize that there are numerous other applications of the present invention beyond those described herein. While the present invention has been described with reference to one or more particular embodiments, those skilled in the art will recognize that many changes may be made thereto without departing from the scope of the present invention. It is therefore to be understood that within the scope of the appended claims and their equivalents, the invention may be practiced otherwise than as specifically described herein.

Claims (11)

1. A data processing apparatus (100) for processing audio or image data, comprising:
a processor (101) for operating a neural network (110), wherein the neural network (110) comprises a neural network layer (120), the neural network layer (120) being configured to generate an array of output data values (121) from an array of input data values (117) in dependence on a plurality of position-dependent kernels (118; 119) and a plurality of sub-arrays of the array of input data values (117); wherein a position-dependent kernel is a kernel whose kernel weights depend on the respective position of the input data value sub-array; the array of input data values comprises a two-dimensional array or a one-dimensional array, wherein the two-dimensional array is image data and the one-dimensional array is audio data; in the two-dimensional array, the position is a spatial position defined in terms of two spatial coordinates; in the one-dimensional array, the position is a temporal position defined in terms of a time coordinate;
the plurality of location dependent kernels (118) are determined by presetting or learning, and the neural network (110) comprises an additional neural network layer (115), the additional neural network layer (115) being configured to: generating a plurality of position dependent weights (119a) based on an original array (111, 117) of raw input data values of the neural network (110), wherein the original array (111, 117) of raw input data values of the neural network (110) comprises the array (117) of input data values or another array (111) of input data values associated with the array (117) of input data values;
The additional neural network layer (115) is to: generating the plurality of location dependent kernels (118) based on a plurality of location independent kernels (119b) and a plurality of location dependent weights (119 a).
2. The data processing apparatus (100) of claim 1, wherein the neural network (110) comprises an additional neural network layer (115), the additional neural network layer (115) being configured to: generate the plurality of position-dependent kernels (118) based on an original array (111, 117) of original input data values of the neural network (110), wherein the original array (111, 117) of original input data values of the neural network (110) comprises the array (117) of input data values or another array (111) of input data values associated with the array (117) of input data values.
3. The data processing apparatus (100) of claim 2, wherein the additional neural network layer (115) is configured to: generate the plurality of position-dependent kernels (118) by adding the position-independent kernels (119b) weighted with the associated position-dependent weights (119a).
4. The data processing apparatus (100) of any of claims 1 to 3, wherein the array of input data values (117) and the array of output data values (121) are both two-dimensional arrays, the neural network layer (120) being configured to generate the plurality of position-dependent kernels w_L(x, y, i, j) (118) based on the following equation:

\[ w_L(x,y,i,j) = \sum_{f=1}^{N_f} F_f(x,y)\cdot K_f(i,j) \]

wherein F_f(x, y) represents the plurality of N_f position-dependent weights (119a) and K_f(i, j) represents the plurality of position-independent kernels (119b).
5. The data processing device (100) of claim 4, wherein the neural network layer (120) is a convolutional network layer or an aggregation network layer.
6. The data processing apparatus (100) of claim 5, wherein the array of input data values (117) and the array of output data values (121) are both two-dimensional arrays, wherein the neural network layer (120) is a convolutional network layer configured to generate the array of output data values (121) based on the following equation:

\[ \mathrm{out}(x,y,c_o) = \frac{1}{W_L(x,y,c_o)} \sum_{c_i} \sum_{i=-r}^{r} \sum_{j=-r}^{r} \mathrm{in}(x-i,\,y-j,\,c_i)\cdot w_L(x,y,c_o,c_i,i,j) \]

and

\[ W_L(x,y,c_o) = \sum_{c_i} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x,y,c_o,c_i,i,j) \]

or

\[ W_L(x,y,c_o) = 1, \]

wherein out(x, y, c_o) represents the array of output data values (121), in(x, y, c_i) represents the array of input data values (117), r represents the size of each kernel of the plurality of position-dependent kernels w_L(x, y, c_o, c_i, i, j), and W_L(x, y, c_o) represents a normalization factor; or,

wherein the neural network layer (120) is an aggregation network layer configured to generate the array of output data values (121) based on the following equation:

\[ \mathrm{out}(x,y) = \frac{1}{W_L(x,y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} \mathrm{in}(x-i,\,y-j)\cdot w_L(x,y,i,j) \]

and:

\[ W_L(x,y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x,y,i,j) \]

or

\[ W_L(x,y) = 1, \]

wherein out(x, y) represents the array of output data values (121), in(x, y) represents the array of input data values (117), r represents the size of each kernel of the plurality of position-dependent kernels w_L(x, y, i, j), and W_L(x, y) represents a normalization factor.
7. The data processing apparatus (100) of claim 6, wherein the neural network layer (120) is a correlation network layer configured to: generate the array of output data values (121) from the array of input data values (117) and a further array of input data values by:
associating the input data value array (117) with the further input data value array and employing a position dependent kernel of the plurality of position dependent kernels (118); or
Associating the input data value array (117) with the further input data value array and employing a position dependent kernel of the plurality of position dependent kernels (118) associated with the input data value array (117) and a further position dependent kernel of a plurality of further position dependent kernels associated with the further input data value array.
8. The data processing apparatus (100) of claim 7, wherein the array of input data values (117), the further array of input data values, and the plurality of position-dependent kernels (118) are respective two-dimensional arrays, wherein the neural network layer (120) is configured to generate the array of output data values (121) based on the following equation:

out(x, y) = (1 / W_L(x, y)) · Σ_{i=-r..r} Σ_{j=-r..r} w_L1(x, y, i, j) · in1(x+i, y+j) · in2(x+i, y+j),

and:

W_L(x, y) = Σ_{i=-r..r} Σ_{j=-r..r} w_L1(x, y, i, j),

or

W_L(x, y) = 1,

wherein out(x, y) represents the output data value array (121), in1(x, y) represents the input data value array (117), in2(x, y) represents the further input data value array, r defines the size of each kernel of the plurality of position-dependent kernels w_L1(x, y, i, j), and W_L(x, y) represents a normalization factor;

or

out(x, y) = (1 / W_L12(x, y)) · Σ_{i=-r..r} Σ_{j=-r..r} ( w_L1(x, y, i, j) · in1(x+i, y+j) + w_L2(x, y, i, j) · in2(x+i, y+j) ),

and:

W_L12(x, y) = Σ_{i=-r..r} Σ_{j=-r..r} ( w_L1(x, y, i, j) + w_L2(x, y, i, j) ),

or

W_L12(x, y) = 1,

wherein out(x, y) represents the output data value array (121), in1(x, y) represents the input data value array (117), in2(x, y) represents the further input data value array, r defines the size of each kernel of the plurality of position-dependent kernels w_L1(x, y, i, j) and of the plurality of further position-dependent kernels w_L2(x, y, i, j), and W_L12(x, y) represents a normalization factor.
9. The data processing apparatus (100) of claim 8, wherein the neural network layer (120) is configured to generate each output data value of the array of output data values (121) by determining, within the respective sub-array of the plurality of sub-arrays of input data values, the input data value associated with the maximum or minimum kernel value of the respective position-dependent kernel, and using that input data value as the respective output data value.
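As an illustrative aside, not part of the claims: the selection rule of claim 9 replaces the weighted sum with a pick, within each sub-array, of the input value at the position where the position-dependent kernel is largest (or smallest). A minimal sketch with hypothetical values:

```python
def select_by_max_kernel(sub, ker):
    """Return the input value whose corresponding kernel weight is largest.

    `sub` is one sub-array of input values, `ker` the position-dependent
    kernel evaluated for that sub-array (same shape).
    """
    best = max((ker[i][j], sub[i][j])
               for i in range(len(ker)) for j in range(len(ker[0])))
    return best[1]          # the input value paired with the max kernel value

sub = [[1.0, 2.0], [3.0, 4.0]]        # 2 x 2 sub-array of input values
ker = [[0.1, 0.9], [0.3, 0.2]]        # position-dependent kernel values
val = select_by_max_kernel(sub, ker)  # kernel max at (0, 1) selects input 2.0
```

Unlike plain max pooling, which picks the largest *input* value, this variant lets the learned kernel decide *which position* of the sub-array to pass through.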
10. A data processing method (1000) for processing audio or image data, comprising:
generating (1001), with a neural network layer (120) of a neural network (110), an array of output data values (121) from an array of input data values (117), based on a plurality of position-dependent kernels (118) and a plurality of sub-arrays of the array of input data values (117);

wherein the neural network (110) is configured to generate the plurality of position-dependent kernels (118) based on a plurality of position-independent kernels (119b) and a plurality of position-dependent weights (119a); wherein a position-dependent kernel is a kernel whose kernel weights depend on the position of the corresponding sub-array of input data values; wherein the array of input data values is a two-dimensional array or a one-dimensional array, a two-dimensional array being image data and a one-dimensional array being audio data; wherein, in a two-dimensional array, a position is a spatial position defined by two spatial coordinates, and, in a one-dimensional array, a position is a temporal position defined by a time coordinate;

wherein the plurality of position-dependent kernels (118) are predefined or determined by learning, and the neural network (110) comprises an additional neural network layer (115) configured to generate the plurality of position-dependent weights (119a) based on a raw array (111, 117) of raw input data values of the neural network (110), wherein the raw array (111, 117) of raw input data values of the neural network (110) comprises the input data value array (117) or another input data value array (111) associated with the input data value array (117).
11. A computer-readable storage medium having program code stored thereon, wherein the program code, when executed by a processor, performs the method (1000) of claim 10.
CN201780088904.6A 2017-03-24 2017-03-24 Neural network data processing device and method Active CN110462637B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2017/057088 WO2018171899A1 (en) 2017-03-24 2017-03-24 Neural network data processing apparatus and method

Publications (2)

Publication Number Publication Date
CN110462637A CN110462637A (en) 2019-11-15
CN110462637B true CN110462637B (en) 2022-07-19

Family

ID=58413093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201780088904.6A Active CN110462637B (en) 2017-03-24 2017-03-24 Neural network data processing device and method

Country Status (3)

Country Link
EP (1) EP3590076A1 (en)
CN (1) CN110462637B (en)
WO (1) WO2018171899A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10929665B2 (en) 2018-12-21 2021-02-23 Samsung Electronics Co., Ltd. System and method for providing dominant scene classification by semantic segmentation

Citations (2)

Publication number Priority date Publication date Assignee Title
CN105096279A (en) * 2015-09-23 2015-11-25 成都融创智谷科技有限公司 Digital image processing method based on convolutional neural network
CN106407903A (en) * 2016-08-31 2017-02-15 四川瞳知科技有限公司 Multiple dimensioned convolution neural network-based real time human body abnormal behavior identification method

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
CN106156845A (en) * 2015-03-23 2016-11-23 日本电气株式会社 A kind of method and apparatus for building neutral net
CN106156807B (en) * 2015-04-02 2020-06-02 华中科技大学 Training method and device of convolutional neural network model
US20160358069A1 (en) * 2015-06-03 2016-12-08 Samsung Electronics Co., Ltd. Neural network suppression
CN106485318B (en) * 2015-10-08 2019-08-30 上海兆芯集成电路有限公司 With mixing coprocessor/execution unit neural network unit processor
CN107563497B (en) * 2016-01-20 2021-03-19 中科寒武纪科技股份有限公司 Computing device and operation method for sparse artificial neural network
CN105913117A (en) * 2016-04-04 2016-08-31 北京工业大学 Intelligent related neural network computer identification method
CN106066783A (en) * 2016-06-02 2016-11-02 华为技术有限公司 The neutral net forward direction arithmetic hardware structure quantified based on power weight

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN105096279A (en) * 2015-09-23 2015-11-25 成都融创智谷科技有限公司 Digital image processing method based on convolutional neural network
CN106407903A (en) * 2016-08-31 2017-02-15 四川瞳知科技有限公司 Multiple dimensioned convolution neural network-based real time human body abnormal behavior identification method

Also Published As

Publication number Publication date
WO2018171899A1 (en) 2018-09-27
CN110462637A (en) 2019-11-15
EP3590076A1 (en) 2020-01-08

Similar Documents

Publication Publication Date Title
US9940548B2 (en) Image recognition method for performing image recognition utilizing convolution filters
US11687775B2 (en) Neural network data processing apparatus and method
US9384411B2 (en) Image processor with edge-preserving noise suppression functionality
Yildirim et al. FASA: fast, accurate, and size-aware salient object detection
CN109117825A (en) Lane line treating method and apparatus
EP3204888A1 (en) Spatial pyramid pooling networks for image processing
US11615612B2 (en) Systems and methods for image feature extraction
US10706547B2 (en) Image segmentation method and apparatus
US20120224789A1 (en) Noise suppression in low light images
US20150023607A1 (en) Gesture recognition method and apparatus based on analysis of multiple candidate boundaries
EP3836083B1 (en) Disparity estimation system and method, electronic device and computer program product
US20210248729A1 (en) Superpixel merging
CN112020725A (en) Method and apparatus for determining depth information image from input image
CN110462637B (en) Neural network data processing device and method
US11893710B2 (en) Image reconstruction method, electronic device and computer-readable storage medium
CN114049491A (en) Fingerprint segmentation model training method, fingerprint segmentation device, fingerprint segmentation equipment and fingerprint segmentation medium
EP2966613A1 (en) Method and apparatus for generating a super-resolved image from an input image
KR101592087B1 (en) Method for generating saliency map based background location and medium for recording the same
Otsuzuki et al. Regularized pooling
CN111860287A (en) Target detection method and device and storage medium
US10832076B2 (en) Method and image processing entity for applying a convolutional neural network to an image
Girish et al. One network doesn't rule them all: Moving beyond handcrafted architectures in self-supervised learning
CN111739025A (en) Image processing method, device, terminal and storage medium
US20230124075A1 (en) Methods, systems, and media for computer vision using 2d convolution of 4d video data tensors
Huang et al. Saliency Detection with Multi-Contextual Models and Spatially Coherent Loss Function

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant