WO2022196060A1 - Information processing device, information processing method, and non-transitory computer-readable medium - Google Patents
Information processing device, information processing method, and non-transitory computer-readable medium
- Publication number
- WO2022196060A1 (PCT/JP2022/000995)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- feature map
- feature
- information processing
- components
- unit
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/7715—Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
- G06V10/751—Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
- G06V10/7515—Shifting the patterns to accommodate for positional errors
Definitions
- the present invention relates to an information processing device, an information processing method, and a non-transitory computer-readable medium.
- Patent Document 1 describes using a neural network, which learns the relationship between classification information and features extracted from a sound source, language, or image, to provide a partial highlight section rather than the entire sound source section.
- the purpose of this disclosure is to improve the technology disclosed in prior art documents.
- An information processing apparatus according to one aspect of the present embodiment includes: extracting means for extracting, from a feature map, a first feature map relating to a first feature composed of a plurality of first components, a second feature map relating to a second feature composed of a plurality of second components, and a third feature map relating to a third feature; determining means for determining, for each first component, a correspondence indicating a plurality of corresponding second components by shifting a grid pattern, which indicates a plurality of the second components corresponding to one first component, on the second feature map based on the position of each first component; and reflecting means for reflecting a correlation between the first feature and the second feature, calculated from the correspondence, in the third feature map.
- An information processing method according to one aspect of the present embodiment includes: extracting, from a feature map, a first feature map relating to a first feature composed of a plurality of first components, a second feature map relating to a second feature composed of a plurality of second components, and a third feature map relating to a third feature; determining, for each first component, a correspondence indicating a plurality of corresponding second components by shifting a grid pattern, which indicates a plurality of the second components corresponding to one first component, on the second feature map based on the position of each first component; and reflecting, by the information processing apparatus, a correlation between the first feature and the second feature, calculated from the correspondence, in the third feature map.
- A non-transitory computer-readable medium according to one aspect of the present embodiment stores a program that causes an information processing apparatus to: extract, from a feature map, a first feature map relating to a first feature composed of a plurality of first components, a second feature map relating to a second feature composed of a plurality of second components, and a third feature map relating to a third feature; determine, for each first component, a correspondence indicating a plurality of corresponding second components by shifting a grid pattern, which indicates a plurality of the second components corresponding to one first component, on the second feature map based on the position of each first component; and reflect a correlation between the first feature and the second feature, calculated from the correspondence, in the third feature map.
- FIG. 1A is a schematic diagram showing a first related technique.
- FIG. 1B is a schematic diagram showing a second related technique.
- FIG. 1C is a schematic diagram illustrating an embodiment of this disclosure.
- FIG. 2 is a block diagram showing the hardware configuration of an information processing apparatus according to each embodiment.
- FIG. 3 is a block diagram showing the functional configuration of the information processing apparatus according to the first embodiment.
- FIG. 4 is a flow chart showing the flow of operations of the information processing apparatus according to the first embodiment.
- FIG. 5 is a block diagram showing the functional configuration of an information processing apparatus according to a second embodiment.
- FIG. 6 is a flow chart showing the flow of operations of the information processing apparatus according to the second embodiment.
- FIG. 7 is a schematic diagram showing in more detail the processing of the information processing apparatus according to the second embodiment.
- FIG. 8A is a drawing showing feature maps of queries and keys according to the second embodiment.
- FIG. 8B is a drawing showing feature maps of queries and keys according to the second embodiment.
- FIG. 8C is a drawing showing feature maps of queries and keys according to the second embodiment.
- FIG. 8D is a drawing showing feature maps of queries and keys according to the second embodiment.
- FIG. 9 is a flow chart showing the flow of detailed operations of a calculation unit according to the second embodiment.
- FIG. 10 is a block diagram showing the functional configuration of an information processing apparatus according to a third embodiment.
- FIG. 11 is a flow chart showing the flow of operations of the information processing apparatus according to the third embodiment.
- FIG. 12 is a block diagram showing the functional configuration of an information processing apparatus according to a fourth embodiment.
- FIG. 13 is a flow chart showing the flow of operations of the information processing apparatus according to the fourth embodiment.
- FIG. 14 is a block diagram showing the functional configuration of an information processing apparatus according to a fifth embodiment.
- FIG. 15 is a schematic diagram showing processing of an information processing apparatus according to a sixth embodiment.
- As a first related technique, the non-patent document X. Wang, R. Girshick, A. Gupta, K. He, "Non-Local Neural Networks", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794-7803, 2018, discloses a technique that improves feature extraction by using feature maps obtained from the convolutional layers of a convolutional neural network and weighting the feature maps with an attention mechanism.
- FIG. 1A is a schematic diagram showing a first related technique.
- FIG. 1A shows that, for one component (e.g., a pixel) i of a query, features are extracted by referencing the entire space of the key feature map.
- the entire space of the key feature map is taken into account, so it is possible to extract features over a wide area.
- the calculation cost increases because calculation is required for the entire feature map of the key.
- FIG. 1B is a schematic diagram showing a second related technique.
- FIG. 1B shows that a feature is extracted for one component i of a query by referring to a partial area AR in a key feature map.
- The partial area AR is an area consisting of the key component i corresponding to the query component i and its surrounding neighborhood.
- the second related technique can reduce the computational cost compared to the first related technique because the calculation of the correlation between the query and the key, which are the two embedded features, requires a smaller area to be calculated.
- However, since the partial area AR is a local area of the key feature map, another problem arises in that the advantage of global feature extraction, which is the original purpose of the attention mechanism, may be impaired.
- this technique can provide an information processing apparatus or the like that is capable of extracting features in consideration of the entire space of the input feature map and that can perform calculations at a low calculation cost.
- FIG. 1C is a schematic diagram showing one embodiment of this disclosure.
- FIG. 1C shows that for one component i of the query, features are extracted by referring to regions in a grid pattern (checkerboard pattern) distributed throughout the space of the key feature map.
- Here, a grid pattern is a pattern consisting of a plurality of reference areas of components in which the spacing between the nearest reference areas in a given direction is constant, on a map of any number of dimensions.
- In other words, the grid pattern can be said to be a lattice pattern in which each side of the unit rectangle (for example, a square) has an arbitrary length, with the reference areas indicated by the grid points of that lattice. It should be noted that one unit of the reference area in the grid pattern may be composed of one component of the key, or may be composed of a plurality of components of the key.
- In this embodiment, the entire space of the key feature map is considered, so it is possible to extract features over a wide area. Furthermore, since the area to be calculated is a part of the key feature map rather than the entire key feature map, the required calculation cost can be reduced. For example, if the total area of the grid pattern in FIG. 1C is made the same as the area of the partial area AR in FIG. 1B, the calculation cost can be made the same as the calculation cost of the second related technique.
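To make the cost comparison concrete, the following sketch (not taken from the patent; the map size and grid spacing are assumed for illustration, chosen to match the 9*9 / 3*3 example used later) counts how many key positions a single query pixel references under each of the three strategies.

```python
# Illustrative sketch: number of key positions referenced for one query pixel.
import numpy as np

N = 9   # height/width of the key feature map (assumed)
B = 3   # grid spacing = side of a query sub-region (assumed)

full_refs = N * N                                   # first related technique: whole map -> 81
local_refs = B * B                                  # second related technique: local window -> 9
grid_rows, grid_cols = np.meshgrid(
    np.arange(0, N, B), np.arange(0, N, B), indexing="ij")
grid_refs = grid_rows.size                          # this disclosure: grid pattern -> 9,
                                                    # but spread over the entire map
print(full_refs, local_refs, grid_refs)             # 81 9 9
```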
- the technology described in this disclosure is not limited to this example. In addition, this method can be applied to various uses as described later.
- the information processing apparatus 10 includes a processor 101, a RAM (Random Access Memory) 102, a ROM (Read Only Memory) 103, and a storage device 104.
- the information processing device 10 may further include an input device 105 and an output device 106 .
- Processor 101 , RAM 102 , ROM 103 , storage device 104 , input device 105 and output device 106 are connected via data bus 107 .
- This data bus 107 is used for transmitting and receiving data between connected components.
- the processor 101 reads a computer program.
- processor 101 is configured to read a computer program stored in at least one of RAM 102 , ROM 103 and storage device 104 .
- the processor 101 may read a computer program stored in a computer-readable recording medium using a recording medium reader (not shown).
- the processor 101 may acquire a computer program (that is, may read a computer program) from a device (not shown) arranged outside the information processing device 10 via a network interface.
- the processor 101 controls the RAM 102, the storage device 104, the input device 105 and the output device 106 by executing the read computer program.
- the processor 101 may implement functional blocks for executing various types of processing relating to feature amounts. This functional block will be described in detail in each embodiment.
- The processor 101 may be, for example, at least one of a CPU (Central Processing Unit), an MPU (Micro Processing Unit), a GPU (Graphics Processing Unit), an FPGA (Field-Programmable Gate Array), a DSP (Digital Signal Processor), and an ASIC (Application Specific Integrated Circuit).
- the RAM 102 is a memory that temporarily stores computer programs executed by the processor 101 .
- the RAM 102 may also temporarily store data temporarily used by the processor 101 while the processor 101 is executing the computer program.
- the RAM 102 may be, for example, a RAM such as a DRAM (Dynamic Random Access Memory) or an SRAM (Static Random Access Memory). Also, other types of volatile memory may be used instead of RAM.
- the ROM 103 is a memory that stores computer programs executed by the processor 101 .
- the ROM 103 may also store other fixed data.
- the ROM 103 may be a ROM such as PROM (Programmable ROM), EPROM (Erasable Programmable Read Only Memory), for example. Also, other types of non-volatile memory may be used instead of the ROM.
- the storage device 104 stores data that the information processing device 10 saves for a long time.
- Storage device 104 may act as a temporary storage device for processor 101 .
- the storage device 104 may include, for example, at least one of a hard disk device, a magneto-optical disk device, an SSD (Solid State Drive), and a disk array device.
- the input device 105 is a device that receives input instructions from the user of the information processing device 10 .
- Input device 105 may include, for example, at least one of a keyboard, mouse, and touch panel.
- the input device 105 may be a dedicated controller (operation terminal).
- the input device 105 may include a terminal owned by the user (for example, a smart phone, a tablet terminal, or the like).
- the input device 105 may be a device capable of voice input including, for example, a microphone.
- the output device 106 is a device that outputs information about the information processing device 10 to the outside.
- the output device 106 may be a display device (eg, display) capable of displaying information about the information processing device 10 .
- the display device here may be a television monitor, a personal computer monitor, a smart phone monitor, a tablet terminal monitor, or a monitor of other portable terminals.
- the display device may be a large monitor, digital signage, or the like installed in various facilities such as stores.
- the output device 106 may be a device that outputs information in a format other than an image.
- the output device 106 may be a speaker that outputs information about the information processing device 10 by voice.
- FIG. 3 is a block diagram showing the functional configuration of the information processing apparatus according to the first embodiment;
- The information processing apparatus 11 according to the first embodiment includes an attention mechanism unit 110 as a processing block for realizing its functions.
- The attention mechanism unit 110 includes an extraction unit 111, a determination unit 112, and a reflection unit 113.
- each of the extracting unit 111, the determining unit 112, and the reflecting unit 113 may be realized by the above-described processor 101 (see FIG. 2).
- the processor 101 functions as a component of each of the extraction unit 111, the determination unit 112, and the reflection unit 113 by reading and executing computer programs.
- The extraction unit 111 extracts, from the feature map input to the attention mechanism unit 110, a first feature map relating to a first feature composed of a plurality of first components, a second feature map relating to a second feature composed of a plurality of second components, and a third feature map relating to a third feature.
- the first feature, the second feature, and the third feature may be, for example, queries, keys, and values, respectively.
- the first feature map, the second feature map, and the third feature map are the query feature map, the key feature map, and the value feature map, respectively.
- each feature and feature map is not limited to this example.
- The determination unit 112 determines, for each first component, a correspondence indicating a plurality of corresponding second components. Specifically, the determination unit 112 determines this correspondence by shifting a grid pattern, which indicates a plurality of second components corresponding to one first component, on the second feature map based on the position of each first component.
- the definition of the grid pattern is as described above.
- a correlation between the first feature and the second feature is calculated from the correspondence determined by the determination unit 112 .
- the reflecting unit 113 performs processing to reflect this correlation in the third feature map.
- the information processing apparatus 10 can extract features in the input feature map.
- FIG. 4 is a flow chart showing the operation flow of the information processing apparatus 11 according to the first embodiment.
- First, the extraction unit 111 extracts, from the input feature map, a first feature map relating to the first feature, a second feature map relating to the second feature, and a third feature map relating to the third feature (step S11; extraction step).
- the determination unit 112 determines a correspondence relationship indicating a plurality of corresponding second components for each first component (step S12; determination step). Specifically, as described above, the determiner 112 determines this correspondence by shifting the grid pattern on the second feature map based on the position of each first component.
- the reflecting unit 113 reflects the correlation between the first feature and the second feature calculated from the correspondence relationship in the third feature map (step S13; reflecting step).
- As described above, the determination unit 112 uses a grid pattern indicating a plurality of second components corresponding to one first component to determine, for each first component, a correspondence indicating the corresponding plurality of second components.
- the reflecting unit 113 reflects the correlation calculated from the correspondence determined by the determining unit 112 in the third feature map. Therefore, the information processing apparatus 11 does not need to perform calculations for the entire area of the second feature map for each first component in the calculation based on the correspondence relationship, so the amount of calculation required for processing can be reduced.
- Further, since the grid pattern allows reference not only to a local area but to a wide area of the second feature map, the information processing apparatus 11 can extract wide-area features of the second feature map.
- Here, the attention mechanism is a technique that reflects the correlation between extracted features back into the extracted features.
- If the correlation is calculated over the entire key feature map, as in the first related technique, the computational cost increases.
- If only a local area is referenced, as in the second related technique, the advantage of the attention mechanism in feature extraction may be lost.
- In contrast, the information processing apparatus 11 is capable of feature extraction that considers the entire space of the input feature map, and can perform the calculation at a low calculation cost.
- FIG. 5 is a block diagram showing the functional configuration of an information processing apparatus according to the second embodiment.
- The information processing apparatus 12 according to the second embodiment includes an attention mechanism unit 120 as a processing block for realizing its functions.
- The attention mechanism unit 120 includes an extraction unit 121, a calculation unit 122, an aggregation unit 123, and an output unit 124.
- Each of the extraction unit 121, the calculation unit 122, the aggregation unit 123, and the output unit 124 may be implemented by the above-described processor 101 (see FIG. 2). That is, the processor 101 functions as each of the extraction unit 121, the calculation unit 122, the aggregation unit 123, and the output unit 124 by reading and executing a computer program.
- The extraction unit 121 corresponds to the extraction unit 111 in the first embodiment. Specifically, the extraction unit 121 acquires a feature map (feature amount), which is the input data to the attention mechanism unit 120, and extracts from the acquired feature map the query, key, and value feature maps, which are the three embedded features necessary for the processing of the attention mechanism.
- The extraction unit 121 may use, for example, a convolutional layer or a fully connected layer used in a convolutional neural network. Furthermore, an arbitrary layer constituting a convolutional neural network may be provided before the extraction unit 121, and the output from that layer may be input to the extraction unit 121 as the feature map.
- the extraction unit 121 outputs the extracted query and key to the calculation unit 122 and outputs the value to the aggregation unit 123 .
- the calculation unit 122 corresponds to the determination unit 112 in the first embodiment. Specifically, the calculation unit 122 calculates a correlation (for example, Matmul) between the query and the key using the embedded feature of the extracted query and key.
- the computing unit 122 can refer to the entire space of the input feature map in the computation process.
- Note that the grid pattern in the second embodiment is a lattice pattern whose unit is a square, and one grid point (one unit of the reference area) is composed of one component of the key.
- For example, the calculation unit 122 may obtain the correlation by performing tensor shape transformation (reshape) on the embedded features of the query and the key and then calculating their matrix product. Alternatively, the calculation unit 122 may calculate the correlation by combining the two embedded features after performing tensor shape transformation on the embedded features of the query and the key. The calculation unit 122 then performs convolution and a Rectified Linear Unit (ReLU) calculation on the matrix product or the combined features calculated as described above, thereby obtaining the final feature map indicating the correlation.
- The calculation unit 122 may further include a convolution layer for this convolution. Further, the calculation unit 122 may normalize the obtained feature map indicating the correlation to the range from 0 to 1 by using a sigmoid function, a softmax function, or the like, or may omit such normalization. The feature map indicating the calculated correlation is input to the aggregation unit 123.
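The following is a minimal sketch of the matrix-product variant of this correlation calculation, assuming the grid-pattern correspondence (key indices per query position) has already been determined. All names, shapes, and the use of softmax normalization are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def correlation(query, key, key_indices, normalize=True):
    """query, key: (HW, C) embedded features; key_indices: (HW, K) grid-pattern positions."""
    gathered = key[key_indices]                        # (HW, K, C): keys referenced per query position
    scores = np.einsum("nc,nkc->nk", query, gathered)  # per-position matrix product (dot products)
    if normalize:                                      # optional normalization to the 0..1 range
        scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
        scores /= scores.sum(axis=-1, keepdims=True)
    return scores                                      # (HW, K) correlation (weight) map
```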
- The aggregation unit 123 corresponds to the reflection unit 113 in the first embodiment. More specifically, the aggregation unit 123 uses the feature map indicating the correlation calculated by the calculation unit 122 and the value, which is the embedded feature extracted by the extraction unit 121, to perform processing that reflects the correlation between the query and the key in the feature map of the value. This processing reflects the correlation by calculating the Hadamard product of the feature map of the correlation (weight) calculated by the calculation unit 122 and the value.
- a feature map reflecting the correlation is input to the output unit 124 .
- the output unit 124 performs adjustment processing for passing the calculated feature map to the feature extraction unit at the latter stage of the attention mechanism unit 120 .
- the output unit 124 mainly executes linear transformation processing and residual processing as adjustment processing.
- the output unit 124 may process the feature map by using a 1 ⁇ 1 convolutional layer or a fully connected layer as linear transformation processing. However, the output unit 124 may perform residual processing without performing this linear transformation processing.
- For example, the output unit 124 may add, as the residual processing, the features input to the extraction unit 121 to the feature map output from the aggregation unit 123. This is so that a feature map is still produced from the output unit 124 even if no correlation is calculated.
- That is, when 0 is calculated as the correlation (weight), the value is multiplied by 0, so the feature value becomes 0 (disappears) in the feature map output by the aggregation unit 123. Therefore, the output unit 124 performs residual processing to add the features of the input map so that the feature values do not become 0 even if 0 is calculated as the correlation.
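A hedged sketch of the aggregation and residual steps follows: the correlation weights are applied to the gathered values (a weighted counterpart of the Hadamard-product step), and the unit's input features are added back as the residual. Function and argument names are assumptions made for illustration.

```python
import numpy as np

def aggregate_and_output(weights, value, key_indices, unit_input):
    """weights: (HW, K); value, unit_input: (HW, C); key_indices: (HW, K)."""
    gathered = value[key_indices]                           # (HW, K, C): values referenced per query
    reflected = np.einsum("nk,nkc->nc", weights, gathered)  # correlation reflected into the values
    return unit_input + reflected                           # residual processing: the output keeps the
                                                            # input features even when the weights are 0
```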
- the output unit 124 outputs the adjusted feature map as output data.
- FIG. 6 is a flow chart showing the operation flow of the information processing apparatus according to the second embodiment.
- the extraction unit 121 first extracts embedded features from the input feature map (step S21).
- the calculation unit 122 uses the query and the key, which are the extracted embedding features, to calculate features indicating the correlation between the two (step S22).
- the aggregation unit 123 reflects the correlation on the value, which is the input feature (step S23).
- the output unit 124 adjusts the response values of the feature map in order to output the feature map extracted by the aggregation unit 123 (step S24).
- FIG. 7 is a schematic diagram showing the processing of the information processing device 12 in more detail, and the details of the processing will be explained using this diagram.
- the feature map input to the attention mechanism unit 120 is separated into query, key, and value feature maps by the extractor 121 .
- the calculation unit 122 calculates a feature that indicates the correlation between the query and the key.
- the aggregation unit 123 reflects the calculated correlations on the values extracted by the extraction unit 121 to generate a feature map.
- Finally, the output unit 124 performs linear transformation processing and residual processing on the feature map to adjust the response values of the feature map and generate a new feature map. Note that the arrows shown in FIG. 7 simply indicate the flow of data described in this embodiment and do not preclude data processing in other modes within the attention mechanism unit 120. In other words, the depiction of FIG. 7 does not exclude bi-directional exchange of data between portions of the attention mechanism unit 120.
- The technique described in this disclosure uses a grid pattern when determining the reference positions of the key corresponding to a specific position i of the query. Specifically, by shifting the grid pattern within the small regions (divided regions) of the query feature map (first feature map) and referring to the key feature map (second feature map), the calculation unit 122 can refer to all features in the space of the key.
- the calculation unit 122 can equally refer to the entire space of keys within each subregion of the query.
- In the following, it is assumed that the input data is image data and its constituent elements are pixels. The horizontal direction in each square feature map is set as the x direction, and the vertical direction is set as the y direction.
- FIG. 8A shows the reference positions of a plurality of keys when the query-side position i is taken as the base position.
- The area surrounded by a thick line in the query in FIG. 8A indicates a square 3*3 area A that is a small area (block area) of the query, and the areas surrounded by thick lines in the key indicate the reference areas for the query position i.
- The base position of the query is the upper-left pixel in area A.
- The calculation unit 122 refers to the key embedded features coarsely, in a grid-like manner; the keys actually referenced are 9 pixels within a 7*7 reference area of the key.
- The calculation unit 122 determines the reference positions of the key using the size N*N of the query and key feature maps and the division number S.
- When the size of each small region (block) of the query is B, the skip width of the reference areas in the key (the size of the grid, that is, the amount of positional deviation in the x-axis direction or the y-axis direction between the closest key components to be referenced) is also B.
- Using these values, the calculation unit 122 calculates the grid pattern for the base position.
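As a small worked example of these quantities (the values are assumed so as to match FIG. 8A; B = N / S is an assumption consistent with that figure, not an explicit statement of the patent):

```python
N = 9                                      # size of the query/key feature maps (N x N), assumed
S = 3                                      # number of divisions per axis, assumed
B = N // S                                 # sub-region size and grid skip width: 3
reference_grid = [b * B for b in range(S)]
print(B, reference_grid)                   # 3 [0, 3, 6] -> 3 x 3 = 9 key pixels in a 7 x 7 area
```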
- FIG. 8B shows the key reference positions when the query position in area A is shifted from the base position.
- Position 1 on the query side is the position when the query is shifted from the base position by +1 in the x-axis direction, and position 2 on the query side is the position when the query is shifted from the base position by +2 in the x-axis direction and +2 in the y-axis direction.
- The calculation unit 122 shifts the reference positions of the key by the same amount as the shift amount (movement amount) of the query in the x-axis and y-axis directions.
- That is, when the query is at position 1, the calculation unit 122 shifts the key grid pattern (reference positions) by +1 in the x-axis direction, and when the query is at position 2, it shifts the key grid pattern (reference positions) by +2 in the x-axis direction and +2 in the y-axis direction.
- With this processing, the calculation unit 122 can refer to the entire space of the key feature map within the small region of the query.
- FIG. 8C shows a state in which the query feature map is divided into 9 small areas A to I.
- For each query in each of the small areas B to I of the query, the calculation unit 122 derives the amount of deviation in the x-axis direction and the y-axis direction from the upper-left position of that small area. Then, as with each query in small area A, the calculation unit 122 determines the key references corresponding to each query in each of the small areas B to I by shifting the grid pattern in the key feature map using that deviation amount.
- the same hatched locations in the query map of FIG. 8C refer to the same locations in the grid pattern in the within-key feature map.
- the computing unit 122 can evenly refer to the entire space of the embedded feature map of the key within each small region in the query.
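The following sketch shows one way the shifted grid pattern could be assigned to every query position so that, within each sub-region, the union of referenced keys covers the whole key feature map. The indexing conventions, the helper name, and the relation B = N / S are assumptions made for illustration.

```python
import numpy as np

def grid_correspondence(N, S):
    """Return, for each query position, the flattened key positions of its grid pattern."""
    B = N // S                                       # sub-region size = grid skip width (assumed)
    base = np.arange(S) * B                          # grid rows/columns for the base position
    indices = np.empty((N, N, S * S), dtype=int)
    for qy in range(N):
        for qx in range(N):
            dy, dx = qy % B, qx % B                  # offset from the sub-region's upper-left pixel
            ky, kx = np.meshgrid(base + dy, base + dx, indexing="ij")
            indices[qy, qx] = (ky * N + kx).ravel()  # shifted grid pattern for this query
    return indices

idx = grid_correspondence(9, 3)
# The queries of one 3 x 3 sub-region together reference every one of the 81 key positions:
assert set(idx[0:3, 0:3].ravel()) == set(range(81))
```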
- the regularization method introduced by the technology described in this disclosure will be described.
- In the processing so far, the position of the grid pattern corresponding to each query is fixed. Therefore, if there is no posture change or positional deviation of the object in the input image data during learning but such a change or deviation occurs in the input image data during operation, the calculation unit 122 may not be able to accurately extract the features. To prevent this, the calculation unit 122 randomly shuffles (replaces) the grid pattern of the key corresponding to each query with a certain probability.
- FIG. 8D shows that some of the sub-regions B and F have been shuffled with respect to the example shown in FIG. 8C.
- A shuffled part of small area B is shown as area S1, and a shuffled part of small area F is shown as area S2.
- the multiple keys to be shuffled are in the same small area. Thereby, the calculation unit 122 can reliably execute the shuffle process.
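One possible form of this shuffle regularization is sketched below: with probability p, the grid-pattern assignments of the queries inside a sub-region are permuted among themselves, so the shuffled keys stay within the same small area. The probability value and the exact shuffling granularity are assumptions, not taken from the patent.

```python
import numpy as np

def shuffle_grids(indices, B, p=0.1, rng=None):
    """indices: (N, N, K) key positions per query; B: sub-region size; p: shuffle probability."""
    rng = np.random.default_rng() if rng is None else rng
    out = indices.copy()
    N = indices.shape[0]
    for by in range(0, N, B):                        # visit each query sub-region
        for bx in range(0, N, B):
            if rng.random() < p:
                block = out[by:by + B, bx:bx + B].reshape(-1, indices.shape[-1])
                rng.shuffle(block)                   # permute grid assignments inside the sub-region
                out[by:by + B, bx:bx + B] = block.reshape(B, B, -1)
    return out
```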
- FIG. 9 is a flow chart showing the detailed operation flow of the calculation unit 122 .
- First, the calculation unit 122 calculates a grid pattern for the base position using the embedded features of the key (step S25). Then, the calculation unit 122 shifts the calculated grid pattern using the amount of deviation from the base position within the query small region, thereby assigning a grid pattern to all the elements within the query small region (step S26).
- the computing unit 122 allocates grid patterns to all other small regions of the query in a similar manner (step S27). Then, the calculation unit 122 introduces a process of shuffling the grid pattern to be assigned at an arbitrary position within the key block with a certain probability (step S28). The details of each of these steps are as described in the description of FIGS. 8A to 8D. As described above, the calculation unit 122 assigns the grid pattern of the query to each position of the feature map of the query.
- In the technique of Non-Patent Document 1, in order to refer to all features for a pixel i at a specific position of the query, it is necessary to refer to every spatial location of the embedded feature of the key for that pixel i.
- Therefore, when the input to the attention mechanism is an image or other two-dimensional feature map, the amount of computation depends on the input resolution, and handling high-resolution inputs becomes difficult.
- In the technique of Non-Patent Document 2, in order to reduce the amount of computation that depends on the resolution, only the key positions of a local area (about 7*7) are referenced for a pixel i at a specific position of the query. This greatly reduces the amount of calculation to be performed.
- In contrast, the technique described in this disclosure efficiently uses the grid pattern to cover the entire space of the feature map with a smaller amount of calculation than the technique of Non-Patent Document 1 (for example, the same amount of calculation as that of Non-Patent Document 2). This makes it easier for the information processing device to refer to the wide-area feature space, so the feature extraction capability of the attention mechanism can be improved.
- the information processing apparatus 12 exhibits a remarkable technical effect of being able to suppress such a state in which the computational processing load becomes extremely large.
- the calculation unit 122 (determination unit) can determine the correspondence relationship between the query component (first component) and the key component (second component) as follows.
- That is, the calculation unit 122 shifts the grid pattern on the key feature map based on the position of each query component such that each key component corresponds to at least one query component.
- the computing unit 122 can evenly refer to the entire space of the key feature map. Therefore, the attention mechanism unit 120 can extract all features of the input data.
- the computing unit 122 can determine the correspondence between query components and key components as follows.
- That is, the calculation unit 122 divides the query feature map (first feature map) into a plurality of sub-regions (divided regions), and shifts the grid pattern on the key feature map based on the position of each query component so that each key component corresponds to at least one of the query components in each sub-region.
- the computing unit 122 can evenly refer to the entire space of the feature map of the key each time it refers to a small area of the query. Therefore, the attention mechanism unit 120 can broadly extract the features of the input data without bias.
- Further, the calculation unit 122 can determine the correspondence by shifting the grid pattern on the key feature map based on the position of each query component so that each key component corresponds to one of the query components in each small region. Therefore, the attention mechanism unit 120 can extract the features of the input data evenly.
- Further, the calculation unit 122 can shift the grid pattern on the key feature map based on the position of each query component as follows. That is, the calculation unit 122 sets the query components in a one-to-one correspondence between all sub-regions, and shifts the grid pattern so that, for corresponding query components, the grid pattern is placed at the same position on the key feature map. By using such a simple setting for the way the grid pattern is shifted, the calculation unit 122 can reduce the calculation cost for evenly referencing the features of the input data.
- Further, the calculation unit 122 may determine the correspondence by shuffling, with a predetermined probability, the positions on the key feature map of the grid pattern determined according to the position of each query component. This enables the attention mechanism unit 120 to perform feature extraction that is robust against posture changes and positional deviations of objects in the input image data.
- the calculation unit 122 can configure a query subregion with a congruent figure (for example, a square) that includes a plurality of key components.
- the calculation unit 122 can reduce the calculation cost for evenly referencing the features of the input data by simplifying the setting of the small areas.
- a third embodiment will be described below with reference to the drawings.
- the third embodiment shows an example in which the information processing apparatus 11 constructs one network by repeatedly stacking the attention mechanism units 120 shown in the second embodiment.
- In the third to fifth embodiments, specific application examples of the attention mechanism unit 120 shown in the second embodiment will be described. Therefore, in the descriptions of the third to fifth embodiments, only some configurations and processes that differ from the second embodiment are described; for the other configurations and processes that are not described, those common to the second embodiment may be applied.
- constituent elements denoted by the same reference numerals perform the same processing.
- FIG. 10 is a block diagram showing a functional configuration using the information processing device 13.
- the information processing device 13 comprises a convolution unit (feature extraction unit) 200 and a plurality of attention mechanism units 120 .
- By providing the convolution unit 200, as used in a convolutional neural network, at the frontmost stage of the information processing device 13, the information processing device 13 can extract a feature map from the input image.
- the convolution unit 200 is a unit that performs feature extraction by using a convolution layer of local kernels (approximately 3 ⁇ 3) on the key feature map.
- After that, the attention mechanism unit 120 is repeatedly arranged in the information processing device 13 a specified number of times.
- the entire network is constructed by arranging an output layer (not shown) that outputs some result for the input image in the information processing device 13 .
- FIG. 11 is a flow chart showing the operation flow of the information processing device 13 according to the third embodiment.
- the convolution unit 200 first extracts a feature map from the input image data (step S31). Subsequently, the feature map output in step S31 is input to the attention mechanism unit 120 and converted into a new feature map in the attention mechanism unit 120 (step S32). Step S32 is repeatedly executed N times, which is the specified number of times (that is, the number of times attention mechanism unit 120 is provided), thereby extracting a new feature map. Subsequently, after finishing all the processes of the attention mechanism unit 120, the information processing device 13 obtains a response value from the final output layer (step S33).
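The overall flow of FIG. 11 can be summarized with the following structural sketch; it is a hedged illustration in which the callables and their composition are placeholders, not the patent's implementation.

```python
def build_network(conv_unit, make_attention_unit, output_layer, n_units):
    """Compose the third-embodiment pipeline: one convolution unit, N attention units, an output layer."""
    attention_units = [make_attention_unit() for _ in range(n_units)]

    def forward(image):
        feature_map = conv_unit(image)            # step S31: initial feature extraction
        for unit in attention_units:              # step S32: repeated N times
            feature_map = unit(feature_map)
        return output_layer(feature_map)          # step S33: response value from the output layer

    return forward
```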
- a network is constructed using a plurality of attention mechanism units 120.
- the attention mechanism unit 120 can refer to the global feature space with a small amount of computation. Therefore, the information processing device 13 can construct a network specialized for extracting features from the entire image. Specifically, the information processing device 13 is considered particularly effective for tasks that require feature extraction from wide-area information, such as image recognition tasks for recognizing landscapes.
- The fourth embodiment shows an example of constructing a network by repeatedly stacking the attention mechanism unit 120, which is the technique described in this disclosure, and a convolution unit (feature extraction unit) 200.
- the convolution unit 200 is a unit that performs feature extraction using a convolution layer of local kernels (approximately 3 ⁇ 3), as described above.
- FIG. 12 is a block diagram showing the functional configuration of the information processing device 14 including the attention mechanism unit 120 and the convolution unit 200.
- the information processing device 14 can extract the feature map from the input image.
- the attention mechanism unit 120 and the convolution unit 200 are repeatedly arranged for a specified number of times.
- the designer can freely determine the order in which the attention mechanism unit 120 and the convolution unit 200 are arranged, and how to arrange which of them in succession.
- In the example shown in FIG. 12, a plurality of groups are provided in the information processing device 14, each having the attention mechanism unit 120 at the front stage and the convolution unit 200b at the rear stage.
- one network is constructed by arranging an output layer (not shown) that outputs some result for the input image in the information processing device 14 .
- FIG. 13 is a flow chart showing the operation flow of the information processing device 14 according to the fourth embodiment.
- In step S41, the front-stage convolution unit 200X extracts a feature map from the input image data.
- In step S42, the feature map output in step S41 is input to the attention mechanism unit 120 or the convolution unit 200 at the subsequent stage and is converted into a new feature map in each unit.
- Step S42 is repeatedly executed N times, which is the specified number of times (that is, N times, which is the number of attention mechanism units 120 and convolution units 200 provided), and a new feature map is extracted each time.
- In step S43, the information processing device 14 obtains response values from the final output layer.
- a network is constructed by using the attention mechanism unit 120 and the convolution unit 200 of the technology described in this disclosure.
- The convolution unit 200 performs feature extraction using a convolution layer with a local kernel (approximately 3×3) as a kernel within a predetermined range, so feature extraction focusing on a local region of the data is possible. Therefore, the information processing device 14 can construct a network that enables feature extraction from two viewpoints: the entire image and local areas of the image.
- the information processing device 14 can improve various types of recognition performance, such as general object recognition and object detection, in situations where objects of various types and sizes are mixed in an image.
- The fifth embodiment constructs a network by repeatedly stacking the attention mechanism unit 120, which is the technique described in this disclosure, and a patch-based attention mechanism unit (feature extraction unit) 210.
- The patch-based attention mechanism unit 210 applies the patch-based attention mechanism described in Non-Patent Document 2 and, as shown in FIG. 14, is a unit that performs feature extraction using a convolution layer of local kernels (approximately 7×7). Note that the description of the patch-based attention mechanism in Non-Patent Document 2 is incorporated in this disclosure.
- FIG. 14 is a block diagram showing the functional configuration of the information processing device 15 including the attention mechanism unit 120, the convolution unit 200 and the patch-based attention mechanism unit 210.
- As shown in FIG. 14, by providing the convolution unit 200 at the frontmost stage of the information processing device 15, a feature map can be extracted from the input image. Then, the attention mechanism unit 120 and the patch-based attention mechanism unit 210 are repeatedly arranged at the succeeding stages N times, which is the designated number of times.
- the designer can freely determine the order of arranging the attention mechanism unit 120 and the patch-based attention mechanism unit 210 and how to arrange which of them in succession.
- The information processing device 15 includes a plurality of groups in which the attention mechanism unit 120 is provided at the front stage and the patch-based attention mechanism unit 210 is provided at the rear stage. Finally, the entire network is constructed by arranging an output layer (not shown) that outputs some result for the input image in the information processing device 15.
- The feature map output in step S41 is input to the attention mechanism unit 120 or the patch-based attention mechanism unit 210 at the latter stage, where it is converted into a new feature map (step S42).
- Step S42 is repeatedly executed N times, which is the specified number of times (that is, the number of attention mechanism units 120 and patch-based attention mechanism units 210 provided). Then, the information processing device 15 performs the process of step S43.
- In the fifth embodiment, a network is constructed using the attention mechanism unit 120 and the patch-based attention mechanism unit 210. Since the patch-based attention mechanism unit 210 performs feature extraction using a convolution layer with a local kernel (approximately 7×7) as a kernel within a predetermined range, feature extraction focusing on a local region of the data is possible.
- the patch-based attention mechanism unit 210 has the same function as the convolution unit 200 in terms of feature extraction from local regions, but is superior to the convolution unit 200 in terms of accuracy and computational complexity.
- By using the patch-based attention mechanism unit 210 as a substitute for the convolution unit 200, a higher-performance network can be constructed. For these reasons, it is possible to construct a network that enables feature extraction from two perspectives: the entire image and local regions of the image.
- A specific application example of the information processing device 15 is the same as that of the fourth embodiment; it is considered possible to improve various types of recognition performance, such as general object recognition and object detection, in situations where objects of various types and sizes are mixed in an image.
- The extraction unit 111 extracts, from the feature map input to the attention mechanism unit 110, a first feature map relating to a first feature composed of a plurality of first components, a second feature map relating to a second feature composed of a plurality of second components, and a third feature map relating to a third feature.
- the first, second and third features are query, key and value respectively.
- each feature map is a one-dimensional map.
- the determination unit 112 determines a correspondence relationship indicating the components of the corresponding multiple keys for each query component. Specifically, the determining unit 112 shifts a grid pattern indicating multiple key components corresponding to one query component on the key feature map based on the position of each query component. This correspondence is determined so that the components of the key correspond to the components of at least one query. In other words, the correspondence indicates, for each component of the query, the correspondence of the components of the corresponding plurality of keys.
- Here, the grid pattern is a pattern in which the closest key components (reference regions) have the same spacing on the one-dimensional map. Note that the size of the grid is 3 in FIG. 15. In this way, even when the technique of this disclosure is applied to a one-dimensional feature vector, the determination unit 112 can determine the reference positions of the closest keys at regular intervals, as in the case of a two-dimensional feature map.
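A small illustration of the one-dimensional case follows (the map length is assumed; the grid spacing of 3 matches the description of FIG. 15): shifting the grid by the query component's offset from the base position yields the reference positions shown in the comments.

```python
L, B = 9, 3                                  # length of the 1-D feature map (assumed), grid spacing
for q in range(B):                           # query components 0, 1, 2 counted from the base position
    refs = [b * B + q for b in range(L // B)]
    print(q, refs)                           # 0 [0, 3, 6] / 1 [1, 4, 7] / 2 [2, 5, 8]
```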
- the reflecting unit 113 performs a process of reflecting the correlation between the query and the key calculated from the correspondence determined by the determining unit 112 in the value feature map.
- the information processing apparatus 10 can extract features in the input feature map.
- the extraction unit 111 extracts query, key, and value feature maps from the feature maps input to the attention mechanism unit 110 .
- the determination unit 112 refers to the designated grid pattern for a specific component (reference position) of the query. In FIG. 15, grid pattern (1) is specified for component i of the query.
- For a query component shifted from the base position, the determination unit 112 specifies and assigns, as the grid pattern to be referenced, grid pattern (2) or (3), which is obtained by shifting grid pattern (1) by the same shift amount as that of the query component. At this time, the determination unit 112 may randomly change the grid pattern of the key to be referenced with a predetermined probability for a component of the query, as in the case of the two-dimensional feature map.
- As in the third embodiment, the network may be constructed with the attention mechanism units described in this disclosure, or, as in the fourth and fifth embodiments, a network may be constructed by combining the attention mechanism units described in this disclosure with a different feature extraction unit. The correlation between the query and the key is calculated from the correspondence determined by the determination unit 112, and the reflection unit 113 then reflects the correlation in the value feature map.
- In this way, the tasks that can be handled are not limited to images, and the technique can also be applied to one-dimensional data tasks such as speech and natural language processing.
- In the above embodiments, one unit of the grid pattern is a square; however, one unit of the grid pattern may be a rectangle of any shape instead of a square.
- the calculation unit 122 may configure the query subregion with different shapes having the same area, instead of a congruent figure containing a plurality of key components.
- The attention mechanism unit 110 may also be stacked inside the information processing apparatus. That is, in the same manner as the examples described in the third to fifth embodiments, the attention mechanism unit described in this disclosure can be stacked in the information processing device.
- One or more processors of each device in the above-described embodiments execute one or more programs containing instruction groups for causing the computer to execute the algorithms described using each drawing. By this processing, the signal processing method described in each embodiment can be realized.
- Non-transitory computer readable media include various types of tangible storage media.
- Examples of non-transitory computer-readable media include magnetic recording media (e.g., flexible discs, magnetic tapes, hard disk drives), magneto-optical recording media (e.g., magneto-optical discs), CD-ROMs (Read Only Memory), CD-Rs, CD-R/W, semiconductor memory (eg mask ROM, PROM (Programmable ROM), EPROM (Erasable PROM), flash ROM, RAM (Random Access Memory)).
- the program may also be delivered to the computer on various types of transitory computer readable medium. Examples of transitory computer-readable media include electrical signals, optical signals, and electromagnetic waves. Transitory computer-readable media can deliver the program to the computer via wired channels, such as wires and optical fibers, or wireless channels.
- (Appendix 3) The information processing device according to Appendix 2, wherein the determining unit divides the first feature map into a plurality of divided regions, and determines the correspondence by shifting the grid pattern on the second feature map based on the position of each of the first components so that each of the second components corresponds to at least one of the first components in each of the divided regions.
- (Appendix 4) The information processing device according to Appendix 3, wherein the determining unit determines the correspondence by shifting the grid pattern on the second feature map based on the position of each first component so that each second component corresponds to one of the first components in each divided region.
- (Appendix 5) The information processing device according to Appendix 4, wherein the determining unit sets the first components in a one-to-one correspondence between all the divided regions, and determines the correspondence by shifting the grid pattern on the second feature map based on the position of each of the first components so that the grid pattern is placed at the same position on the second feature map for corresponding first components.
- (Appendix 6) The information processing device according to Appendix 5, wherein the determining unit determines the correspondence by shuffling, with a predetermined probability, the positions of the grid pattern on the second feature map determined according to the position of each of the first components.
- (Appendix 7) The information processing device according to any one of Appendices 3 to 6, wherein the determining unit configures each divided region with a congruent figure including a plurality of the first components.
- (Appendix 8) The information processing device according to any one of Appendices 1 to 7, comprising a plurality of attention mechanism units each having the extraction unit, the determination unit, and the reflection unit.
- (Appendix 9) The information processing device according to Appendix 8, comprising a plurality of feature extraction units using kernels within a predetermined range and a plurality of the attention mechanism units.
Abstract
Description
First, an overview of related art will be given. As a first related technique, the non-patent literature X. Wang, R. Girshick, A. Gupta, K. He, "Non-Local Neural Networks", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794-7803, 2018 discloses a technique that improves feature extraction by taking a feature map obtained from a convolutional layer of a convolutional neural network and weighting that feature map with an attention mechanism.
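For orientation only, the following is a minimal sketch of such full (non-local) attention over a flattened feature map; the function name, shapes, and scaling factor are illustrative assumptions rather than code from the cited paper. It makes explicit the O(N^2) attention matrix whose cost the technique described in this disclosure is intended to reduce.

```python
import numpy as np
from scipy.special import softmax

def non_local_attention(x, wq, wk, wv):
    """Full (non-local) attention over a flattened feature map.

    x          : (N, C) feature map, N = H*W spatial positions.
    wq, wk, wv : (C, C) projections producing the query, key and value maps.
    Every query position is weighted against every key position, so the
    attention matrix costs O(N^2) time and memory.
    """
    q, k, v = x @ wq, x @ wk, x @ wv                         # (N, C) each
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]), axis=-1)   # (N, N) weights
    return attn @ v                                          # weighted sum of values
```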
Before describing each example embodiment, the hardware configuration of the information processing apparatus according to each example embodiment will be described with reference to FIG. 2.
First, the first example embodiment will be described with reference to FIGS. 3 and 4.
FIG. 3 is a block diagram showing the functional configuration of the information processing apparatus according to the first example embodiment. As shown in FIG. 3, the information processing apparatus 11 according to the first example embodiment includes an attention mechanism unit 110 as a processing block for realizing its functions. The attention mechanism unit 110 includes an extraction unit 111, a determination unit 112, and a reflection unit 113. Each of the extraction unit 111, the determination unit 112, and the reflection unit 113 may be realized by the processor 101 described above (see FIG. 2). That is, the processor 101 functions as each of the extraction unit 111, the determination unit 112, and the reflection unit 113 by reading and executing a computer program.
Next, the flow of operations of the information processing apparatus 11 according to the first example embodiment will be described with reference to FIG. 4. FIG. 4 is a flowchart showing the flow of operations of the information processing apparatus 11 according to the first example embodiment.
Next, the technical effects obtained by the information processing apparatus 11 according to the first example embodiment will be described. As described above, the determination unit 112 determines, for each first component, a correspondence indicating a plurality of corresponding second components, using a grid pattern that indicates a plurality of second components corresponding to one first component. The reflection unit 113 reflects, in the third feature map, the correlation calculated from the correspondence determined by the determination unit 112. In the calculation based on this correspondence, the information processing apparatus 11 therefore does not need to compute over the entire region of the second feature map for each first component, so the amount of calculation required for the processing can be reduced. In addition, because the grid pattern samples a wide-ranging region of the second feature map rather than only a local region, the information processing apparatus 11 can extract global features from the second feature map.
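As an informal illustration of this effect (not a definitive implementation), the sketch below lets each query position attend only to the G key positions named by its grid pattern, so the cost falls from O(N^2) to O(N·G); the shift rule used here, offsetting the pattern by the query index, is a simplified stand-in for the position-based shift performed by the determination unit 112, and all names are assumptions.

```python
import numpy as np
from scipy.special import softmax

def grid_pattern_attention(q, k, v, base_grid):
    """Grid-pattern attention: each query attends to a small, wide-ranging
    subset of key positions instead of to the whole key map.

    q, k, v   : (N, C) query / key / value feature maps.
    base_grid : (G,) key indices spread over the whole map, with G << N.
    """
    n, c = q.shape
    out = np.empty_like(v)
    for i in range(n):
        idx = (base_grid + i) % n                    # grid shifted for this query
        w = softmax(k[idx] @ q[i] / np.sqrt(c))      # (G,) attention weights
        out[i] = w @ v[idx]                          # correlation reflected onto the values
    return out
```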
Next, the second example embodiment will be described with reference to FIGS. 5 and 6. The second example embodiment describes a concrete application example of the first example embodiment.
FIG. 5 is a block diagram showing the functional configuration of the information processing apparatus according to the second example embodiment. As shown in FIG. 5, the information processing apparatus 12 according to the second example embodiment includes an attention mechanism unit 120 as a processing block for realizing its functions. The attention mechanism unit 120 includes an extraction unit 121, a calculation unit 122, an aggregation unit 123, and an output unit 124. Each of the extraction unit 121, the calculation unit 122, the aggregation unit 123, and the output unit 124 may be realized by the processor 101 described above (see FIG. 2). That is, the processor 101 functions as each of the extraction unit 121, the calculation unit 122, the aggregation unit 123, and the output unit 124 by reading and executing a computer program.
Next, the flow of operations of the information processing apparatus 12 according to the second example embodiment will be described with reference to FIG. 6. FIG. 6 is a flowchart showing the flow of operations of the information processing apparatus according to the second example embodiment.
The manner in which the calculation unit 122 references the key feature map will now be described in more detail. The technique described in this disclosure uses a grid pattern when determining the reference positions in the key that correspond to a specific position i in the query. Specifically, by referencing the key feature map (second feature map) while shifting the grid pattern within a small region (divided region) of the query feature map (first feature map), the calculation unit 122 can reference all features in the key space. In addition, exploiting the property that all components in the key space can be referenced within one small region of the query, the calculation unit 122 references the key feature map while repeatedly shifting the grid pattern within the other small regions of the query as well, so that the entire key space is referenced evenly within each small region of the query.
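The index arithmetic can be illustrated for a one-dimensional feature map as follows; the helper name, the assumption that the map length is a multiple of the region size, and the concrete numbers in the final comment are illustrative only. Queries at the same offset in different divided regions receive the grid pattern at the same position on the key map, matching the behaviour described for corresponding first components across regions.

```python
import numpy as np

def build_grid_indices(n_positions, region_size):
    """Key indices referenced by each query position of a 1-D feature map.

    The query map is split into divided regions of `region_size`. The base
    grid pattern picks every `region_size`-th key position; for a query at
    offset r inside its region the pattern is shifted by r. Over the
    `region_size` queries of one region, every key position is therefore
    referenced exactly once. Assumes n_positions % region_size == 0.
    """
    base = np.arange(0, n_positions, region_size)      # grid pattern, G = N / R points
    indices = np.empty((n_positions, base.size), dtype=int)
    for i in range(n_positions):
        r = i % region_size                            # offset inside the divided region
        indices[i] = base + r                          # shifted grid for this query
    return indices

# e.g. build_grid_indices(12, 3)[:3] -> [0, 3, 6, 9], [1, 4, 7, 10], [2, 5, 8, 11]
```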
Furthermore, the regularization method introduced by the technique described in this disclosure will be described. In the processing so far, the position of the grid pattern corresponding to each query position is fixed. Therefore, if the input image data used during training contains no changes in object pose or positional shifts while the input image data encountered during operation does, the calculation unit 122 may fail to extract features accurately. To prevent this, the calculation unit 122 randomly shuffles (swaps), with a certain probability, the grid patterns of the keys corresponding to the query.
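One possible reading of this shuffling step, expressed on the index table from the previous sketch, is shown below; the pairwise-swap formulation and the way the probability is applied are assumptions for illustration, not the literal procedure of the calculation unit 122.

```python
import numpy as np

def shuffle_grid_positions(indices, prob, rng=None):
    """With probability `prob` per query position, swap its grid pattern with
    that of another randomly chosen position, as a regularizer against object
    pose changes or positional shifts that were absent from the training data.

    indices : (N, G) array as returned by build_grid_indices().
    """
    rng = np.random.default_rng() if rng is None else rng
    indices = indices.copy()
    for i in range(indices.shape[0]):
        if rng.random() < prob:
            j = rng.integers(indices.shape[0])    # randomly chosen partner position
            indices[[i, j]] = indices[[j, i]]     # swap the two grid patterns
    return indices
```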
Next, the detailed flow of operations of the calculation unit 122 will be described with reference to FIG. 9. FIG. 9 is a flowchart showing the detailed flow of operations of the calculation unit 122.
Next, the technical effects obtained by the information processing apparatus 12 according to the second example embodiment will be described.
The third example embodiment will now be described with reference to the drawings. The third example embodiment shows an example in which the attention mechanism units 120 described in the second example embodiment are repeatedly stacked so that the information processing apparatus 13 constructs a single network. The third to fifth example embodiments describe concrete application examples of the attention mechanism unit 120 shown in the second example embodiment. Accordingly, the descriptions of the third to fifth example embodiments cover only those configurations and processes that differ from the second example embodiment; for the configurations and processes that are not described, the same ones as in the second example embodiment may be applied. In the descriptions of the third to fifth example embodiments, components denoted by the same reference signs execute the same processing.
The third example embodiment, which uses the information processing apparatus 13, will be described with reference to FIG. 10. FIG. 10 is a block diagram showing the functional configuration using the information processing apparatus 13. The information processing apparatus 13 includes a convolution unit (feature extraction unit) 200 and a plurality of attention mechanism units 120. By providing, at the top stage of the information processing apparatus 13, the convolution unit 200 used in convolutional neural networks, the information processing apparatus 13 can extract a feature map from an input image. The convolution unit 200 is a unit that performs feature extraction by using a convolutional layer with a local kernel (about 3×3) on the key feature map. After that, the attention mechanism units 120 are repeatedly arranged in the information processing apparatus 13 a designated number of times. Finally, an output layer (not shown) that outputs some result for the input image is arranged in the information processing apparatus 13, thereby constructing the entire network.
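A rough sketch of such a stack in PyTorch is given below; the GridAttentionUnit uses ordinary self-attention as a stand-in for the grid-pattern attention of the attention mechanism unit 120, and every class name, layer size, and unit count is an assumption for illustration. Under these assumptions, `ThirdEmbodimentNet()(torch.randn(1, 3, 32, 32))` would produce a `(1, 10)` output.

```python
import torch
from torch import nn

class GridAttentionUnit(nn.Module):
    """Illustrative stand-in for attention mechanism unit 120: standard
    self-attention over the flattened feature map takes the place of the
    grid-pattern attention described in this disclosure."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads=1, batch_first=True)

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)         # (B, H*W, C)
        out, _ = self.attn(seq, seq, seq)
        return (seq + out).transpose(1, 2).reshape(b, c, h, w)

class ThirdEmbodimentNet(nn.Module):
    """Sketch of the third example embodiment: convolution unit 200 in front,
    N attention mechanism units 120 stacked behind it, then an output layer."""
    def __init__(self, in_ch=3, feat_ch=64, n_units=4, n_classes=10):
        super().__init__()
        self.conv_unit = nn.Sequential(nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU())
        self.attn_units = nn.Sequential(*[GridAttentionUnit(feat_ch) for _ in range(n_units)])
        self.output = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(feat_ch, n_classes))

    def forward(self, img):
        return self.output(self.attn_units(self.conv_unit(img)))
```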
Next, the flow of operations of the information processing apparatus 13 according to the third example embodiment will be described with reference to FIG. 11. FIG. 11 is a flowchart showing the flow of operations of the information processing apparatus 13 according to the third example embodiment.
Next, the technical effects obtained by the information processing apparatus 13 according to the third example embodiment will be described. As explained with reference to FIGS. 10 and 11, in the information processing apparatus 13 according to the third example embodiment, a network is constructed using a plurality of attention mechanism units 120. As described in the first example embodiment, the attention mechanism unit 120 can reference a wide-ranging feature space with a small amount of calculation. The information processing apparatus 13 therefore makes it possible to construct a network specialized in extracting features from the entire image. Specifically, the information processing apparatus 13 is considered to be particularly effective for tasks that require feature extraction from wide-ranging information, for example image recognition tasks that recognize landscapes.
The fourth example embodiment will now be described with reference to the drawings. The fourth example embodiment shows an example in which a network is constructed by repeatedly stacking the attention mechanism unit 120, which embodies the technique described in this disclosure, and the convolution unit (feature extraction unit) 200. As described above, the convolution unit 200 is a unit that performs feature extraction using a convolutional layer with a local kernel (about 3×3).
The fourth example embodiment, which uses the attention mechanism units 120 and the convolution units 200, will be described with reference to FIG. 12. FIG. 12 is a block diagram showing the functional configuration of the information processing apparatus 14 including the attention mechanism units 120 and the convolution units 200. By providing the convolution unit 200X at the very front of the information processing apparatus 14, the information processing apparatus 14 can extract a feature map from the input image. In the subsequent stages, the attention mechanism units 120 and the convolution units 200 are then repeatedly arranged a designated number of times. The order in which the attention mechanism units 120 and the convolution units 200 are arranged, and which of them is arranged consecutively and in what manner, can be decided freely by the designer. In the example of FIG. 12, a plurality of pairs, each consisting of an attention mechanism unit 120 in the front stage and a convolution unit 200b in the subsequent stage, are provided in the information processing apparatus 14. Finally, an output layer (not shown) that outputs some result for the input image is arranged in the information processing apparatus 14, thereby constructing a single network.
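Under the same assumptions as the previous sketch, and reusing its GridAttentionUnit stand-in, one possible arrangement of the attention/convolution pairs is the following; the pair count and channel sizes are illustrative.

```python
from torch import nn

def build_fourth_embodiment(in_ch=3, feat_ch=64, n_pairs=3, n_classes=10):
    """Sketch of the fourth example embodiment: a leading convolution unit
    (200X), then n_pairs of (attention mechanism unit 120 -> convolution
    unit 200b), then an output layer. GridAttentionUnit is the illustrative
    stand-in defined in the previous sketch; as noted above, the ordering of
    the units is a free design choice."""
    layers = [nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU()]            # convolution unit 200X
    for _ in range(n_pairs):
        layers += [GridAttentionUnit(feat_ch),                               # attention mechanism unit 120
                   nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU()]     # convolution unit 200b
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(feat_ch, n_classes)]
    return nn.Sequential(*layers)
```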
Next, the flow of operations of the information processing apparatus 14 according to the fourth example embodiment will be described with reference to FIG. 13. FIG. 13 is a flowchart showing the flow of operations of the information processing apparatus 14 according to the fourth example embodiment.
Next, the technical effects obtained by the information processing apparatus 14 according to the fourth example embodiment will be described. As explained with reference to FIGS. 12 and 13, in the information processing apparatus 14 according to the fourth example embodiment, a network is constructed using the attention mechanism units 120 of the technique described in this disclosure and the convolution units 200. Since the convolution unit 200 performs feature extraction using a convolutional layer with a local kernel (about 3×3) as its kernel within a predetermined range, feature extraction focused on local regions in the data is possible. The information processing apparatus 14 therefore makes it possible to construct a network capable of feature extraction that takes both viewpoints into account: the entire image and local regions of the image. The information processing apparatus 14 can improve various kinds of recognition performance, such as generic object recognition and object detection in situations where objects of various kinds and sizes are mixed in an image.
The fifth example embodiment will now be described with reference to the drawings. In the fifth example embodiment, a network is constructed by repeatedly stacking the attention mechanism unit 120, which embodies the technique described in this disclosure, and a patch-based attention mechanism unit (feature extraction unit) 210. The patch-based attention mechanism unit 210 applies the patch-based attention mechanism described in Non-Patent Literature 2 and, as shown in FIG. 1C, is a unit that performs feature extraction on the key feature map using a convolutional layer over patches of a partial region (about 7×7). The description of the patch-based attention mechanism in Non-Patent Literature 2 is incorporated herein by reference.
The fifth example embodiment, which uses the attention mechanism units 120, the convolution unit 200, and the patch-based attention mechanism units 210, will be described with reference to FIG. 14. FIG. 14 is a block diagram showing the functional configuration of the information processing apparatus 15 including the attention mechanism units 120, the convolution unit 200, and the patch-based attention mechanism units 210. By providing the convolution unit 200 at the very front of the information processing apparatus 15, a feature map can be extracted from the input image. In the subsequent stages, the attention mechanism units 120 and the patch-based attention mechanism units 210 are then repeatedly arranged N times, a designated number. The order in which the attention mechanism units 120 and the patch-based attention mechanism units 210 are arranged, and which of them is arranged consecutively and in what manner, can be decided freely by the designer. In the example of FIG. 14, a plurality of pairs, each consisting of an attention mechanism unit 120 in the front stage and a patch-based attention mechanism unit 210 in the subsequent stage, are provided in the information processing apparatus 15. Finally, an output layer (not shown) that outputs some result for the input image is arranged in the information processing apparatus 15, thereby constructing the entire network.
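Continuing the same illustrative sketches, a self-attention layer restricted to non-overlapping patches can stand in for the patch-based attention mechanism unit 210; the patch size of 7, the reuse of the GridAttentionUnit defined earlier, and the assumption that the feature map height and width are multiples of the patch size are all illustrative choices.

```python
import torch
from torch import nn

class PatchAttentionUnit(nn.Module):
    """Illustrative stand-in for the patch-based attention mechanism unit 210:
    self-attention restricted to non-overlapping p x p patches (about 7x7 in
    the text). Assumes H and W are multiples of p."""
    def __init__(self, channels, p=7):
        super().__init__()
        self.p = p
        self.attn = nn.MultiheadAttention(channels, num_heads=1, batch_first=True)

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, h, w = x.shape
        p = self.p
        patches = (x.reshape(b, c, h // p, p, w // p, p)     # split map into patches
                    .permute(0, 2, 4, 3, 5, 1)
                    .reshape(-1, p * p, c))                  # (B * nPatches, p*p, C)
        out, _ = self.attn(patches, patches, patches)
        out = (patches + out).reshape(b, h // p, w // p, p, p, c)
        return out.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)

def build_fifth_embodiment(in_ch=3, feat_ch=64, n_pairs=3, n_classes=10):
    """Sketch of the fifth example embodiment: convolution unit 200 in front,
    then pairs of (attention unit 120 -> patch-based attention unit 210),
    then an output layer."""
    layers = [nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU()]
    for _ in range(n_pairs):
        layers += [GridAttentionUnit(feat_ch), PatchAttentionUnit(feat_ch)]
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(feat_ch, n_classes)]
    return nn.Sequential(*layers)
```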
Next, the flow of operations of the information processing apparatus 15 according to the fifth example embodiment will be described with reference to FIG. 13. Descriptions of the points that are the same as the operations according to the fourth example embodiment are omitted.
Next, the technical effects obtained by the information processing apparatus 15 according to the fifth example embodiment will be described. As explained with reference to FIGS. 13 and 14, in the information processing apparatus 15 according to the fifth example embodiment, a network is constructed using the attention mechanism units 120 and the patch-based attention mechanism units 210. Since the patch-based attention mechanism unit 210 performs feature extraction using a convolutional layer with a local kernel (about 7×7) as its kernel within a predetermined range, feature extraction focused on local regions in the data is possible. The patch-based attention mechanism unit 210 has the same function as the convolution unit 200 in that it extracts features from local regions, but it is superior to the convolution unit 200 in terms of accuracy and calculation amount. Therefore, using the patch-based attention mechanism unit 210 as a substitute for the convolution unit 200 makes it possible to construct a higher-performance network. For these reasons, a network capable of feature extraction that takes both viewpoints into account, the entire image and local regions of the image, can be constructed. Specific applications of the information processing apparatus 15 are the same as in the fourth example embodiment; it is considered possible to improve various kinds of recognition performance, such as generic object recognition and object detection in situations where objects of various kinds and sizes are mixed in an image.
The sixth example embodiment will now be described with reference to the drawings. In the example embodiments so far, the operation of the information processing apparatus has been described using image-related tasks with two-dimensional feature maps as examples. However, the technique of this disclosure can also be applied when the input data is not two-dimensional data such as images but one-dimensional data such as audio or natural language.
The information processing apparatus 16 for the case of using one-dimensional features will be described with reference to FIG. 15. The outline of the functional configuration of this information processing apparatus is as shown in FIG. 3; the points that differ from the first example embodiment are described below.
First, the extraction unit 111 extracts the query, key, and value feature maps from the feature map input to the attention mechanism unit 110. The determination unit 112 references the designated grid pattern for a specific component (reference position) of the query. In FIG. 15, grid pattern (1) is designated for component i of the query.
In the sixth example embodiment, the tasks that can be handled are not limited to images; the technique can also be applied to tasks on one-dimensional data such as audio and natural language processing.
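To illustrate the one-dimensional case, the grid-pattern attention sketched earlier carries over unchanged to a feature sequence; the function below reuses build_grid_indices() from the divided-region sketch, and all names and shapes remain illustrative assumptions.

```python
import numpy as np
from scipy.special import softmax

def grid_attention_1d(q, k, v, region_size):
    """Grid-pattern attention on a 1-D feature sequence (e.g. audio frames or
    token embeddings). q, k, v are (N, C) query / key / value features, as
    extracted by the extraction unit; build_grid_indices() is the helper from
    the earlier divided-region sketch."""
    n, c = q.shape
    indices = build_grid_indices(n, region_size)     # (N, G) key positions per query
    out = np.empty_like(v)
    for i in range(n):
        idx = indices[i]
        w = softmax(k[idx] @ q[i] / np.sqrt(c))      # (G,) attention weights
        out[i] = w @ v[idx]                          # grid-restricted weighted sum of values
    return out
```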
(Appendix 1)
An information processing apparatus comprising:
an extraction unit that extracts, from a feature map, a first feature map relating to a first feature composed of a plurality of first components, a second feature map relating to a second feature composed of a plurality of second components, and a third feature map relating to a third feature;
a determination unit that determines, for each of the first components, a correspondence indicating a plurality of corresponding second components, by shifting a grid pattern indicating a plurality of the second components corresponding to one of the first components on the second feature map based on the position of each of the first components; and
a reflection unit that reflects, in the third feature map, a correlation between the first feature and the second feature calculated from the correspondence.
(Appendix 2)
The information processing apparatus according to Appendix 1, wherein the determination unit determines the correspondence by shifting the grid pattern on the second feature map based on the position of each of the first components such that each of the second components corresponds to at least one of the first components.
(Appendix 3)
The information processing apparatus according to Appendix 2, wherein the determination unit divides the first feature map into a plurality of divided regions, and determines the correspondence by shifting the grid pattern on the second feature map based on the position of each of the first components such that each of the second components corresponds to at least one of the first components in each of the divided regions.
(Appendix 4)
The information processing apparatus according to Appendix 3, wherein the determination unit determines the correspondence by shifting the grid pattern on the second feature map based on the position of each of the first components such that each of the second components corresponds to one of the first components in each of the divided regions.
(Appendix 5)
The information processing apparatus according to Appendix 4, wherein the determination unit sets the first components in one-to-one correspondence among all the divided regions, and determines the correspondence by shifting the grid pattern on the second feature map based on the position of each of the first components such that, for corresponding first components, the grid pattern is placed at the same position on the second feature map.
(Appendix 6)
The information processing apparatus according to Appendix 5, wherein the determination unit determines the correspondence by shuffling, with a predetermined probability, the position on the second feature map of the grid pattern determined according to the position of each of the first components.
(Appendix 7)
The information processing apparatus according to any one of Appendices 3 to 6, wherein the determination unit configures each of the divided regions as a congruent figure including a plurality of the first components.
(Appendix 8)
The information processing apparatus according to any one of Appendices 1 to 7, comprising a plurality of attention mechanism units each having the extraction unit, the determination unit, and the reflection unit.
(Appendix 9)
The information processing apparatus according to Appendix 8, comprising a plurality of feature extraction units using kernels within a predetermined range and a plurality of the attention mechanism units.
(Appendix 10)
An information processing method executed by an information processing apparatus, the method comprising:
an extraction step of extracting, from a feature map, a first feature map relating to a first feature composed of a plurality of first components, a second feature map relating to a second feature composed of a plurality of second components, and a third feature map relating to a third feature;
a determination step of determining, for each of the first components, a correspondence indicating a plurality of corresponding second components, by shifting a grid pattern indicating a plurality of the second components corresponding to one of the first components on the second feature map based on the position of each of the first components; and
a reflection step of reflecting, in the third feature map, a correlation between the first feature and the second feature calculated from the correspondence.
(Appendix 11)
A program that causes an information processing apparatus to execute:
an extraction step of extracting, from a feature map, a first feature map relating to a first feature composed of a plurality of first components, a second feature map relating to a second feature composed of a plurality of second components, and a third feature map relating to a third feature;
a determination step of determining, for each of the first components, a correspondence indicating a plurality of corresponding second components, by shifting a grid pattern indicating a plurality of the second components corresponding to one of the first components on the second feature map based on the position of each of the first components; and
a reflection step of reflecting, in the third feature map, a correlation between the first feature and the second feature calculated from the correspondence.
101 Processor
102 RAM
103 ROM
104 Storage device
105 Input device
106 Output device
107 Data bus
110 Attention mechanism unit
111 Extraction unit
112 Determination unit
113 Reflection unit
120 Attention mechanism unit
121 Extraction unit
122 Calculation unit
123 Aggregation unit
124 Output unit
200 Convolution unit
210 Patch-based attention mechanism unit
Claims (11)
- An information processing apparatus comprising:
 extraction means for extracting, from a feature map, a first feature map relating to a first feature composed of a plurality of first components, a second feature map relating to a second feature composed of a plurality of second components, and a third feature map relating to a third feature;
 determination means for determining, for each of the first components, a correspondence indicating a plurality of corresponding second components, by shifting a grid pattern indicating a plurality of the second components corresponding to one of the first components on the second feature map based on the position of each of the first components; and
 reflection means for reflecting, in the third feature map, a correlation between the first feature and the second feature calculated from the correspondence.
- The information processing apparatus according to claim 1, wherein the determination means determines the correspondence by shifting the grid pattern on the second feature map based on the position of each of the first components such that each of the second components corresponds to at least one of the first components.
- The information processing apparatus according to claim 2, wherein the determination means divides the first feature map into a plurality of divided regions, and determines the correspondence by shifting the grid pattern on the second feature map based on the position of each of the first components such that each of the second components corresponds to at least one of the first components in each of the divided regions.
- The information processing apparatus according to claim 3, wherein the determination means determines the correspondence by shifting the grid pattern on the second feature map based on the position of each of the first components such that each of the second components corresponds to one of the first components in each of the divided regions.
- The information processing apparatus according to claim 4, wherein the determination means sets the first components in one-to-one correspondence among all the divided regions, and determines the correspondence by shifting the grid pattern on the second feature map based on the position of each of the first components such that, for corresponding first components, the grid pattern is placed at the same position on the second feature map.
- The information processing apparatus according to claim 5, wherein the determination means determines the correspondence by shuffling, with a predetermined probability, the position on the second feature map of the grid pattern determined according to the position of each of the first components.
- The information processing apparatus according to any one of claims 3 to 6, wherein the determination means configures each of the divided regions as a congruent figure including a plurality of the first components.
- The information processing apparatus according to any one of claims 1 to 7, comprising a plurality of attention mechanism units each having the extraction means, the determination means, and the reflection means.
- The information processing apparatus according to claim 8, comprising a plurality of feature extraction units using kernels within a predetermined range and a plurality of the attention mechanism units.
- An information processing method executed by an information processing apparatus, the method comprising:
 extracting, from a feature map, a first feature map relating to a first feature composed of a plurality of first components, a second feature map relating to a second feature composed of a plurality of second components, and a third feature map relating to a third feature;
 determining, for each of the first components, a correspondence indicating a plurality of corresponding second components, by shifting a grid pattern indicating a plurality of the second components corresponding to one of the first components on the second feature map based on the position of each of the first components; and
 reflecting, in the third feature map, a correlation between the first feature and the second feature calculated from the correspondence.
- A non-transitory computer-readable medium storing a program that causes an information processing apparatus to execute:
 extracting, from a feature map, a first feature map relating to a first feature composed of a plurality of first components, a second feature map relating to a second feature composed of a plurality of second components, and a third feature map relating to a third feature;
 determining, for each of the first components, a correspondence indicating a plurality of corresponding second components, by shifting a grid pattern indicating a plurality of the second components corresponding to one of the first components on the second feature map based on the position of each of the first components; and
 reflecting, in the third feature map, a correlation between the first feature and the second feature calculated from the correspondence.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2023506783A JP7525051B2 (ja) | 2021-03-15 | 2022-01-13 | 情報処理装置、情報処理方法及びプログラム |
US18/271,649 US20240320957A1 (en) | 2021-03-15 | 2022-01-13 | Information processing apparatus, information processing method, and non-transitory computer-readable medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2021-041852 | 2021-03-15 | ||
JP2021041852 | 2021-03-15 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022196060A1 true WO2022196060A1 (ja) | 2022-09-22 |
Family
ID=83320214
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2022/000995 WO2022196060A1 (ja) | 2021-03-15 | 2022-01-13 | 情報処理装置、情報処理方法及び非一時的なコンピュータ可読媒体 |
Country Status (3)
Country | Link |
---|---|
US (1) | US20240320957A1 (ja) |
JP (1) | JP7525051B2 (ja) |
WO (1) | WO2022196060A1 (ja) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024184973A1 (ja) * | 2023-03-03 | 2024-09-12 | 日本電気株式会社 | 動画処理装置、動画処理方法、及び、記録媒体 |
-
2022
- 2022-01-13 WO PCT/JP2022/000995 patent/WO2022196060A1/ja active Application Filing
- 2022-01-13 JP JP2023506783A patent/JP7525051B2/ja active Active
- 2022-01-13 US US18/271,649 patent/US20240320957A1/en active Pending
Non-Patent Citations (3)
Title |
---|
SALMAN KHAN; MUZAMMAL NASEER; MUNAWAR HAYAT; SYED WAQAS ZAMIR; FAHAD SHAHBAZ KHAN; MUBARAK SHAH: "Transformers in Vision: A Survey", arXiv, 22 February 2021, XP081883407 *
SANG HAIWEI; ZHOU QIUHAO; ZHAO YONG: "PCANet: Pyramid convolutional attention network for semantic segmentation", Image and Vision Computing, vol. 103, 7 August 2020, ISSN 0262-8856, DOI 10.1016/j.imavis.2020.103997, XP086323926 *
SOUVIK KUNDU; HESHAM MOSTAFA; SHARATH NITTUR SRIDHAR; SAIRAM SUNDARESAN: "Attention-based Image Upsampling", arXiv, 17 December 2020, XP081841808 *
Also Published As
Publication number | Publication date |
---|---|
JPWO2022196060A1 (ja) | 2022-09-22 |
US20240320957A1 (en) | 2024-09-26 |
JP7525051B2 (ja) | 2024-07-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11663691B2 (en) | Method and apparatus for restoring image | |
US11170469B2 (en) | Image transformation for machine learning | |
US10832034B2 (en) | Facial image generating method, facial image generating apparatus, and facial image generating device | |
CN110574025A (zh) | 用于合并交错通道数据的卷积引擎 | |
US20200057921A1 (en) | Image classification and conversion method and device, image processor and training method therefor, and medium | |
US11967043B2 (en) | Gaming super resolution | |
KR102442055B1 (ko) | 전자 장치 및 그 제어 방법 | |
US20220253642A1 (en) | Burst image-based image restoration method and apparatus | |
CN108875544B (zh) | 人脸识别方法、装置、系统和存储介质 | |
KR102239588B1 (ko) | 이미지 처리 방법 및 장치 | |
CN110991627A (zh) | 信息处理装置、信息处理方法 | |
WO2022196060A1 (ja) | 情報処理装置、情報処理方法及び非一時的なコンピュータ可読媒体 | |
CN112991254A (zh) | 视差估计系统、方法、电子设备及计算机可读存储介质 | |
KR20150099964A (ko) | 이미지 특징 추출 장치 및 방법 | |
JP6121302B2 (ja) | 姿勢パラメータ推定装置、姿勢パラメータ推定システム、姿勢パラメータ推定方法、およびプログラム | |
JP2017068577A (ja) | 演算装置、方法及びプログラム | |
CN112749576B (zh) | 图像识别方法和装置、计算设备以及计算机存储介质 | |
KR102482472B1 (ko) | 기계학습 기반의 꼭짓점 추출을 통해 기울어진 차량 번호판 이미지를 직사각형화시킬 수 있는 전자 장치 및 그 동작 방법 | |
CN113313742A (zh) | 图像深度估计方法、装置、电子设备及计算机存储介质 | |
CN112766348A (zh) | 一种基于对抗神经网络生成样本数据的方法以及装置 | |
JP2021144428A (ja) | データ処理装置、データ処理方法 | |
KR20180075220A (ko) | 멀티미디어 신호의 프로세싱 방법, 장치 및 시스템 | |
JP2017126264A (ja) | 情報処理装置、情報処理方法およびプログラム | |
JP2016519343A (ja) | 他の画像からの情報の関数に基づく汎関数を利用する目的画像の生成 | |
CN111242299A (zh) | 基于ds结构的cnn模型压缩方法、装置及存储介质 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22770822; Country of ref document: EP; Kind code of ref document: A1
WWE | Wipo information: entry into national phase | Ref document number: 18271649; Country of ref document: US
WWE | Wipo information: entry into national phase | Ref document number: 2023506783; Country of ref document: JP
NENP | Non-entry into the national phase | Ref country code: DE
122 | Ep: pct application non-entry in european phase | Ref document number: 22770822; Country of ref document: EP; Kind code of ref document: A1