WO2022137476A1 - Object detection device, monitoring device, learning device, and model generation method
- Publication number
- WO2022137476A1 (PCT/JP2020/048617)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- feature map
- unit
- feature
- object detection
- feature amount
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06V20/58—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01J—MEASUREMENT OF INTENSITY, VELOCITY, SPECTRAL CONTENT, POLARISATION, PHASE OR PULSE CHARACTERISTICS OF INFRARED, VISIBLE OR ULTRAVIOLET LIGHT; COLORIMETRY; RADIATION PYROMETRY
- G01J5/00—Radiation pyrometry, e.g. infrared or optical thermometry
- G01J5/0022—Radiation pyrometry, e.g. infrared or optical thermometry for sensing the radiation of moving bodies
- G01J5/0025—Living bodies
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/62—Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/7715—Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30248—Vehicle exterior or interior
- G06T2207/30252—Vehicle exterior; Vicinity of vehicle
- G06T2207/30261—Obstacle
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/08—Detecting or categorising vehicles
Definitions
- This disclosure relates to an object detection device, a monitoring device, a learning device, and a model generation method.
- Non-Patent Document 1 discloses SSD (Single Shot MultiBox Detector).
- The present disclosure has been made to solve the above-mentioned problem, and its object is to realize the detection of small objects.
- The object detection device includes: an image data acquisition unit that acquires image data showing an image captured by a camera; a first feature amount extraction unit that generates a first feature map using the image data; a second feature amount extraction unit that generates a second feature map using the image data and generates a third feature map by weighting the second feature map through addition or multiplication using the first feature map; and an object detection unit that detects an object in the captured image using the third feature map.
- The first feature amount in the first feature map uses a mid-level feature corresponding to object-likeness, and the second feature amount in the second feature map uses a high-level feature.
- FIG. Block diagram showing the main part of the first feature amount extraction unit, the second feature amount extraction unit, and the object detection unit in the object detection device according to Embodiment 1.
- FIG. Explanatory diagram showing an example of the classes classified by the object detection unit in the object detection device according to Embodiment 1.
- FIG. Block diagram showing another hardware configuration of the main part of the object detection device according to Embodiment 1.
- FIG. Block diagram showing the hardware configuration of the main part of the learning device according to Embodiment 1.
- FIG. Block diagram showing another hardware configuration of the main part of the learning device according to Embodiment 1.
- FIG. Flowchart showing the operation of the object detection device according to Embodiment 1.
- FIG. Flowchart showing the operation of the learning device according to Embodiment 1.
- FIG. Explanatory diagram showing the structure of the first neural network.
- FIG. Explanatory diagram showing the structure of each saliency block layer.
- FIG. Explanatory diagram showing an example of detection results obtained by the object detection device according to Embodiment 1.
- FIG. Explanatory diagram showing an example of the detection accuracy of a comparative object detection device and an example of the detection accuracy of the object detection device according to Embodiment 1.
- FIG. Explanatory diagram showing another example of the detection accuracy of the comparative object detection device and another example of the detection accuracy of the object detection device according to Embodiment 1.
- FIG. Explanatory diagram showing another example of the detection accuracy of the comparative object detection device and another example of the detection accuracy of the object detection device according to Embodiment 1.
- FIG. Explanatory diagram showing another example of the detection accuracy of the comparative object detection device and another example of the detection accuracy of the object detection device according to Embodiment 1.
- FIG. Explanatory diagram showing an example of a thermal map image as the first feature map generated by the first feature map generation unit using the temperature image corresponding to each captured image.
- FIG. Block diagram showing the main part of the object detection system including the object detection device according to Embodiment 2.
- FIG. Block diagram showing the main part of the learning system including the learning device according to Embodiment 2.
- FIG. Flowchart showing the operation of the object detection device according to Embodiment 2.
- FIG. Block diagram showing the main part of the object detection system including the object detection device according to Embodiment 3.
- FIG. Block diagram showing the main part of the learning system including the learning device according to Embodiment 3.
- FIG. Flowchart showing the operation of the object detection device according to Embodiment 3.
- FIG. Block diagram showing the main part of the monitoring system including the monitoring device according to Embodiment 4.
- FIG. Block diagram showing the main part of the analysis unit and the output control unit in the monitoring device according to Embodiment 4.
- FIG. Explanatory diagram showing an example of the risk map image.
- FIG. Block diagram showing the hardware configuration of the main part of the monitoring device according to Embodiment 4.
- FIG. Block diagram showing another hardware configuration of the main part of the monitoring device according to Embodiment 4.
- FIG. Flowchart showing the operation of the monitoring device according to Embodiment 4.
- FIG. Block diagram showing the main part of the monitoring system including another monitoring device according to Embodiment 4.
- FIG. Block diagram showing the main part of the monitoring system including another monitoring device according to Embodiment 4.
- FIG. Block diagram showing
- FIG. 1 is a block diagram showing a main part of an object detection system including the object detection device according to the first embodiment.
- FIG. 2 is a block diagram showing a main part of a first feature amount extraction unit, a second feature amount extraction unit, and an object detection unit in the object detection device according to the first embodiment.
- An object detection system including the object detection device according to the first embodiment will be described with reference to FIGS. 1 and 2.
- the object detection system 100 includes a camera 1, a storage device 2, and an object detection device 200.
- the storage device 2 has a feature map storage unit 11.
- the object detection device 200 includes an image data acquisition unit 21, a first feature amount extraction unit 22, a second feature amount extraction unit 23, and an object detection unit 24.
- the camera 1 is composed of, for example, a surveillance camera, a security camera, or a camera for an electronic mirror. That is, the camera 1 is composed of a camera for capturing a moving image.
- the storage device 2 is composed of a memory.
- When the camera 1 is a camera for an electronic mirror, the camera 1, the storage device 2, and the object detection device 200 are provided in a vehicle (not shown). Hereinafter, such a vehicle may be referred to as the "own vehicle".
- the first feature amount extraction unit 22 has a first feature map generation unit 31.
- the second feature amount extraction unit 23 is configured by the first neural network NN1.
- the first neural network NN1 has a second feature map generation unit 32 and a third feature map generation unit 33.
- the object detection unit 24 is configured by the second neural network NN2.
- the second neural network NN2 has a position estimation unit 34 and a type estimation unit 35.
- the image data acquisition unit 21 acquires image data indicating an image captured by the camera 1. That is, the image data acquisition unit 21 acquires image data showing individual still images (hereinafter, may be referred to as “captured images”) constituting the moving image captured by the camera 1.
- The first feature map generation unit 31 uses the image data acquired by the image data acquisition unit 21 to generate one feature map (hereinafter referred to as the "first feature map") FM1 corresponding to each captured image.
- The first feature map FM1 is composed of a plurality of feature amounts (hereinafter referred to as "first feature amounts") arranged two-dimensionally. Each first feature amount uses a mid-level feature corresponding to object-likeness (objectness).
- The "mid level" in the mid-level feature is equivalent to a level based on a human visual model. That is, this "mid level" is lower than the level of the features used in conventional object detection.
- Specifically, for example, each first feature amount uses saliency.
- The first feature map generation unit 31 generates a saliency map by executing saliency estimation. For example, the first feature map generation unit 31 generates the saliency map by the same method as that described in Reference 1 below, that is, by the same generation method as that of the image feature map generation unit in the object detection device described in Reference 1.
- The saliency map is generated directly from the image data acquired by the image data acquisition unit 21, without going through any other feature map. The saliency map is also generated without using a CNN (convolutional neural network).
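The text does not reproduce the method of Reference 1, so the following is only an illustrative stand-in, not the patent's estimator: a minimal center-surround contrast computed directly on a grayscale image with NumPy, without any CNN. The function names `box_blur` and `center_surround_saliency` and the `radius` parameter are hypothetical.

```python
import numpy as np

def box_blur(img: np.ndarray, radius: int) -> np.ndarray:
    """Mean filter with a (2*radius+1)^2 window and edge padding."""
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for di in range(k):
        for dj in range(k):
            out += padded[di:di + img.shape[0], dj:dj + img.shape[1]]
    return out / (k * k)

def center_surround_saliency(gray: np.ndarray, radius: int = 2) -> np.ndarray:
    """Stand-in saliency: |pixel - local mean|, scaled to [0, 1]."""
    contrast = np.abs(gray - box_blur(gray, radius))
    peak = contrast.max()
    return contrast / peak if peak > 0 else contrast

# A single bright pixel on a dark background is the most salient location.
img = np.zeros((9, 9))
img[4, 4] = 1.0
sal = center_surround_saliency(img)
```

The result has the same height and width as the input, which is what lets it act as a single two-dimensional first feature map.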
- The fourth feature map generation unit 36 generates, from the first feature map FM1 generated by the first feature map generation unit 31, a plurality of feature maps (hereinafter referred to as "fourth feature maps") FM4 corresponding to the first feature map FM1. Specifically, the fourth feature map generation unit 36 performs convolution to generate the plurality of fourth feature maps FM4. Each fourth feature map FM4 is composed of a plurality of feature amounts (hereinafter referred to as "fourth feature amounts") arranged two-dimensionally. Each fourth feature amount uses a mid-level feature.
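As a rough sketch of how convolution can turn one first feature map into several fourth feature maps, the example below applies a bank of kernels to a single-channel map. The kernel count, the 3x3 kernel size, the zero padding, and the random kernel values are assumptions made for illustration; the source only states that convolution is performed.

```python
import numpy as np

def conv2d_same(fm: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive 'same'-size 2-D cross-correlation with zero padding (odd kernel sizes)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(fm, ((ph, ph), (pw, pw)))
    out = np.zeros(fm.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + fm.shape[0], j:j + fm.shape[1]]
    return out

def make_fourth_feature_maps(fm1: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """One output map per kernel: (H, W) -> (N, H, W)."""
    return np.stack([conv2d_same(fm1, k) for k in kernels])

rng = np.random.default_rng(0)
fm1 = rng.random((8, 8))                  # stand-in for FM1
kernels = rng.standard_normal((4, 3, 3))  # hypothetical kernel bank
fm4 = make_fourth_feature_maps(fm1, kernels)
print(fm4.shape)  # (4, 8, 8)
```

Choosing as many kernels as there are second feature maps gives a one-to-one pairing between FM4 and FM2 maps.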
- The first feature map generation unit 31 and the fourth feature map generation unit 36 can be trained by unsupervised learning. That is, the first feature amount extraction unit 22 can be trained by unsupervised learning.
- Various known techniques can be used for such unsupervised learning. Detailed description of these techniques will be omitted.
- The second feature map generation unit 32 uses the image data acquired by the image data acquisition unit 21 to generate a plurality of feature maps (hereinafter referred to as "second feature maps") FM2 corresponding to each captured image.
- Each second feature map FM2 is composed of a plurality of feature amounts (hereinafter referred to as "second feature amounts") arranged two-dimensionally.
- Each second feature amount uses a high-level feature.
- The "high level" in the high-level feature is equivalent to the level of the features used in conventional object detection. That is, this "high level" is higher than a level based on a human visual model.
- The part of the first neural network NN1 corresponding to the second feature map generation unit 32 is composed of a CNN.
- The CNN sequentially generates the plurality of second feature maps FM2.
- The third feature map generation unit 33 weights the second feature maps FM2 by addition or multiplication using the first feature map FM1, thereby generating a plurality of feature maps (hereinafter referred to as "third feature maps") FM3 based on the plurality of second feature maps FM2.
- <Generation method by addition (1)> The third feature map generation unit 33 adds each first feature amount in the first feature map FM1 to the corresponding second feature amount in each second feature map FM2. Specifically, the third feature map generation unit 33 first duplicates the single first feature map FM1 as many times as there are second feature maps FM2. It then associates one duplicated first feature map FM1 with each second feature map FM2 and adds them layer by layer, pixel by pixel. That is, the third feature map generation unit 33 spatially adds the first feature map FM1 and the second feature maps FM2. In this way, the third feature map generation unit 33 weights the second feature maps FM2 using the first feature map FM1; that is, it weights the corresponding second feature amount in each second feature map FM2.
- <Generation method by multiplication (1)> The third feature map generation unit 33 multiplies each first feature amount in the first feature map FM1 by the corresponding second feature amount in each second feature map FM2. Specifically, the third feature map generation unit 33 first duplicates the single first feature map FM1 as many times as there are second feature maps FM2. It then associates one duplicated first feature map FM1 with each second feature map FM2 and multiplies them layer by layer, pixel by pixel. That is, the third feature map generation unit 33 spatially multiplies the first feature map FM1 and the second feature maps FM2. In this way, the third feature map generation unit 33 weights the second feature maps FM2 using the first feature map FM1; that is, it weights the corresponding second feature amount in each second feature map FM2.
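The duplicate-then-combine step described for spatial addition and spatial multiplication can be sketched with NumPy broadcasting, which replicates the single map across all channels without an explicit copy. The array layout (channels, height, width) and the function names are assumptions for illustration only.

```python
import numpy as np

def weight_by_addition(fm1: np.ndarray, fm2: np.ndarray) -> np.ndarray:
    """fm1: (H, W), fm2: (C, H, W). Adds FM1 to every FM2 channel, pixel by pixel."""
    return fm2 + fm1[np.newaxis, :, :]

def weight_by_multiplication(fm1: np.ndarray, fm2: np.ndarray) -> np.ndarray:
    """Multiplies every FM2 channel by FM1, pixel by pixel."""
    return fm2 * fm1[np.newaxis, :, :]

fm1 = np.array([[0.0, 1.0],
                [0.5, 0.0]])     # stand-in saliency values
fm2 = np.ones((3, 2, 2))         # three stand-in second feature maps
fm3_add = weight_by_addition(fm1, fm2)
fm3_mul = weight_by_multiplication(fm1, fm2)
```

With multiplication, locations where the first feature amount is zero are suppressed entirely, whereas addition only shifts them; which behavior is preferable depends on the data and is not settled by this sketch.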
- <Generation method by addition (2)> It is assumed that the fourth feature map generation unit 36 of the first feature amount extraction unit 22 generates, from the first feature map FM1 generated by the first feature map generation unit 31, a plurality of fourth feature maps FM4 corresponding to the first feature map.
- The third feature map generation unit 33 adds each fourth feature amount in each fourth feature map FM4 to the corresponding second feature amount in the second feature map FM2 corresponding to that fourth feature map.
- Specifically, the third feature map generation unit 33 associates each fourth feature map FM4 with a second feature map FM2 and adds them layer by layer, pixel by pixel. That is, the third feature map generation unit 33 spatially adds the fourth feature maps FM4 and the second feature maps FM2.
- In this way, the third feature map generation unit 33 weights the second feature maps FM2 using the first feature map FM1, more specifically, using the fourth feature maps FM4 generated from the first feature map FM1. That is, the third feature map generation unit 33 weights the corresponding second feature amount in each second feature map FM2.
- <Generation method by multiplication (2)> It is assumed that the fourth feature map generation unit 36 of the first feature amount extraction unit 22 generates, from the first feature map FM1 generated by the first feature map generation unit 31, a plurality of fourth feature maps FM4 corresponding to the first feature map. For example, the third feature map generation unit 33 multiplies each fourth feature amount in each fourth feature map FM4 by the corresponding second feature amount in each second feature map FM2. Specifically, the third feature map generation unit 33 associates each fourth feature map FM4 with a second feature map FM2 and multiplies them layer by layer, pixel by pixel.
- That is, the third feature map generation unit 33 spatially multiplies the fourth feature maps FM4 and the second feature maps FM2.
- In this way, the third feature map generation unit 33 weights the second feature maps FM2 using the first feature map FM1, more specifically, using the fourth feature maps FM4 generated from the first feature map FM1. That is, the third feature map generation unit 33 weights the corresponding second feature amount in each second feature map FM2.
- <Generation method by addition (3)> The third feature map generation unit 33 adds the first feature map FM1 in the dimensional direction of the plurality of second feature maps FM2, in other words, in the channel direction.
- That is, the third feature map generation unit 33 concatenates the first feature map FM1 with the plurality of second feature maps FM2 in the dimensional direction.
- Specifically, the third feature map generation unit 33 duplicates the single first feature map FM1, for example, as many times as there are second feature maps FM2.
- The third feature map generation unit 33 then adds the duplicated first feature maps FM1 in the dimensional direction of the plurality of second feature maps FM2.
- In this way, the third feature map generation unit 33 weights the second feature maps FM2 using the first feature map FM1. That is, the third feature map generation unit 33 weights the second feature maps FM2 by increasing their number of dimensions.
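Addition in the dimensional (channel) direction amounts to concatenation, which the following minimal sketch illustrates. The (channels, height, width) layout, the function name, and the `copies` parameter are assumptions; the text gives duplicating FM1 once per second feature map only as an example.

```python
import numpy as np

def concat_channelwise(fm1: np.ndarray, fm2: np.ndarray, copies: int = 1) -> np.ndarray:
    """fm1: (H, W), fm2: (C, H, W). Appends `copies` duplicates of FM1 as extra channels."""
    fm1_stack = np.repeat(fm1[np.newaxis, :, :], copies, axis=0)
    return np.concatenate([fm2, fm1_stack], axis=0)

fm1 = np.full((4, 4), 0.5)    # stand-in first feature map
fm2 = np.zeros((3, 4, 4))     # three stand-in second feature maps
fm3 = concat_channelwise(fm1, fm2, copies=fm2.shape[0])
print(fm3.shape)  # (6, 4, 4)
```

Unlike the spatial methods, the second feature amounts are left unchanged here; the first feature amounts are carried alongside them as additional channels.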
- In <generation method by addition (1)>, <generation method by multiplication (1)>, <generation method by addition (2)>, and <generation method by multiplication (2)> described above, when performing the weighting, the third feature map generation unit 33 may set a value indicating the weight given to each second feature amount (hereinafter referred to as "importance") W based on at least one of structural similarity (SSIM) and correlation similarity between images.
- For example, the third feature map generation unit 33 sets the importance W to a larger value as the SSIM index becomes larger. Likewise, for example, the third feature map generation unit 33 sets the importance W to a larger value as the correlation similarity index becomes larger.
- By weighting with the importance W set in this way, the accuracy of object detection in the captured image using the third feature map FM3 can be improved.
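The text only states that the importance W grows with the SSIM index; it does not give a formula. The sketch below is therefore one possible reading under explicit assumptions: a single global (non-windowed) SSIM per channel with the conventional constants K1 = 0.01 and K2 = 0.03, a hypothetical linear mapping from SSIM in [-1, 1] to W in [0, 1], and W applied as a per-channel scale on FM1 before spatial addition.

```python
import numpy as np

def global_ssim(x: np.ndarray, y: np.ndarray, data_range: float = 1.0) -> float:
    """SSIM computed once over the whole map (no sliding window)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float(((2 * mx * my + c1) * (2 * cov + c2))
                 / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

def importance_from_ssim(fm1: np.ndarray, fm2: np.ndarray) -> np.ndarray:
    """Hypothetical mapping: W increases monotonically with SSIM, W in [0, 1]."""
    ssim = np.array([global_ssim(fm1, channel) for channel in fm2])
    return (ssim + 1.0) / 2.0

def weighted_addition(fm1: np.ndarray, fm2: np.ndarray) -> np.ndarray:
    """Assumed use of W: scale FM1 per channel, then add spatially."""
    w = importance_from_ssim(fm1, fm2)
    return fm2 + w[:, np.newaxis, np.newaxis] * fm1[np.newaxis, :, :]

rng = np.random.default_rng(0)
fm1 = rng.random((8, 8))
fm2 = np.stack([fm1, 1.0 - fm1])   # one similar channel, one inverted channel
w = importance_from_ssim(fm1, fm2)
fm3 = weighted_addition(fm1, fm2)
```

In this toy input the channel identical to FM1 receives a larger W than the inverted channel, which matches the stated monotonic relationship.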
- the object detection unit 24 detects an object in the captured image using the third feature map FM3.
- the first feature amount extraction unit 22 can be configured not to include the fourth feature map generation unit 36.
- each second feature amount is reinforced according to the corresponding object-likeness. That is, the second feature amount corresponding to the higher object-likeness is relatively stronger than the second feature amount corresponding to the lower object-likeness. On the other hand, the second feature amount corresponding to the lower object-likeness is relatively weakened as compared with the second feature amount corresponding to the higher object-likeness.
- Each third feature map FM3 is based on a plurality of such reinforced feature quantities (hereinafter, may be referred to as "third feature quantity").
- When <generation method by addition (3)> is used, each third feature map FM3 is reinforced with a plurality of feature amounts (first feature amounts) in the dimensional direction while the spatial independence of the individual second feature amounts of the second feature maps FM2 is maintained. Each third feature map FM3 is thus based on the individual second feature amounts and the individual first feature amounts.
- The individual second feature amounts and first feature amounts constituting each third feature map FM3 generated by <generation method by addition (3)> may hereinafter also be referred to as "third feature amounts".
- The first neural network NN1 can be trained by supervised learning. That is, the second feature amount extraction unit 23 can be trained by supervised learning.
- The first neural network NN1 includes a CNN. That is, the second feature amount extraction unit 23 includes a CNN. Therefore, the second feature amount extraction unit 23 can be trained by deep learning.
- the structure of the first neural network NN1 will be described later with reference to FIGS. 11 to 12.
- the feature map storage unit 11 temporarily stores the generated second feature map FM2 when each second feature map FM2 is generated by the second feature map generation unit 32. Since the feature map storage unit 11 is provided outside the second feature quantity extraction unit 23, it is possible to improve the efficiency of using the storage capacity.
- The object detection unit 24 detects individual objects in each captured image by using the plurality of third feature maps FM3 generated by the third feature map generation unit 33. More specifically, the position estimation unit 34 estimates the position of each object by regression, and the type estimation unit 35 estimates the type of each object by classification. The second neural network NN2 can be trained by supervised learning. In other words, the object detection unit 24 can be trained by supervised learning.
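To make the split between regression (position) and classification (type) concrete, the following sketch applies two per-location linear heads to a stack of third feature maps. It is a simplification and not the structure of NN2: the weights are random rather than trained, and the names `detection_head`, `w_box`, and `w_cls`, the four-value box encoding, and the class count of six are assumptions.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def detection_head(fm3: np.ndarray, w_box: np.ndarray, w_cls: np.ndarray):
    """fm3: (C, H, W); w_box: (4, C); w_cls: (K, C).

    Returns box values of shape (H, W, 4) from the regression head and
    class probabilities of shape (H, W, K) from the classification head.
    """
    boxes = np.einsum("oc,chw->hwo", w_box, fm3)
    probs = softmax(np.einsum("kc,chw->hwk", w_cls, fm3), axis=-1)
    return boxes, probs

rng = np.random.default_rng(0)
fm3 = rng.random((16, 5, 5))            # stand-in third feature maps
w_box = rng.standard_normal((4, 16))    # untrained regression weights
w_cls = rng.standard_normal((6, 16))    # untrained classification weights
boxes, probs = detection_head(fm3, w_box, w_cls)
```

Both heads read the same features, so any reinforcement applied when building FM3 influences position and type estimates alike.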
- the object detection unit 24 detects individual objects by SSD.
- The second neural network NN2 is configured by a neural network similar to the stage following "VGG-16" in the SSD described in Non-Patent Document 1 (see Fig. 2 of Non-Patent Document 1, etc.). That is, the second neural network NN2 is composed of a neural network that includes a network similar to the "Extra Feature Layers" in the SSD described in Non-Patent Document 1.
- This neural network executes a plurality of convolution operations. As a result, the position of each object is estimated, and the type of each object is estimated.
- The plurality of convolution operations use mutually different kernel sizes. More specifically, the kernel size becomes progressively smaller. This makes it possible to handle variations in the size of individual objects in the captured image. That is, so-called "multi-scale" object detection can be realized.
- FIG. 3 shows an example of the type estimated by the type estimation unit 35. That is, FIG. 3 shows an example of a class classified by the type estimation unit 35.
- "cars" indicates a vehicle traveling in the same direction as the traveling direction of the own vehicle.
- "large vehicles" indicates a large vehicle traveling in the same direction as the traveling direction of the own vehicle.
- "motorbikes" indicates a motorcycle traveling in the same direction as the traveling direction of the own vehicle. That is, these classes indicate other vehicles traveling in the same direction as the own vehicle. In other words, these classes represent following or overtaking vehicles.
- "cars (opposite direction)" indicates a vehicle traveling in the direction opposite to the traveling direction of the own vehicle.
- "large vehicles (opposite direction)" indicates a large vehicle traveling in the direction opposite to the traveling direction of the own vehicle.
- "motorbikes (opposite direction)" indicates a motorcycle traveling in the direction opposite to the traveling direction of the own vehicle. That is, these classes indicate other vehicles traveling in the direction opposite to the own vehicle. In other words, these classes represent oncoming vehicles.
- the classes classified by the type estimation unit 35 include the traveling direction of each object. That is, the type estimated by the type estimation unit 35 includes the traveling direction of each object. This makes it unnecessary to determine the traveling direction in processing downstream of the object detection unit 24. As a result, the amount of calculation in that downstream processing can be reduced.
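- the following is a minimal sketch of how a class label that already encodes the traveling direction can be consumed downstream without a separate direction-determination step. Only the six classes named in the text are listed; the table name `CLASS_TABLE`, the tuple `ClassInfo`, and the helper `is_oncoming` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class ClassInfo(NamedTuple):
    vehicle_type: str
    direction: str  # "same" or "opposite", relative to the own vehicle


# Hypothetical lookup table. Only the classes named in the text appear here;
# any further classes in FIG. 3 are deliberately not guessed.
CLASS_TABLE = {
    "cars": ClassInfo("car", "same"),
    "large vehicles": ClassInfo("large vehicle", "same"),
    "motorbikes": ClassInfo("motorbike", "same"),
    "cars (opposite direction)": ClassInfo("car", "opposite"),
    "large vehicles (opposite direction)": ClassInfo("large vehicle", "opposite"),
    "motorbikes (opposite direction)": ClassInfo("motorbike", "opposite"),
}


def is_oncoming(label: str) -> bool:
    """Read the direction straight from the class label; no extra estimation."""
    return CLASS_TABLE[label].direction == "opposite"
```

- in this sketch, a downstream consumer calls `is_oncoming(label)` on each detection, which is a dictionary lookup rather than an additional estimation step.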
- FIG. 4 is a block diagram showing a main part of a learning system including the learning device according to the first embodiment.
- a learning system including the learning device according to the first embodiment will be described with reference to FIG.
- the same blocks as those shown in FIG. 1 are designated by the same reference numerals and the description thereof will be omitted.
- the learning system 300 includes a storage device 2, a storage device 3, and a learning device 400.
- the storage device 2 has a feature map storage unit 11.
- the storage device 3 has an image data storage unit 12.
- the learning device 400 has an image data acquisition unit 21, a first feature amount extraction unit 22, a second feature amount extraction unit 23, an object detection unit 24, and a learning unit 25.
- the storage device 3 is composed of a memory.
- the image data storage unit 12 stores a database (hereinafter referred to as “learning image database”) including a plurality of images for learning (hereinafter sometimes referred to as “learning images”).
- the image data acquisition unit 21 in the learning device 400 acquires image data indicating individual learning images instead of acquiring image data indicating individual captured images.
- the first feature amount extraction unit 22, the second feature amount extraction unit 23, and the object detection unit 24 in the learning device 400 are the same as the first feature amount extraction unit 22, the second feature amount extraction unit 23, and the object detection unit 24 in the object detection device 200, respectively. Therefore, detailed description thereof will be omitted.
- the learning unit 25 trains the second feature amount extraction unit 23 by supervised learning (more specifically, deep learning) based on the detection result by the object detection unit 24. Further, the learning unit 25 trains the object detection unit 24 by supervised learning based on the detection result by the object detection unit 24.
- the learning unit 25 acquires data indicating a correct answer related to object detection corresponding to the learning image indicated by the image data acquired by the image data acquisition unit 21 (hereinafter referred to as “correct answer data”).
- the correct answer data is input in advance by a person (for example, the manufacturer of the object detection device 200 or the service provider using the object detection system 100).
- the learning unit 25 compares the detection result by the object detection unit 24 with the correct answer indicated by the acquired correct answer data. Based on the result of the comparison, the learning unit 25 updates the parameters in the first neural network NN1 as needed, and updates the parameters in the second neural network NN2 as needed.
- Various known techniques can be used to update such parameters. Detailed description of these techniques will be omitted.
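- as one illustration of such a known technique, the sketch below applies plain gradient descent to a toy linear model: the prediction is compared with the correct answer, and the parameters are moved against the gradient of the error. The source does not state which optimizer, loss, or learning rate is used, so the mean squared error, the step size `lr`, and the names `sgd_step` and `mse_and_grad` are all assumptions.

```python
import numpy as np


def sgd_step(params, grads, lr=0.1):
    """Return new parameters moved one step against the gradient."""
    return {name: value - lr * grads[name] for name, value in params.items()}


def mse_and_grad(w, x, target):
    """Mean squared error of the linear predictor x @ w, and its gradient in w."""
    err = x @ w - target
    loss = float(np.mean(err ** 2))
    grad = 2.0 * x.T @ err / len(target)
    return loss, grad


rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
target = x @ true_w  # stands in for the "correct answer data"

params = {"w": np.zeros(4)}
losses = []
for _ in range(200):
    loss, grad = mse_and_grad(params["w"], x, target)  # compare with the answer
    losses.append(loss)
    params = sgd_step(params, {"w": grad})             # update the parameters
```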
- the learning unit 25 thereby generates a trained model (hereinafter referred to as "machine learning model") that takes the image data acquired by the image data acquisition unit 21 as input and outputs the detection result of each object in each captured image.
- in the machine learning model, a plurality of parameter sets are set.
- the individual parameter sets include trained parameters for the first neural network NN1 and include trained parameters for the second neural network NN2.
- the detection result of each object in each captured image is specifically the estimation result of the position of each object in each captured image and the estimation result of the type of each object.
- the machine learning model is stored, for example, in a storage device (not shown).
- the reference sign "F1" may be used for the function of the image data acquisition unit 21.
- the reference sign "F2" may be used for the function of the first feature amount extraction unit 22.
- the reference sign "F3" may be used for the function of the second feature amount extraction unit 23.
- the reference sign "F4" may be used for the function of the object detection unit 24.
- the reference sign "F5" may be used for the function of the learning unit 25.
- the processes executed by the image data acquisition unit 21 may be collectively referred to as “image data acquisition process”.
- the processes executed by the first feature amount extraction unit 22 may be collectively referred to as “first feature amount extraction process”.
- the processes executed by the second feature amount extraction unit 23 may be collectively referred to as “second feature amount extraction process”.
- the processes executed by the object detection unit 24 may be collectively referred to as “object detection process”.
- the processes executed by the learning unit 25 may be collectively referred to as “learning process”.
- the object detection device 200 has a processor 41 and a memory 42.
- the memory 42 stores programs corresponding to a plurality of functions F1 to F4.
- the processor 41 reads and executes the program stored in the memory 42. As a result, a plurality of functions F1 to F4 are realized.
- the object detection device 200 has a processing circuit 43.
- a plurality of functions F1 to F4 are realized by the dedicated processing circuit 43.
- the object detection device 200 has a processor 41, a memory 42, and a processing circuit 43 (not shown).
- some of the plurality of functions F1 to F4 are realized by the processor 41 and the memory 42, and the remaining functions are realized by the dedicated processing circuit 43.
- the processor 41 is composed of one or more processors.
- each processor is, for example, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a microprocessor, a microcontroller, or a DSP (Digital Signal Processor).
- the memory 42 is composed of one or more non-volatile memories.
- the memory 42 is composed of one or more non-volatile memories and one or more volatile memories. That is, the memory 42 is composed of one or more memories.
- the individual memory uses, for example, a semiconductor memory, a magnetic disk, an optical disk, a magneto-optical disk, or a magnetic tape.
- each volatile memory uses, for example, a RAM (Random Access Memory).
- each non-volatile memory is, for example, a ROM (Read Only Memory), a flash memory, an EPROM (Erasable Programmable Read Only Memory), an EEPROM (Electrically Erasable Programmable Read Only Memory), a solid state drive, a hard disk drive, a flexible disk, a compact disc, a DVD (Digital Versatile Disc), a Blu-ray disc, or a mini disc.
- the processing circuit 43 is composed of one or more digital circuits.
- the processing circuit 43 is composed of one or more digital circuits and one or more analog circuits. That is, the processing circuit 43 is composed of one or more processing circuits.
- each processing circuit is, for example, an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), an FPGA (Field Programmable Gate Array), an SoC (System on a Chip), or a system LSI (Large Scale Integration).
- when the processing circuit 43 is composed of a plurality of processing circuits, the correspondence between the plurality of functions F1 to F4 and the plurality of processing circuits is arbitrary.
- the object detection device 200 may have a plurality of processing circuits having a one-to-one correspondence with a plurality of functions F1 to F4.
- each of the plurality of functions F1 to F4 may be realized exclusively by the corresponding one processing circuit among the plurality of processing circuits.
- the learning device 400 has a processor 44 and a memory 45.
- the memory 45 stores programs corresponding to a plurality of functions F1 to F5.
- the processor 44 reads and executes the program stored in the memory 45. As a result, a plurality of functions F1 to F5 are realized.
- the learning device 400 has a processing circuit 46.
- a plurality of functions F1 to F5 are realized by the dedicated processing circuit 46.
- the learning device 400 has a processor 44, a memory 45, and a processing circuit 46 (not shown).
- some of the plurality of functions F1 to F5 are realized by the processor 44 and the memory 45, and the remaining functions are realized by the dedicated processing circuit 46.
- the processor 44 is composed of one or more processors.
- the individual processors use, for example, CPUs, GPUs, microprocessors, microcontrollers or DSPs.
- the memory 45 is composed of one or more non-volatile memories.
- the memory 45 is composed of one or more non-volatile memories and one or more volatile memories. That is, the memory 45 is composed of one or more memories.
- the individual memory uses, for example, a semiconductor memory, a magnetic disk, an optical disk, a magneto-optical disk, or a magnetic tape.
- each volatile memory uses, for example, RAM.
- each non-volatile memory is, for example, a ROM, a flash memory, an EPROM, an EEPROM, a solid state drive, a hard disk drive, a flexible disk, a compact disc, a DVD, a Blu-ray disc, or a mini disc.
- the processing circuit 46 is composed of one or more digital circuits. Alternatively, the processing circuit 46 is composed of one or more digital circuits and one or more analog circuits. That is, the processing circuit 46 is composed of one or more processing circuits.
- the individual processing circuits use, for example, an ASIC, PLD, FPGA, SoC or system LSI.
- when the processing circuit 46 is composed of a plurality of processing circuits, the correspondence between the plurality of functions F1 to F5 and the plurality of processing circuits is arbitrary.
- the learning device 400 may have a plurality of processing circuits having a one-to-one correspondence with a plurality of functions F1 to F5.
- each of the plurality of functions F1 to F5 may be realized exclusively by the corresponding one processing circuit among the plurality of processing circuits.
- the image data acquisition unit 21 executes the image data acquisition process (step ST1).
- the first feature amount extraction unit 22 executes the first feature amount extraction process (step ST2).
- the second feature amount extraction unit 23 executes the second feature amount extraction process (step ST3).
- the object detection unit 24 executes the object detection process (step ST4).
- the image data acquisition unit 21 executes the image data acquisition process (step ST11).
- the first feature amount extraction unit 22 executes the first feature amount extraction process (step ST12).
- the second feature amount extraction unit 23 executes the second feature amount extraction process (step ST13).
- the object detection unit 24 executes the object detection process (step ST14).
- the learning unit 25 executes the learning process (step ST15).
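- the two step sequences above (steps ST1 to ST4 for detection, steps ST11 to ST15 for learning) can be sketched as simple function chains. Every function body here is a hypothetical string-building stand-in for the corresponding unit, used only to show the order of the steps and the data each step passes on; none of it reflects the actual networks.

```python
def image_data_acquisition(source):            # F1: step ST1 / ST11
    return {"image": source}


def first_feature_extraction(data):            # F2: step ST2 / ST12
    return {**data, "FM1": f"saliency({data['image']})"}


def second_feature_extraction(data):           # F3: step ST3 / ST13
    return {**data, "FM3": f"weighted(FM2, {data['FM1']})"}


def object_detection(data):                    # F4: step ST4 / ST14
    return {**data, "detections": f"detect({data['FM3']})"}


def learning(data, correct_answer):            # F5: step ST15
    # Placeholder: a real learning unit would update NN1 and NN2 here.
    return {**data, "needs_update": data["detections"] != correct_answer}


def detect(source):
    """Steps ST1 to ST4, executed in order."""
    data = image_data_acquisition(source)
    data = first_feature_extraction(data)
    data = second_feature_extraction(data)
    return object_detection(data)


def train_once(source, correct_answer):
    """Steps ST11 to ST15: the detection steps followed by the learning step."""
    return learning(detect(source), correct_answer)
```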
- the first neural network NN1 has a plurality of saliency block layers L1.
- “Input image” indicates a captured image or a learning image indicated by the image data acquired by the image data acquisition unit 21.
- “Saliency Map” indicates the first feature map FM1 generated by the first feature map generation unit 31.
- “Feature Map” indicates an individual third feature map FM3 generated by the third feature map generation unit 33.
- each saliency block layer L1 has a 3 × 3 convolution layer L11, a BN (Batch Normalization) layer L12, an ELU (Exponential Linear Unit) layer L13, a maximum pooling layer L14, and a saliency guide layer L15.
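- a minimal NumPy sketch of one such block layer follows, applying the five layers in the order in which they are listed. Several details are assumptions not stated in the text: the layers are assumed to run in listed order, the pooling window is assumed to be 2 × 2, the normalization is computed per channel over a single sample without learned scale or shift, and the saliency map is pooled the same way so that it matches the pooled feature maps.

```python
import numpy as np


def conv3x3(x, kernels):
    """Same-padded 3x3 convolution. x: (C_in, H, W); kernels: (C_out, C_in, 3, 3)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((kernels.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            patch = padded[:, dy:dy + h, dx:dx + w]
            out += np.einsum("oc,chw->ohw", kernels[:, :, dy, dx], patch)
    return out


def batch_norm(x, eps=1e-5):
    """Simplified per-channel normalization (no learned scale or shift)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def elu(x, alpha=1.0):
    """Exponential Linear Unit: x if x > 0, else alpha * (exp(x) - 1)."""
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0)) - 1))


def max_pool2x2(x):
    """2x2 max pooling with stride 2 on a (C, H, W) array."""
    c, h, w = x.shape
    cropped = x[:, :h // 2 * 2, :w // 2 * 2]
    return cropped.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))


def saliency_block(x, kernels, saliency, mode="mul"):
    """Conv -> BN -> ELU -> max pool -> saliency guide (weighting by the map)."""
    fm2 = max_pool2x2(elu(batch_norm(conv3x3(x, kernels))))
    guide = max_pool2x2(saliency[None])[0]      # match the pooled resolution
    return fm2 * guide if mode == "mul" else fm2 + guide
```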
- the CNN in the first neural network NN1 uses, for example, a VGG network.
- the VGG network may have BN added.
- the CNN in the first neural network NN1 may instead use, for example, a residual network (Residual Network), DenseNet, or MobileNet. Further, the CNN in the first neural network NN1 may use, for example, the technique described in Reference 2 below.
- the corresponding second feature map FM2 among the plurality of second feature maps FM2 is generated in the saliency block layer L1.
- the generated second feature map FM2 is weighted. That is, addition or multiplication is performed on each second feature map FM2 using the first feature map FM1, and weighting is performed on each second feature map FM2 by the first feature map FM1.
- FIGS. 13 to 21 are diagrams for explaining an image in which the individual second feature map FM2 is weighted in the saliency block layer L1 and the third feature map FM3 is generated.
- “Input image” in the figure indicates a captured image or a learning image indicated by the image data acquired by the image data acquisition unit 21.
- the camera 1 is configured by a camera for an electronic mirror and is provided in the vehicle.
- the camera 1 is acquired by the image data acquisition unit 21 for convenience.
- the image data is, for example, image data captured by a camera 1 configured by a surveillance camera that images the coast.
- “Saliency Map” indicates the first feature map FM1 generated by the first feature map generation unit 31.
- “Feature Map” indicates an individual second feature map FM2 generated by the second feature map generation unit 32, an individual third feature map FM3 generated by the third feature map generation unit 33, or an individual fourth feature map FM4 generated by the fourth feature map generation unit 36.
- FIG. 13 is a diagram for explaining an image in which the third feature map FM3 is generated by using the above-mentioned <generation method (1) by addition>.
- FIG. 14 is a diagram for explaining an image in which the third feature map FM3 is generated by using the above-mentioned <generation method (1) by multiplication>.
- the first feature map FM1 is used to generate the corresponding second feature map FM2 out of the plurality of second feature map FM2.
- the generated second feature map FM2 is weighted, and the image in which the third feature map FM3 is generated is shown. As shown in FIGS.
- in each first feature map FM1, the region corresponding to the object to be detected (here, a person) is activated. On the first feature map FM1, a large value is set for the first feature amount of the activated region. In the first feature map FM1, the region corresponding to a small object existing in the distance is also activated.
- on the second feature map FM2, by contrast, a small object existing in the distance may, for example, not be detected and instead be treated as background.
- the second feature map FM2 and the first feature map FM1 are added or multiplied, so that the first feature amount is spatially added to or multiplied by the second feature amount. In this way, weighting is performed and the importance W is set.
- as a result, the second feature map FM2 becomes a feature map on which the small object can be detected, even where the small object was previously undetected and treated as background.
- information that is meaningless, unnecessary, or redundant for object detection may nevertheless appear as a feature amount on the second feature map FM2.
- such a feature amount is meaningless, unnecessary, or redundant, and hinders learning.
- a foreground object such as a person or a vehicle
- background objects such as the sea or a building
- the second feature map FM2 and the first feature map FM1 are multiplied, so that the first feature amount is spatially multiplied by the second feature amount and the redundant second feature amount is truncated.
- specifically, "0" is set for the first feature amount that is meaningless in object detection. By multiplying by "0", the second feature amount becomes "0". This can prevent the learning of the foreground object from being hindered.
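- the two effects described above can be checked on a toy example. The arrays below are made-up one-row "maps" with four positions (a distant small object, background sea, a nearby person, background building); the numeric values are purely illustrative and do not come from the experiments.

```python
import numpy as np

# Positions: 0 = distant small object, 1 = background (sea),
#            2 = nearby person,        3 = background (building).
fm2 = np.array([[0.1, 0.6, 0.9, 0.4]])   # second feature map FM2 (illustrative)
fm1 = np.array([[0.8, 0.0, 1.0, 0.0]])   # first feature map FM1 (illustrative)

fm3_add = fm2 + fm1   # weighting by addition
fm3_mul = fm2 * fm1   # weighting by multiplication

# Addition lifts the weak response at the distant small object.
small_object_gain = fm3_add[0, 0] - fm2[0, 0]

# Multiplication by 0 truncates the background responses entirely.
background_after_mul = fm3_mul[0, [1, 3]]
```

- in this toy setting, addition raises the response at position 0 by the saliency value, while multiplication sets both background positions to zero and keeps the response at the person.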
- FIG. 15 is a diagram for explaining an image in which the third feature map FM3 is generated by using the above-mentioned <generation method (2) by addition>.
- FIG. 16 is a diagram for explaining an image in which the third feature map FM3 is generated by using the above-mentioned <generation method (2) by multiplication>.
- FIGS. 15 and 16 show, for example, that the first feature map FM1 is used and that the corresponding second feature map FM2 of the plurality of second feature maps FM2 is generated only in the first saliency block layer L1.
- the generated second feature map FM2 is weighted, and the image in which the third feature map FM3 is generated is shown. As shown in FIGS.
- the region corresponding to the object to be detected (here, a person) is activated.
- a plurality of fourth feature maps FM4 are generated from the first feature map FM1. Since the plurality of fourth feature maps FM4 are generated by convolution, they are feature maps that capture feature amounts in mutually different ways.
- the convolution operation performed by the fourth feature map generation unit 36 to generate the plurality of fourth feature maps FM4 is the same as the convolution operation performed by the second feature amount extraction unit 23 when generating the plurality of second feature maps FM2.
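- a rough sketch of this idea follows: a single first feature map is convolved with several 3 × 3 kernels to obtain a stack of fourth feature maps, which is then combined channel by channel with the second feature maps. The kernel values, the one-to-one pairing of channels, and the function names `fourth_feature_maps` and `weight_with_fm4` are assumptions made for illustration only.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def fourth_feature_maps(fm1, kernels):
    """Convolve one map fm1 (H, W) with C kernels (C, 3, 3) -> FM4 stack (C, H, W)."""
    windows = sliding_window_view(np.pad(fm1, 1), (3, 3))   # (H, W, 3, 3)
    return np.einsum("hwij,cij->chw", windows, kernels)


def weight_with_fm4(fm2, fm4, mode="add"):
    """Combine each second feature map with its paired fourth feature map."""
    if fm2.shape != fm4.shape:
        raise ValueError("FM2 and FM4 must have the same shape")
    return fm2 + fm4 if mode == "add" else fm2 * fm4


# Example: two different kernels give two different views of the same map.
fm1 = np.zeros((5, 5))
fm1[2, 2] = 1.0
identity = np.zeros((3, 3))
identity[1, 1] = 1.0
kernels = np.stack([identity, np.ones((3, 3))])
fm4 = fourth_feature_maps(fm1, kernels)
```

- with the two example kernels, the first fourth feature map reproduces the input map and the second spreads the activation over the 3 × 3 neighborhood, which illustrates how convolution yields maps that differ in how they capture the same activation.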
- FIGS. 13 to 16 show images in which the third feature map FM3 is generated by using <generation method (1) by addition>, <generation method (1) by multiplication>, <generation method (2) by addition>, and <generation method (2) by multiplication>, respectively, only in the first saliency block layer L1 among the individual saliency block layers L1.
- the third feature map FM3 may instead be generated, for example, in each saliency block layer L1 by using <generation method (1) by addition>, <generation method (1) by multiplication>, <generation method (2) by addition>, or <generation method (2) by multiplication>.
- FIG. 17 is a diagram showing an image in which a third feature map FM3 is generated in each saliency block layer L1 by using the above-mentioned <generation method (1) by addition>.
- a third feature map FM3 as shown in the image in FIG. 17 is generated.
- FIG. 18 is a diagram showing an image in which a third feature map FM3 is generated in each saliency block layer L1 by using the above-mentioned <generation method (1) by multiplication>.
- a third feature map FM3 as shown in the image in FIG. 18 is generated.
- FIG. 19 is a diagram showing an image in which a third feature map FM3 is generated in each saliency block layer L1 by using the above-mentioned <generation method (2) by addition>.
- a third feature map FM3 as shown in the image in FIG. 19 is generated.
- FIG. 20 is a diagram showing an image in which a third feature map FM3 is generated in each saliency block layer L1 by using the above-mentioned <generation method (2) by multiplication>.
- a third feature map FM3 as shown in the image in FIG. 20 is generated.
- FIG. 21 is a diagram for explaining an image in which the third feature map FM3 is generated by using the above-mentioned <generation method (3) by addition>.
- FIG. 21 shows an image in which the third feature map FM3 is generated by the above-mentioned <generation method (3) by addition> in each saliency block layer L1.
- the individual first feature map FM1, in which the region corresponding to the object to be detected (here, a person) is activated, is appended after the plurality of second feature maps FM2 in the dimensional direction.
- <generation method (3) by addition> does not spatially add the first feature amount to the second feature amount; rather, it is intended to weight the second feature map FM2 by increasing the variation of the feature maps.
- the first feature map FM1 and the second feature map FM2 are 500-dimensional feature maps, respectively.
- the generated third feature map FM3 is a 500-dimensional feature map, and the number in the dimensional direction does not change.
- the generated third feature map FM3 is a 1000-dimensional feature map. That is, the number of feature maps increases in the dimensional direction.
- the generated 1000-dimensional third feature map FM3 is further convolved in the next saliency block layer L1 to generate a third feature map FM3 with a richer variation in features.
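- the difference in dimensionality between spatial weighting and appending in the dimensional direction can be seen in a short NumPy sketch. The 500-dimensional sizes follow the example in the text; the 4 × 4 spatial size and the function names are arbitrary choices for illustration.

```python
import numpy as np


def generate_by_spatial_addition(fm1, fm2):
    """Spatial addition: the number of feature maps stays the same."""
    return fm2 + fm1


def generate_by_appending(fm1, fm2):
    """Append FM1 after FM2 in the dimensional direction (axis 0 here)."""
    return np.concatenate([fm2, fm1], axis=0)


fm1 = np.ones((500, 4, 4))    # 500-dimensional first feature map
fm2 = np.zeros((500, 4, 4))   # 500-dimensional second feature map

fm3_spatial = generate_by_spatial_addition(fm1, fm2)   # shape (500, 4, 4)
fm3_appended = generate_by_appending(fm1, fm2)         # shape (1000, 4, 4)
```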
- SE (Squeeze-and-Excitation)
- VGG + BN VGG + BN + SE
- ResNet + SE ResNet + SE
- the reference numeral "200'_1" is used for a conventional object detection device (not shown) having a feature amount extraction unit by VGG and an object detection unit by SSD.
- the reference numeral "200'_2" is used for a conventional object detection device (not shown) having a feature amount extraction unit by VGG + BN + SE or ResNet + SE and an object detection unit by SSD. That is, these object detection devices 200'_1 and 200'_2 are comparison targets with respect to the object detection device 200. Further, these object detection devices 200'_1 and 200'_2 have neither a portion corresponding to the first feature map generation unit 31 nor a portion corresponding to the third feature map generation unit 33.
- the range including medium sizes is referred to as “Medium”. The range including sizes smaller than those included in Medium is referred to as “Small”, and the range including sizes larger than those included in Medium is referred to as “Large”. Specifically, for example, Small is the range of objects smaller than 32 × 32 pixels, Medium is the range of objects larger than 32 × 32 pixels and smaller than 96 × 96 pixels, and Large is the range of objects larger than 96 × 96 pixels.
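- these size ranges can be expressed as a small helper. Two points are assumptions because the text leaves them open: sizes are compared by pixel area, and an object exactly at a boundary (32 × 32 or 96 × 96 pixels) is assigned to the larger range. The function name `size_range` is hypothetical.

```python
def size_range(width_px: int, height_px: int) -> str:
    """Classify an object by pixel area into Small, Medium, or Large."""
    area = width_px * height_px
    if area < 32 * 32:
        return "Small"
    if area < 96 * 96:   # boundary handling is an assumption, see lead-in
        return "Medium"
    return "Large"
```

- for example, under these assumptions a 20 × 20 pixel vehicle far behind the own vehicle falls into Small, while a 120 × 100 pixel vehicle falls into Large.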
- a data set based on CMS-DD (Camera Monitoring System Driving Dataset) in which only two of the eight classes shown in FIG. 3 are included in the classification target is referred to as "2class".
- a CMS-DD data set in which only four of the eight classes shown in FIG. 3 are included in the classification target is referred to as "4class".
- a CMS-DD data set in which all eight classes shown in FIG. 3 are included in the classification target is referred to as "8 class".
- FIG. 22 shows an example of a captured image.
- FIG. 23 shows an example of a feature map corresponding to the first feature map FM1 generated by the object detection device 200 when the image data showing the captured image shown in FIG. 22 is input to the object detection device 200. More specifically, FIG. 23 shows an example of a feature map corresponding to the saliency map generated by the object detection device 200.
- FIG. 24 shows an example of a feature map corresponding to one feature map FM' of the plurality of feature maps FM' generated by the object detection device 200'_2 when the image data showing the captured image shown in FIG. 22 is input to the object detection device 200'_2. More specifically, FIG. 24 shows an example of a feature map corresponding to the first feature map FM' of the plurality of feature maps FM'.
- FIG. 25 shows an example of a feature map corresponding to one third feature map FM3 of the plurality of third feature maps FM3 generated by the object detection device 200 when the image data showing the captured image shown in FIG. 22 is input to the object detection device 200. More specifically, FIG. 25 shows an example of a feature map corresponding to the first third feature map FM3 among the plurality of third feature maps FM3.
- a region different from the region corresponding to the object to be detected (that is, another vehicle) is activated. More specifically, the area of the background corresponding to the sky is activated.
- the region corresponding to the object to be detected (that is, another vehicle) is activated. This is due to the weighting using the saliency map corresponding to the feature map shown in FIG. 23.
- a feature map that is activated over a wide area as a global feature is evaluated as the better feature. Such an evaluation does not actually take the meaning of the activated area into account. For this reason, in object detection, a method in which weighting is performed based on features derived from objects, such as saliency, is superior.
- by using the weighted third feature maps FM3 for object detection, the following effects can be obtained compared with the case where the feature maps FM' are used for object detection (that is, the case where the first feature map FM1 before weighting is used for object detection).
- the accuracy of object detection can be improved.
- since the context related to object-likeness is taken into consideration, the occurrence of erroneous detection can be suppressed.
- the feature amount extraction unit that is, the second feature amount extraction unit 23
- each feature map that is, the individual second feature map FM2 and the individual third feature map FM3
- the size of each feature map can be increased while avoiding an explosive increase in the amount of calculation. As a result, it is possible to realize the detection of a small object.
- when the object detection device 200 is used for an electronic mirror, it is required to use an in-vehicle processor 41 or processing circuit 43. That is, it is required to use an inexpensive processor 41 or processing circuit 43, in other words, one having low computing power. At the same time, in this case, it is required to detect small objects from the viewpoint of detecting another vehicle or the like traveling at a position far from the position of the own vehicle. By using the object detection device 200, the amount of calculation can be reduced and the detection of small objects can be realized.
- FIG. 26 shows an example of the detection result by the object detection device 200'_2 related to the captured image shown in FIG. 22.
- FIG. 27 shows an example of the detection result by the object detection device 200 related to the captured image shown in FIG. 22.
- by using the object detection device 200, it is possible to realize the detection of small objects, in contrast to the case where the object detection device 200'_2 is used. That is, it is possible to detect another vehicle or the like traveling at a position far from the position of the own vehicle.
- FIG. 28 is a line graph showing the experimental results relating to the detection accuracy by each of the object detection device 200 and the object detection device 200'_1 when 2class is used.
- FIG. 29 is a line graph showing the experimental results relating to the detection accuracy by each of the object detection device 200 and the object detection device 200'_1 when 4class is used.
- FIG. 30 is a line graph showing the experimental results relating to the detection accuracy by each of the object detection device 200 and the object detection device 200'_1 when 8 class is used.
- the unit of the numerical value on the vertical axis in FIGS. 28 to 30 is mAP (mean Average Precision).
- the mAP is an accuracy evaluation index showing the recognition rate at which an object is captured.
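- for orientation, the sketch below computes one common, non-interpolated form of average precision for a single class and then averages it over classes. The source does not state which mAP variant or matching rule was used in the experiments, so this is an assumed textbook form; the function names and the input format (a confidence score and a true-positive flag per detection) are hypothetical.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Non-interpolated AP: sum of precision times the increase in recall."""
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    # Rank detections from highest to lowest confidence.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    true_positives = 0
    previous_recall = 0.0
    ap = 0.0
    for rank, index in enumerate(order, start=1):
        if is_true_positive[index]:
            true_positives += 1
        precision = true_positives / rank
        recall = true_positives / num_ground_truth
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """mAP: the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```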
- the number of layers in VGGNet is set to 4.
- each numerical value indicated by “approach2 (mul)” shows the experimental result relating to the detection accuracy of the object detection device 200 when the third feature map FM3 is generated by using the above-mentioned <generation method (1) by multiplication> only in the first saliency block layer L1.
- each numerical value indicated by “experiment 2 (add)” shows the experimental result relating to the detection accuracy of the object detection device 200 when the third feature map FM3 is generated by using the above-mentioned <generation method (1) by addition> only in the first saliency block layer L1.
- each numerical value indicated by “experiment 3 (mul)” shows the experimental result relating to the detection accuracy of the object detection device 200 when the third feature map FM3 is generated by using the above-mentioned <generation method (2) by multiplication> only in the first saliency block layer L1.
- each numerical value indicated by “experiment 3 (add)” shows the experimental result relating to the detection accuracy of the object detection device 200 when the third feature map FM3 is generated by using the above-mentioned <generation method (2) by addition> only in the first saliency block layer L1.
- each numerical value indicated by “experiment 4” shows the experimental result relating to the detection accuracy of the object detection device 200 when the third feature map FM3 is generated by using the above-mentioned <generation method (1) by addition> in each saliency block layer L1.
- each numerical value indicated by “experiment4_advance_v1” shows the experimental result relating to the detection accuracy of the object detection device 200 when the third feature map FM3 is generated by using the above-mentioned <generation method (1) by multiplication> in each saliency block layer L1.
- each numerical value indicated by “experiment4_advance_v2” shows the experimental result relating to the detection accuracy of the object detection device 200 when the third feature map FM3 is generated by using the above-mentioned <generation method (2) by addition> in each saliency block layer L1.
- each numerical value indicated by “experiment4_advance_v3” shows the experimental result relating to the detection accuracy of the object detection device 200 when the third feature map FM3 is generated by using the above-mentioned <generation method (3) by addition> in each saliency block layer L1.
- each numerical value indicated by "VGG” indicates an experimental result relating to the detection accuracy in the object detection device 200'_1.
- by using the object detection device 200, it is possible to improve the detection accuracy for objects as compared with the case where the object detection device 200'_1 is used. That is, the accuracy of object detection can be improved.
- the evaluation of Small is important for an in-vehicle electronic mirror that is required to use a processor 41 or a processing circuit 43 having a low computing power, while it is required to realize detection of a small object.
- the calculation speed becomes explosively slow. Therefore, it tends to be difficult to detect a small object while reducing the amount of calculation.
- the object detection device 200 can acquire a feature amount sufficient for detecting a small object while reducing the calculation amount. By using the object detection device 200, the amount of calculation can be reduced and the detection of a small object can be realized.
- each first feature amount may be any feature amount that uses mid-level features corresponding to object-likeness. That is, the first feature amount is not limited to saliency.
- the first feature map is not limited to the saliency map.
- the first feature map generation unit 31 may generate a depth map (Depth Map) using a distance image or a sonar image corresponding to each captured image.
- the first feature map generation unit 31 may generate a thermal map (Thermal Map) using a temperature image corresponding to each captured image. That is, the weighting in the second feature amount extraction unit 23 may be based on the so-called “Middle-level Sensor Fusion”.
- the distance image or sonar image is obtained from, for example, a distance sensor, a millimeter wave radar, a sonar sensor, or an infrared sensor.
- the temperature image is obtained, for example, from a thermal sensor. Since the distance sensor can correctly measure the distance to the object, when the distance image obtained from the distance sensor is used, the accuracy of the first feature map showing the object-likeness is high. Millimeter-wave radar can accurately measure the distance to an object even in bad weather.
- the sonar sensor or the infrared sensor can measure the position of an object at a short distance at low cost.
- the thermal sensor is suitable for shooting at night.
- The first feature map generated by the first feature map generation unit 31 can be at least one of a saliency map based on a captured image, a depth map based on a distance image or a sonar image, and a thermal map based on a temperature image.
- By generating the first feature map from, for example, a distance image, a sonar image, or a temperature image, the first feature map generation unit 31 can generate a first feature map suited to the feature to be extracted, as described above, and can also generate a first feature map with high anonymity from the viewpoint of privacy protection.
- the thermal map is suitable for use as a first feature map when a person is desired to be detected because the region corresponding to the person is activated. Further, the thermal map generated by using the temperature image is more excellent in nighttime person detection than the first feature map generated by using the captured image.
- the method of generating the first feature map FM1 by the first feature map generation unit 31 is not limited to the saliency estimation.
- the first feature map generation unit 31 executes at least one of image gradient detection (Edge Detection), object-likeness estimation (Objectness Estimation), and region segmentation (Segmentation) in place of or in addition to the saliency estimation. By doing so, the first feature map FM1 may be generated.
- the object detection in the object detection unit 24 is not limited to the SSD.
- the object detection in the object detection unit 24 may be performed by RetinaNet, Mask R-CNN, YOLO, or Faster R-CNN. Further, for example, the object detection in the object detection unit 24 may be performed by EfficientDet (see Reference 3 below).
- the object detection device 200 may have a learning unit 25.
- the learning unit 25 in the object detection device 200 may use the image captured by the camera 1 as the learning image to learn the second feature amount extraction unit 23 and the object detection unit 24.
- the learning unit 25 in the object detection device 200 may generate a machine learning model that takes an image captured by the camera 1 as an input and outputs a detection result of each object in the captured image.
- The object detection device 200 includes: an image data acquisition unit 21 that acquires image data indicating an image captured by the camera 1; a first feature amount extraction unit 22 that generates a first feature map FM1 using the image data; a second feature amount extraction unit 23 that generates a second feature map FM2 using the image data and generates a third feature map FM3 by weighting the second feature map FM2 through addition or multiplication using the first feature map FM1; and an object detection unit 24 that detects an object in the captured image using the third feature map FM3. The first feature amount in the first feature map FM1 uses medium-level features corresponding to object-likeness, and the second feature amount in the second feature map FM2 uses high-level features. This makes it possible to improve the accuracy of object detection. In addition, the amount of calculation can be reduced. Moreover, the detection of small objects can be realized.
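The weighting of the second feature map FM2 by the first feature map FM1 can be illustrated with a minimal sketch. It assumes that FM1 is a single-channel map with the same height and width as each channel of FM2, and it applies a plain element-wise addition or multiplication. The function name `weight_feature_map` and the list-based layout are hypothetical, and the sketch does not reproduce the specific generation methods by addition or multiplication described elsewhere in this document.

```python
from typing import List

Channel = List[List[float]]  # one H x W channel of a feature map


def weight_feature_map(fm2: List[Channel], fm1: Channel, mode: str = "multiply") -> List[Channel]:
    """Return FM3: every channel of FM2 weighted element-wise by FM1.

    Assumption: FM1 is single-channel and has the same H x W as each FM2 channel.
    """
    if mode not in ("add", "multiply"):
        raise ValueError("mode must be 'add' or 'multiply'")
    fm3: List[Channel] = []
    for channel in fm2:
        if len(channel) != len(fm1) or any(len(r2) != len(r1) for r2, r1 in zip(channel, fm1)):
            raise ValueError("FM1 and FM2 must share the same height and width")
        if mode == "add":
            weighted = [[v2 + v1 for v2, v1 in zip(r2, r1)] for r2, r1 in zip(channel, fm1)]
        else:
            weighted = [[v2 * v1 for v2, v1 in zip(r2, r1)] for r2, r1 in zip(channel, fm1)]
        fm3.append(weighted)
    return fm3
```

With multiplication, positions where FM1 is zero suppress the corresponding high-level features entirely, whereas addition only raises positions where FM1 is large. Which behavior is preferable depends on the generation method in use.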
- The learning device 400 includes: an image data acquisition unit 21 that acquires image data indicating an image for learning; a first feature amount extraction unit 22 that generates a first feature map FM1 using the image data; a second feature amount extraction unit 23 that generates a second feature map FM2 using the image data and generates a third feature map FM3 by weighting the second feature map FM2 through addition or multiplication using the first feature map FM1; an object detection unit 24 that detects an object in the learning image using the third feature map FM3; and a learning unit 25 that trains the second feature amount extraction unit 23 and the object detection unit 24 according to the detection result by the object detection unit 24. The first feature amount in the first feature map FM1 uses medium-level features corresponding to object-likeness, and the second feature amount in the second feature map FM2 uses high-level features. Thereby, the learning device 400 for the object detection device 200 can be realized.
- FIG. 32 is a block diagram showing a main part of an object detection system including the object detection device according to the second embodiment. An object detection system including the object detection device according to the second embodiment will be described with reference to FIG. 32. In FIG. 32, the same blocks as those shown in FIG. 1 are designated by the same reference numerals and the description thereof will be omitted.
- the object detection system 100a includes a camera 1, a storage device 2, a clock 4, a storage device 5, and an object detection device 200a.
- the storage device 2 has a feature map storage unit 11.
- the storage device 5 has a time-based parameter storage unit 13.
- the object detection device 200a includes an image data acquisition unit 21, a first feature amount extraction unit 22, a second feature amount extraction unit 23, an object detection unit 24, a time information acquisition unit 26, and a parameter selection unit 27.
- the storage device 5 is composed of a memory.
- the time information acquisition unit 26 acquires information indicating the time (hereinafter referred to as "time information") using the clock 4.
- the time information indicates, for example, the current time.
- the time-based parameter storage unit 13 stores a database (hereinafter referred to as "time-based learned parameter database") including a plurality of machine learning models in which a plurality of parameter sets are set.
- the individual parameter sets include trained parameters for the first neural network NN1 and include trained parameters for the second neural network NN2.
- the plurality of parameter sets included in the time-based learned parameter database correspond to different time zones.
- The time-based learned parameter database includes a parameter set corresponding to daytime, a parameter set corresponding to evening, a parameter set corresponding to dusk, and a parameter set corresponding to nighttime.
- the parameter selection unit 27 selects the parameter set corresponding to the time zone including the time indicated by the time information from the plurality of parameter sets included in the time-based learned parameter database.
- the parameter selection unit 27 sets the parameters in the first neural network NN1 and sets the parameters in the second neural network NN2 using the selected parameter set.
- the second feature amount extraction unit 23 executes the second feature amount extraction process using the parameters set by the parameter selection unit 27.
- the object detection unit 24 is configured to execute the object detection process using the parameters set by the parameter selection unit 27.
- the second feature amount extraction unit 23 executes the second feature amount extraction process using the learned parameters included in the parameter set selected by the parameter selection unit 27.
- the object detection unit 24 is configured to execute the object detection process using the learned parameters included in the parameter set selected by the parameter selection unit 27.
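The selection performed by the parameter selection unit 27 can be sketched as a lookup keyed by time zone. The boundary times below are purely hypothetical, since this document does not specify where daytime, evening, dusk, and nighttime begin or end. The names `TIME_ZONES`, `time_zone_of`, and `select_parameter_set` are likewise illustrative assumptions.

```python
from datetime import time
from typing import Dict, List, Tuple

# Hypothetical boundaries (start inclusive, end exclusive); not taken from the document.
TIME_ZONES: List[Tuple[str, time, time]] = [
    ("daytime", time(6, 0), time(16, 0)),
    ("evening", time(16, 0), time(18, 0)),
    ("dusk", time(18, 0), time(19, 0)),
]


def time_zone_of(now: time) -> str:
    """Map a clock time to a time-zone label; anything not listed counts as nighttime."""
    for name, start, end in TIME_ZONES:
        if start <= now < end:
            return name
    return "nighttime"


def select_parameter_set(database: Dict[str, dict], now: time) -> dict:
    """Pick the parameter set (parameters for NN1 and NN2) for the current time zone."""
    return database[time_zone_of(now)]
```

The returned parameter set would then be loaded into the first neural network NN1 and the second neural network NN2 before the second feature amount extraction process and the object detection process are executed.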
- FIG. 33 is a block diagram showing a main part of the learning system including the learning device according to the second embodiment.
- a learning system including the learning device according to the second embodiment will be described with reference to FIG. 33.
- the same blocks as those shown in FIG. 4 are designated by the same reference numerals and the description thereof will be omitted.
- the learning system 300a includes a storage device 2, a storage device 3a, a storage device 5, and a learning device 400.
- the storage device 2 has a feature map storage unit 11.
- the storage device 3a has a time-based image data storage unit 14.
- the storage device 5 has a time-based parameter storage unit 13.
- the learning device 400 has an image data acquisition unit 21, a first feature amount extraction unit 22, a second feature amount extraction unit 23, an object detection unit 24, and a learning unit 25.
- the time-based image data storage unit 14 stores a plurality of learning image databases.
- the plurality of learning image databases correspond to different time zones.
- The plurality of learning image databases include a learning image database corresponding to daytime, a learning image database corresponding to evening, a learning image database corresponding to dusk, and a learning image database corresponding to nighttime.
- the plurality of learning images included in the individual learning image databases are taken by a camera similar to the camera 1 at a time within the corresponding time zone.
- the learning of the second feature amount extraction unit 23 and the object detection unit 24 by the learning unit 25 is executed by using the individual learning image databases. That is, such learning is executed for each learning image database. As a result, a plurality of machine learning models in which a plurality of parameter sets corresponding to different time zones are set are generated.
- The learning unit 25 stores, in the time-based parameter storage unit 13, the plurality of machine learning models in which the generated plurality of parameter sets are set. As a result, the time-based learned parameter database is generated.
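The per-database learning described above amounts to running the same training routine once for each learning image database and storing the result under the corresponding time zone. The sketch below assumes a caller-supplied `train` callable, which is hypothetical and stands in for the learning by the learning unit 25. The sketch only shows how the database is assembled, not how training itself works.

```python
from typing import Callable, Dict, List, TypeVar

Image = TypeVar("Image")
ParameterSet = TypeVar("ParameterSet")


def build_parameter_database(
    image_databases: Dict[str, List[Image]],
    train: Callable[[List[Image]], ParameterSet],
) -> Dict[str, ParameterSet]:
    """Train once per learning image database and key each result by its time zone."""
    return {zone: train(images) for zone, images in image_databases.items()}
```

The same structure applies when the learning image databases are separated by a criterion other than time zone, because only the dictionary keys change.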
- The reference sign "F6" may be used for the function of the time information acquisition unit 26. Further, the reference sign "F7" may be used for the function of the parameter selection unit 27.
- The processes executed by the time information acquisition unit 26 may be collectively referred to as "time information acquisition process".
- The processes executed by the parameter selection unit 27 may be collectively referred to as "parameter selection process".
- the object detection device 200a has a plurality of functions F1 to F4, F6, and F7.
- Each of the plurality of functions F1 to F4, F6, and F7 may be realized by the processor 41 and the memory 42, or may be realized by the dedicated processing circuit 43.
- the processing circuit 43 may include a plurality of processing circuits corresponding to a plurality of functions F1 to F4, F6, and F7.
- the hardware configuration of the main part of the learning device 400 is the same as that described with reference to FIGS. 7 and 8 in the first embodiment. Therefore, illustration and description will be omitted.
- In FIG. 34, the same steps as those shown in FIG. 9 are designated by the same reference numerals and the description thereof will be omitted.
- the time information acquisition unit 26 executes the time information acquisition process (step ST5).
- the parameter selection unit 27 executes the parameter selection process (step ST6).
- the processes of steps ST1 to ST4 are executed.
- the operation of the learning device 400 is the same as that described with reference to the flowchart of FIG. 10 in the first embodiment. Therefore, illustration and description will be omitted.
- the object detection device 200a can employ various modifications similar to those described in the first embodiment.
- The object detection device 200a includes a time information acquisition unit 26 that acquires time information, and a parameter selection unit 27 that selects, from the parameter sets included in the time-based learned parameter database, the parameter set corresponding to the time indicated by the time information. The second feature amount extraction unit 23 generates the second feature map FM2 and the third feature map FM3 using the learned parameters included in the parameter set selected by the parameter selection unit 27. This makes it possible to further improve the accuracy of object detection.
- FIG. 35 is a block diagram showing a main part of an object detection system including the object detection device according to the third embodiment.
- An object detection system including the object detection device according to the third embodiment will be described with reference to FIG. 35.
- the same blocks as those shown in FIG. 1 are designated by the same reference numerals and the description thereof will be omitted.
- the object detection system 100b includes a camera 1, a storage device 2, a locator 6, a storage device 7, and an object detection device 200b.
- the storage device 2 has a feature map storage unit 11.
- the storage device 7 has a location-specific parameter storage unit 15.
- the object detection device 200b includes an image data acquisition unit 21, a first feature amount extraction unit 22, a second feature amount extraction unit 23, an object detection unit 24, a location information acquisition unit 28, and a parameter selection unit 29.
- the storage device 7 is composed of a memory.
- The location information acquisition unit 28 uses the locator 6 to acquire information indicating a location (hereinafter referred to as "location information"). More specifically, the location information indicates the type of location corresponding to the current position of the own vehicle. For example, the location information indicates whether that location is an urban area, a highway, or a suburb.
- the location-specific parameter storage unit 15 stores a database including a plurality of machine learning models in which a plurality of parameter sets are set (hereinafter referred to as "location-specific trained parameter database").
- the individual parameter sets include trained parameters for the first neural network NN1 and include trained parameters for the second neural network NN2.
- The plurality of parameter sets included in the location-specific trained parameter database correspond to mutually different locations.
- The location-specific trained parameter database includes a parameter set corresponding to an urban area, a parameter set corresponding to a highway, and a parameter set corresponding to a suburb.
- The parameter selection unit 29 selects, from the plurality of parameter sets included in the location-specific trained parameter database, the parameter set corresponding to the location indicated by the location information.
- the parameter selection unit 29 sets the parameters in the first neural network NN1 and sets the parameters in the second neural network NN2 using the selected parameter set.
- the second feature amount extraction unit 23 executes the second feature amount extraction process using the parameters set by the parameter selection unit 29.
- the object detection unit 24 is configured to execute the object detection process using the parameters set by the parameter selection unit 29.
- the second feature amount extraction unit 23 executes the second feature amount extraction process using the learned parameters included in the parameter set selected by the parameter selection unit 29.
- the object detection unit 24 is configured to execute the object detection process using the learned parameters included in the parameter set selected by the parameter selection unit 29.
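The selection by the parameter selection unit 29 can be sketched as a lookup keyed by location type. The enumeration `LocationType` and the function `select_by_location` are hypothetical names. Raising an error when no parameter set exists for a location is also an assumption, since this document does not describe that case.

```python
from enum import Enum
from typing import Mapping


class LocationType(Enum):
    URBAN = "urban area"
    HIGHWAY = "highway"
    SUBURB = "suburb"


def select_by_location(database: Mapping[LocationType, dict], location: LocationType) -> dict:
    """Return the parameter set (parameters for NN1 and NN2) for the given location type."""
    try:
        return database[location]
    except KeyError:
        # Assumed behavior: fail loudly rather than fall back to another location.
        raise LookupError(f"no parameter set for {location.value}") from None
```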
- FIG. 36 is a block diagram showing a main part of a learning system including the learning device according to the third embodiment.
- a learning system including the learning device according to the third embodiment will be described with reference to FIG. 36.
- the same blocks as those shown in FIG. 4 are designated by the same reference numerals and the description thereof will be omitted.
- the learning system 300b includes a storage device 2, a storage device 3b, a storage device 7, and a learning device 400.
- the storage device 2 has a feature map storage unit 11.
- the storage device 3b has a location-specific image data storage unit 16.
- the storage device 7 has a location-specific parameter storage unit 15.
- the learning device 400 has an image data acquisition unit 21, a first feature amount extraction unit 22, a second feature amount extraction unit 23, an object detection unit 24, and a learning unit 25.
- the location-specific image data storage unit 16 stores a plurality of learning image databases.
- a plurality of learning image databases correspond to different locations from each other.
- the plurality of learning image databases include a learning image database corresponding to an urban area, a learning image database corresponding to a highway, and a learning image database corresponding to a suburb.
- the plurality of learning images included in the individual learning image databases are taken by the same camera as the camera 1 at the corresponding places.
- the learning of the second feature amount extraction unit 23 and the object detection unit 24 by the learning unit 25 is executed by using the individual learning image databases. That is, such learning is executed for each learning image database. This will generate a plurality of parameter sets corresponding to different locations.
- The learning unit 25 stores the generated plurality of parameter sets in the location-specific parameter storage unit 15. As a result, the location-specific trained parameter database is generated.
- The reference sign "F8" may be used for the function of the location information acquisition unit 28. Further, the reference sign "F9" may be used for the function of the parameter selection unit 29.
- The processes executed by the location information acquisition unit 28 may be collectively referred to as "location information acquisition process".
- The processes executed by the parameter selection unit 29 may be collectively referred to as "parameter selection process".
- the hardware configuration of the main part of the object detection device 200b is the same as that described with reference to FIGS. 5 and 6 in the first embodiment. Therefore, illustration and description will be omitted. That is, the object detection device 200b has a plurality of functions F1 to F4, F8, and F9. Each of the plurality of functions F1 to F4, F8, and F9 may be realized by the processor 41 and the memory 42, or may be realized by the dedicated processing circuit 43. Further, the processing circuit 43 may include a plurality of processing circuits corresponding to a plurality of functions F1 to F4, F8, F9.
- the hardware configuration of the main part of the learning device 400 is the same as that described with reference to FIGS. 7 and 8 in the first embodiment. Therefore, illustration and description will be omitted.
- In FIG. 37, the same steps as those shown in FIG. 9 are designated by the same reference numerals and the description thereof will be omitted.
- the location information acquisition unit 28 executes the location information acquisition process (step ST7).
- the parameter selection unit 29 executes the parameter selection process (step ST8).
- the processes of steps ST1 to ST4 are executed.
- the operation of the learning device 400 is the same as that described with reference to FIG. 10 in the first embodiment. Therefore, illustration and description will be omitted.
- the accuracy of object detection can be further improved. That is, an appropriate degree of freedom in the network can be realized.
- the object detection device 200b can employ various modifications similar to those described in the first embodiment.
- The object detection device 200b includes a location information acquisition unit 28 that acquires location information, and a parameter selection unit 29 that selects, from the parameter sets included in the location-specific trained parameter database, the parameter set corresponding to the location indicated by the location information. The second feature amount extraction unit 23 generates the second feature map FM2 and the third feature map FM3 using the learned parameters included in the parameter set selected by the parameter selection unit 29. This makes it possible to further improve the accuracy of object detection.
- FIG. 38 is a block diagram showing a main part of a monitoring system including the monitoring device according to the fourth embodiment.
- FIG. 39 is a block diagram showing a main part of an analysis unit and an output control unit in the monitoring device according to the fourth embodiment.
- a monitoring system including the monitoring device according to the fourth embodiment will be described with reference to FIGS. 38 and 39.
- In FIG. 38, the same blocks as those shown in FIG. 1 are designated by the same reference numerals and the description thereof will be omitted.
- the monitoring system 500 includes a camera 1, a storage device 2, an output device 8, and a monitoring device 600.
- the monitoring device 600 includes an object detection device 200, an analysis unit 51, and an output control unit 52.
- the analysis unit 51 has an abnormality determination unit 61, a time analysis unit 62, a threat determination unit 63, and a spatial analysis unit 64.
- the output control unit 52 has an image output control unit 65 and an audio output control unit 66.
- the output device 8 includes a display 71 and a speaker 72.
- the camera 1 is composed of, for example, a surveillance camera, a security camera, or a camera for an electronic mirror.
- the display 71 is composed of a display for an electronic mirror. That is, in this case, the camera 1 and the display 71 constitute the main part of the electronic mirror.
- an example in this case will be mainly described.
- the abnormality determination unit 61 determines the degree of abnormality A of each object by using the detection result by the object detection unit 24. More specifically, the abnormality determination unit 61 determines the degree of abnormality A based on the position of each object by using the estimation result by the position estimation unit 34.
- When another vehicle is detected by the object detection unit 24 and the other vehicle is located at a normal position (for example, a position corresponding to an inter-vehicle distance of a predetermined value or more), the degree of abnormality A is set to a smaller value than when the other vehicle is located at an abnormal position (for example, a position corresponding to an inter-vehicle distance less than the predetermined value).
- Conversely, when the other vehicle is located at an abnormal position (same as above), the degree of abnormality A is set to a larger value than when the other vehicle is located at a normal position (same as above).
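The position-based determination of the degree of abnormality A can be sketched as a threshold on the inter-vehicle distance. The predetermined distance and the two values of A below are hypothetical placeholders. This document only states that A is larger at an abnormal position than at a normal one.

```python
def abnormality_degree(
    inter_vehicle_distance_m: float,
    predetermined_distance_m: float = 20.0,  # hypothetical predetermined value
    normal_value: float = 0.1,               # hypothetical A for a normal position
    abnormal_value: float = 1.0,             # hypothetical A for an abnormal position
) -> float:
    """Return a small A when the distance is at or above the predetermined value."""
    if inter_vehicle_distance_m >= predetermined_distance_m:
        return normal_value
    return abnormal_value
```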
- The time analysis unit 62 temporally analyzes the detection result by the object detection unit 24. That is, the time analysis unit 62 temporally analyzes the results of a plurality of object detection processes corresponding to a plurality of temporally consecutive captured images. In other words, the time analysis unit 62 temporally analyzes the results of the object detection processing for a plurality of frames. As a result, the time analysis unit 62 calculates the time change amount ΔS of the size of each object in the moving image captured by the camera 1.
- The time analysis unit 62 calculates the expansion rate per unit time of the bounding box corresponding to each object.
- The time analysis unit 62 calculates the time change amount ΔS by integrating the calculated expansion rate.
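One way to read the two steps above is sketched below. It assumes that the size of an object is the area of its bounding box, that the expansion rate is the change in area per second between consecutive frames, and that the integration is a sum of rate multiplied by frame interval. These definitions, and the names `box_area`, `expansion_rates`, and `time_change_amount`, are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def box_area(box: Box) -> float:
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def expansion_rates(boxes: Sequence[Box], timestamps: Sequence[float]) -> List[float]:
    """Area change per second between each pair of consecutive frames."""
    if len(boxes) != len(timestamps):
        raise ValueError("boxes and timestamps must have the same length")
    rates: List[float] = []
    for k in range(1, len(boxes)):
        dt = timestamps[k] - timestamps[k - 1]
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        rates.append((box_area(boxes[k]) - box_area(boxes[k - 1])) / dt)
    return rates


def time_change_amount(rates: Sequence[float], timestamps: Sequence[float]) -> float:
    """Integrate the expansion rate over the frame intervals to obtain delta S."""
    return sum(rate * (timestamps[k + 1] - timestamps[k]) for k, rate in enumerate(rates))
```

Under these assumptions the integral reduces to the difference between the last and first box areas. A relative expansion rate, or smoothing of the rates before integration, would give a different quantity.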
- the threat determination unit 63 determines the threat degree T of each object by using the detection result by the object detection unit 24. More specifically, the threat determination unit 63 determines the threat degree T based on the traveling direction of each object by using the estimation result by the type estimation unit 35.
- The classes classified by the type estimation unit 35 include the traveling direction of the object. Therefore, for example, when another vehicle is detected by the object detection unit 24 and the other vehicle is a following vehicle or an overtaking vehicle, the threat degree T is set to a larger value than when the other vehicle is an oncoming vehicle. Conversely, when the other vehicle is an oncoming vehicle, the threat degree T is set to a smaller value than when the other vehicle is a following vehicle or an overtaking vehicle.
- the threat determination unit 63 determines the threat degree T of each object by using the analysis result by the time analysis unit 62.
- the threat determination unit 63 executes the following operations for each object.
- The threat determination unit 63 compares the calculated time change amount ΔS with the threshold value ΔSth.
- When the time change amount ΔS exceeds the threshold value ΔSth, the threat degree T is set to a larger value than when the time change amount ΔS is equal to or less than the threshold value ΔSth.
- The threshold value ΔSth is set to a value based on the average value ΔS_ave of the time change amounts ΔS calculated in the past for the corresponding object.
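The comparison against a threshold derived from past values can be sketched as follows. The sketch assumes that ΔSth is the past average ΔS_ave multiplied by a margin factor. The margin, the two threat values, and the behavior when no history exists are all hypothetical, because this document only states that ΔSth is based on ΔS_ave.

```python
from statistics import fmean
from typing import Sequence


def threat_from_size_change(
    delta_s: float,
    past_delta_s: Sequence[float],
    margin: float = 1.5,     # hypothetical factor applied to the past average
    high_threat: float = 1.0,  # hypothetical T when delta S exceeds the threshold
    low_threat: float = 0.2,   # hypothetical T otherwise
) -> float:
    """Return a larger threat degree T when delta S exceeds the threshold delta S_th."""
    if not past_delta_s:
        return low_threat  # assumed behavior: no history, no elevated threat
    threshold = margin * fmean(past_delta_s)
    return high_threat if delta_s > threshold else low_threat
```

Because the threshold follows each object's own history, an object that has been growing steadily in the image is judged against a higher threshold than one that has been nearly constant in size.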
- the spatial analysis unit 64 generates a risk map by spatially analyzing the determination result by the abnormality determination unit 61 and the determination result by the threat determination unit 63.
- the risk map is composed of a plurality of risk values arranged in a two-dimensional manner.
- Each risk value is weighted according to the corresponding degree of abnormality A and according to the corresponding threat degree T.
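A risk map of this kind can be sketched as a two-dimensional grid in which the cells covered by each detected object receive a value derived from that object's A and T. The sketch assumes that the weighting is the product A × T and that overlapping objects keep the larger value. Both choices, together with the names `AnalyzedObject` and `build_risk_map`, are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class AnalyzedObject:
    row0: int  # top row of the region (inclusive)
    col0: int  # left column (inclusive)
    row1: int  # bottom row (exclusive)
    col1: int  # right column (exclusive)
    abnormality: float  # degree of abnormality A
    threat: float       # threat degree T


def build_risk_map(height: int, width: int, objects: Sequence[AnalyzedObject]) -> List[List[float]]:
    """Fill each object's region with A * T, keeping the maximum where regions overlap."""
    risk = [[0.0] * width for _ in range(height)]
    for obj in objects:
        value = obj.abnormality * obj.threat
        for r in range(max(0, obj.row0), min(height, obj.row1)):
            for c in range(max(0, obj.col0), min(width, obj.col1)):
                risk[r][c] = max(risk[r][c], value)
    return risk
```

When only one of A and T is available, the missing factor can be fixed at 1.0 so that the risk value is weighted by the remaining quantity alone.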
- the analysis unit 51 analyzes the detection result by the object detection unit 24.
- The image output control unit 65 outputs an image signal corresponding to the analysis result by the analysis unit 51 to the display 71. As a result, the image output control unit 65 executes control for displaying the image corresponding to the analysis result by the analysis unit 51 on the display 71. Further, the audio output control unit 66 outputs an audio signal corresponding to the analysis result by the analysis unit 51 to the speaker 72. As a result, the audio output control unit 66 executes control for causing the speaker 72 to output the audio corresponding to the analysis result by the analysis unit 51.
- the output control unit 52 outputs a signal corresponding to the analysis result by the analysis unit 51 to the output device 8.
- the signals output by the output control unit 52 may be collectively referred to as “analysis result signal”.
- the image signal output by the image output control unit 65 may indicate an image including a risk map generated by the spatial analysis unit 64 (hereinafter referred to as “risk map image”).
- The risk map image may be displayed on the display 71.
- FIG. 40 shows an example of a risk map image.
- the risk values in the two regions A1 and A2 are set to be higher than the risk values in the other regions.
- the colors in the two regions A1 and A2 are displayed as different colors from the colors in the other regions.
- the two areas A1 and A2 correspond to, for example, two other vehicles, respectively.
- the individual risk values in the risk map are visualized.
- the risk value can be visually presented to the passengers of the own vehicle.
- The reference sign "F11" may be used for the function of the analysis unit 51. Further, the reference sign "F12" may be used for the function of the output control unit 52.
- the processes executed by the object detection device 200 may be collectively referred to as "object detection process, etc.” That is, the object detection process and the like include an image data acquisition process, a first feature amount extraction process, a second feature amount extraction process, and an object detection process. Further, the processes executed by the analysis unit 51 may be collectively referred to as “analysis process”. Further, the processing and control executed by the output control unit 52 may be collectively referred to as "output control”.
- the monitoring device 600 has a processor 81 and a memory 82.
- the memory 82 stores programs corresponding to a plurality of functions F1 to F4, F11, and F12.
- the processor 81 reads out and executes the program stored in the memory 82. As a result, a plurality of functions F1 to F4, F11, and F12 are realized.
- the monitoring device 600 has a processing circuit 83.
- a plurality of functions F1 to F4, F11, and F12 are realized by the dedicated processing circuit 83.
- the monitoring device 600 has a processor 81, a memory 82, and a processing circuit 83 (not shown).
- Some of the plurality of functions F1 to F4, F11, and F12 are realized by the processor 81 and the memory 82, and the remaining functions are realized by the dedicated processing circuit 83.
- the processor 81 is composed of one or more processors.
- the individual processors use, for example, CPUs, GPUs, microprocessors, microcontrollers or DSPs.
- the memory 82 is composed of one or more non-volatile memories.
- the memory 82 is composed of one or more non-volatile memories and one or more volatile memories. That is, the memory 82 is composed of one or more memories.
- the individual memory uses, for example, a semiconductor memory, a magnetic disk, an optical disk, a magneto-optical disk, or a magnetic tape.
- each volatile memory uses, for example, RAM.
- Each non-volatile memory uses, for example, a ROM, flash memory, EPROM, EEPROM, solid state drive, hard disk drive, flexible disk, compact disk, DVD, Blu-ray disk, or mini disk.
- the processing circuit 83 is composed of one or more digital circuits.
- the processing circuit 83 is composed of one or more digital circuits and one or more analog circuits. That is, the processing circuit 83 is composed of one or more processing circuits.
- Each processing circuit uses, for example, an ASIC, PLD, FPGA, SoC or system LSI.
- When the processing circuit 83 is composed of a plurality of processing circuits, the correspondence between the plurality of functions F1 to F4, F11, and F12 and the plurality of processing circuits is arbitrary.
- the monitoring device 600 may have a plurality of processing circuits having a one-to-one correspondence with a plurality of functions F1 to F4, F11, and F12.
- each of the plurality of functions F1 to F4, F11, and F12 may be realized exclusively by the corresponding one processing circuit among the plurality of processing circuits.
- the object detection device 200 executes an object detection process or the like (step ST21).
- the analysis unit 51 executes the analysis process (step ST22).
- the output control unit 52 executes output control (step ST23).
- the monitoring device 600 may have an object detection device 200a instead of the object detection device 200.
- the monitoring system 500 may include a clock 4 and a storage device 5.
- the monitoring device 600 may have an object detection device 200b instead of the object detection device 200.
- the monitoring system 500 may include a locator 6 and a storage device 7.
- the analysis unit 51 may have only one of the abnormality determination unit 61 and the threat determination unit 63.
- when the analysis unit 51 has only the abnormality determination unit 61, the individual risk values in the risk map are weighted by the corresponding abnormality degree A.
- when the analysis unit 51 has only the threat determination unit 63, the individual risk values in the risk map are weighted by the corresponding threat degree T.
- the threat determination unit 63 may execute only one of two determinations: the determination of the threat degree T based on the estimation result by the type estimation unit 35, or the determination of the threat degree T based on the analysis result by the time analysis unit 62.
- the output control unit 52 may have only one of the image output control unit 65 and the audio output control unit 66.
- when the output control unit 52 has only the image output control unit 65, the output device 8 may include only the display 71, out of the display 71 and the speaker 72.
- when the output control unit 52 has only the audio output control unit 66, the output device 8 may include only the speaker 72, out of the display 71 and the speaker 72.
- the time analysis unit 62 temporally analyzes the detection result by the object detection unit 24. To support such analysis, the object detection device 200, the object detection device 200a, or the object detection device 200b in the monitoring device 600 may be configured as follows.
- the image data acquisition unit 21 may acquire image data corresponding to a plurality of temporally consecutive captured images (that is, still images for a plurality of frames). That is, the image data acquisition unit 21 may acquire time-series data.
- the first feature amount extraction unit 22 may use the acquired time-series data to generate a feature map (that is, the first feature map FM1) that includes temporal information. Likewise, the second feature amount extraction unit 23 may use the acquired time-series data to generate feature maps (that is, the individual second feature maps FM2 and the individual third feature maps FM3) that include temporal information.
- the first neural network NN1 may have a structure for processing the acquired time-series data in a time-series manner.
- the CNN in the first neural network NN1 may be one using an LSTM (Long Short Term Memory) network.
- the monitoring device 600 includes the object detection device 200, the object detection device 200a, or the object detection device 200b; the analysis unit 51, which analyzes the detection result by the object detection unit 24; and the output control unit 52, which outputs an analysis result signal corresponding to the analysis result by the analysis unit 51. This makes it possible to realize monitoring based on the result of highly accurate object detection.
- the object detection device, monitoring device and learning device according to the present disclosure can be used, for example, for an electronic mirror.
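
The monitoring flow listed above (object detection in step ST21, analysis in step ST22, output control in step ST23) can be sketched as three chained functions. This is only an illustrative sketch: the function names, the dictionary layout of a detection, the 2x2 grid, and the rule of accumulating the abnormality degree A into a risk-map cell are all assumptions made for the example, not details taken from the disclosure.

```python
def detect_objects(frame):
    """Stand-in for step ST21: return the detections for one frame.

    A real device would run the feature extraction units and the object
    detection unit here; this sketch just reads precomputed values.
    """
    return frame["detections"]


def analyze(detections, grid=(2, 2)):
    """Stand-in for step ST22: build a risk map weighted by abnormality A."""
    rows, cols = grid
    risk_map = [[0.0] * cols for _ in range(rows)]
    for det in detections:
        r, c = det["cell"]                    # assumed grid cell of the object
        risk_map[r][c] += det["abnormality"]  # assumed weighting rule
    return risk_map


def output_control(risk_map):
    """Stand-in for step ST23: turn the analysis result into an output signal."""
    return "\n".join(" ".join(f"{v:.2f}" for v in row) for row in risk_map)


# Made-up input: three detections with hypothetical abnormality degrees.
frame = {
    "detections": [
        {"cell": (0, 1), "abnormality": 0.5},
        {"cell": (1, 0), "abnormality": 0.25},
        {"cell": (0, 1), "abnormality": 0.25},
    ]
}
signal = output_control(analyze(detect_objects(frame)))
print(signal)
```

Keeping the three steps as separate functions mirrors the separation between the object detection device 200, the analysis unit 51, and the output control unit 52.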
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Quality & Reliability (AREA)
- Image Analysis (AREA)
Abstract
Description
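FIG. 1 is a block diagram showing the main part of an object detection system that includes the object detection device according to Embodiment 1. FIG. 2 is a block diagram showing the main parts of the first feature amount extraction unit, the second feature amount extraction unit, and the object detection unit in the object detection device according to Embodiment 1. The object detection system that includes the object detection device according to Embodiment 1 is described with reference to FIGS. 1 and 2.
International Publication No. 2018/051459
A specific example of how the third feature map generation unit 33 generates the plurality of third feature maps FM3 is described below.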
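For example, the third feature map generation unit 33 performs an addition that adds each first feature amount in the first feature map FM1 to the corresponding second feature amount in each second feature map FM2. Specifically, the third feature map generation unit 33 first duplicates the single first feature map FM1 as many times as there are second feature maps FM2. It then associates each duplicated first feature map FM1 with one of the second feature maps FM2 and adds them pixel by pixel, layer by layer. In other words, the third feature map generation unit 33 spatially adds the first feature map FM1 and the second feature maps FM2.
In this way, the third feature map generation unit 33 weights the second feature maps FM2 using the first feature map FM1. That is, it weights the corresponding second feature amounts in each second feature map FM2.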
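
As a minimal sketch of the pixel-wise addition described above, the following pure-Python example adds one first feature map FM1 to each of several second feature maps FM2. The helper name `add_fm1_to_fm2`, the nested-list representation, and the sample values are illustrative assumptions, not part of the disclosure.

```python
def add_fm1_to_fm2(fm1, fm2_list):
    """Add FM1 to every FM2, pixel by pixel (hypothetical helper)."""
    fm3_list = []
    for fm2 in fm2_list:
        # Reusing fm1 for each fm2 has the same effect as duplicating FM1
        # once per second feature map.
        fm3 = [[a + b for a, b in zip(row1, row2)]
               for row1, row2 in zip(fm1, fm2)]
        fm3_list.append(fm3)
    return fm3_list


# One 2x2 first feature map and two 2x2 second feature maps (made-up values).
fm1 = [[0.0, 1.0],
       [0.0, 0.5]]
fm2_list = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
fm3_list = add_fm1_to_fm2(fm1, fm2_list)
print(fm3_list)
```

The number of output maps equals the number of second feature maps, so this method leaves the channel count unchanged.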
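For example, the third feature map generation unit 33 performs a multiplication that multiplies each first feature amount in the first feature map FM1 into the corresponding second feature amount in each second feature map FM2. Specifically, the third feature map generation unit 33 first duplicates the single first feature map FM1 as many times as there are second feature maps FM2. It then associates each duplicated first feature map FM1 with one of the second feature maps FM2 and multiplies them pixel by pixel, layer by layer. In other words, the third feature map generation unit 33 spatially multiplies the first feature map FM1 and the second feature maps FM2.
In this way, the third feature map generation unit 33 weights the second feature maps FM2 using the first feature map FM1. That is, it weights the corresponding second feature amounts in each second feature map FM2.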
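
The pixel-wise multiplication described above can be sketched in the same way. In this hypothetical example (the helper name `multiply_fm1_into_fm2` and all values are assumptions), positions where the first feature map is 0 suppress the second feature amounts entirely.

```python
def multiply_fm1_into_fm2(fm1, fm2_list):
    """Multiply FM1 into every FM2, pixel by pixel (hypothetical helper)."""
    return [
        [[a * b for a, b in zip(row1, row2)] for row1, row2 in zip(fm1, fm2)]
        for fm2 in fm2_list
    ]


# Made-up values: the left column of FM1 is 0 (no objectness there).
fm1 = [[0.0, 1.0],
       [0.0, 0.5]]
fm2_list = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
weighted = multiply_fm1_into_fm2(fm1, fm2_list)
print(weighted)
```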
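This generation method presupposes that the fourth feature map generation unit 36 of the first feature amount extraction unit 22 has generated, from the first feature map FM1 generated by the first feature map generation unit 31, a plurality of fourth feature maps FM4 corresponding to that first feature map.
For example, the third feature map generation unit 33 performs an addition that adds each fourth feature amount in a fourth feature map FM4 to the corresponding second feature amount in the second feature map FM2 that corresponds to that fourth feature map. Specifically, the third feature map generation unit 33 associates each fourth feature map FM4 with one of the second feature maps FM2 and adds them pixel by pixel, layer by layer. In other words, the third feature map generation unit 33 spatially adds the fourth feature maps FM4 and the second feature maps FM2.
In this way, the third feature map generation unit 33 weights the second feature maps FM2 using the first feature map FM1 or, more precisely, using the fourth feature maps FM4 generated from the first feature map FM1. That is, it weights the corresponding second feature amounts in each second feature map FM2.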
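As with <Generation method by addition (2)> above, this generation method also presupposes that the fourth feature map generation unit 36 of the first feature amount extraction unit 22 has generated, from the first feature map FM1 generated by the first feature map generation unit 31, a plurality of fourth feature maps FM4 corresponding to that first feature map.
For example, the third feature map generation unit 33 performs a multiplication that multiplies each fourth feature amount in a fourth feature map FM4 into the corresponding second feature amount in each second feature map FM2. Specifically, the third feature map generation unit 33 associates each fourth feature map FM4 with one of the second feature maps FM2 and multiplies them pixel by pixel, layer by layer. In other words, the third feature map generation unit 33 spatially multiplies the fourth feature maps FM4 and the second feature maps FM2.
In this way, the third feature map generation unit 33 weights the second feature maps FM2 using the first feature map FM1 or, more precisely, using the fourth feature maps FM4 generated from the first feature map FM1. That is, it weights the corresponding second feature amounts in each second feature map FM2.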
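
The two method (2) variants can be sketched as follows: several fourth feature maps FM4 are first derived from one FM1 by convolution, and each FM4 is then paired with its own FM2. This is a sketch under stated assumptions: the two fixed 3x3 kernels stand in for learned convolution weights, and the helper names `conv2d_same` and `combine` are hypothetical.

```python
import operator


def conv2d_same(img, kernel):
    """Naive 2-D cross-correlation with zero padding (output size = input size)."""
    h, w = len(img), len(img[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - ph, x + j - pw
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += img[yy][xx] * kernel[i][j]
            out[y][x] = acc
    return out


def combine(fm4_list, fm2_list, op):
    """Pair each FM4 with its FM2 and apply op pixel by pixel."""
    return [
        [[op(a, b) for a, b in zip(r4, r2)] for r4, r2 in zip(fm4, fm2)]
        for fm4, fm2 in zip(fm4_list, fm2_list)
    ]


# Made-up FM1 with a single activated pixel in the centre.
fm1 = [[0.0, 0.0, 0.0],
       [0.0, 1.0, 0.0],
       [0.0, 0.0, 0.0]]

# Assumed kernels (a real network would learn these): identity and 3x3 sum.
kernels = [
    [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],
    [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
]
fm4_list = [conv2d_same(fm1, k) for k in kernels]

# Two made-up second feature maps filled with 2.0.
fm2_list = [[[2.0] * 3 for _ in range(3)] for _ in range(2)]

added = combine(fm4_list, fm2_list, operator.add)       # addition (2)
multiplied = combine(fm4_list, fm2_list, operator.mul)  # multiplication (2)
print(added)
print(multiplied)
```

Because each FM4 comes from a different kernel, each FM2 is weighted by a differently shaped version of the same objectness information.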
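For example, the third feature map generation unit 33 performs an addition that appends the first feature map FM1 to the plurality of second feature maps FM2 in the dimension direction, in other words, in the channel direction. Put differently, the third feature map generation unit 33 concatenates the first feature map FM1 with the plurality of second feature maps FM2 in the dimension direction. Specifically, the third feature map generation unit 33 duplicates the single first feature map FM1, for example, as many times as there are second feature maps FM2. It then appends the duplicated first feature maps FM1 to the plurality of second feature maps FM2 in the dimension direction.
In this way, the third feature map generation unit 33 weights the second feature maps FM2 using the first feature map FM1. That is, it applies to each second feature map FM2 a weighting that increases the number of dimensions.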
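
The channel-direction concatenation described above can be sketched as a list append. The helper name `concat_channels`, the `copies` parameter, and the sample values are assumptions made for illustration.

```python
def concat_channels(fm1, fm2_list, copies=None):
    """Append duplicated copies of FM1 after the FM2 channels (hypothetical helper)."""
    if copies is None:
        copies = len(fm2_list)  # duplicate FM1 once per second feature map
    duplicated = [[row[:] for row in fm1] for _ in range(copies)]
    return fm2_list + duplicated


# Made-up 2x2 maps: one FM1 and two FM2 channels.
fm1 = [[0.0, 1.0],
       [0.0, 0.5]]
fm2_list = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
fm3 = concat_channels(fm1, fm2_list)
print(len(fm2_list), "->", len(fm3))  # channel count before and after
```

Unlike the pixel-wise methods, no second feature amount is changed here; only the number of channels grows.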
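By setting the importance W using the SSIM index, which evaluates the structure of an object, or a pixel-wise correlation similarity index, the third feature map generation unit 33 can improve the accuracy of object detection in the captured image using the third feature maps FM3. Note that the detection of objects in the captured image using the third feature maps FM3 is performed by the object detection unit 24.
That is, the learning unit 25 generates a trained model (hereinafter referred to as the "machine learning model") that takes as input the image data acquired by the image data acquisition unit 21 and outputs the detection results for the individual objects in each captured image. A plurality of parameter sets are set in the machine learning model. Each parameter set includes trained parameters for the first neural network NN1 and trained parameters for the second neural network NN2.
Specifically, the detection results for the individual objects in each captured image are the estimated position of each object in each captured image and the estimated type of each object. The machine learning model is stored, for example, in a storage device (not shown).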
Mingxing Tan, Quoc Le, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks" Proceedings of the 36th International Conference on Machine Learning, PMLR 97:6105-6114, 2019, http://proceedings.mlr.press/v97/tan19a/tan19a.pdf
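In FIGS. 13 to 21, "Input image" denotes the captured image or training image represented by the image data acquired by the image data acquisition unit 21. In Embodiment 1, as described above, the camera 1 is a camera for an electronic mirror and is mounted on a vehicle. In FIGS. 13 to 21, however, for convenience, the image data acquired by the image data acquisition unit 21 is assumed to be image data captured by a camera 1 configured as, for example, a surveillance camera that images a coast. "Saliency Map" denotes the first feature map FM1 generated by the first feature map generation unit 31. "Feature Map" denotes an individual second feature map FM2 generated by the second feature map generation unit 32, an individual third feature map FM3 generated by the third feature map generation unit 33, or an individual fourth feature map FM4 generated by the fourth feature map generation unit 36.
FIG. 14 is a diagram for explaining how the third feature maps FM3 are generated using <Generation method by multiplication (1)> described above.
FIGS. 13 and 14 illustrate a case in which, for example, the first feature map FM1 is used only in the first saliency block layer L1: the corresponding second feature map FM2 among the plurality of second feature maps FM2 is generated, the generated second feature map FM2 is weighted, and the third feature map FM3 is generated.
As shown in FIGS. 13 and 14, in each first feature map FM1, the region corresponding to the object to be detected (here, a person) is activated. In the first feature map FM1, large values are set for the first feature amounts in the activated regions. Note that, in the first feature map FM1, regions corresponding to small, distant objects are also activated.
In contrast, as shown for example in FIG. 14, when the second feature map FM2 and the first feature map FM1 are multiplied so that the first feature amounts are spatially multiplied into the second feature amounts, redundant second feature amounts are discarded. In the first feature map FM1, first feature amounts that are meaningless for object detection are set to, for example, "0". Multiplying by "0" makes the second feature amount "0". This prevents the learning of foreground objects from being hindered.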
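FIG. 16 is a diagram for explaining how the third feature maps FM3 are generated using <Generation method by multiplication (2)> described above.
FIGS. 15 and 16 illustrate a case in which, for example, the first feature map FM1 is used only in the first saliency block layer L1: the corresponding second feature map FM2 among the plurality of second feature maps FM2 is generated, the generated second feature map FM2 is weighted, and the third feature map FM3 is generated.
As shown in FIGS. 15 and 16, in the first feature map FM1, the region corresponding to the object to be detected (here, a person) is activated. A plurality of fourth feature maps FM4 are generated from this first feature map FM1. Because the fourth feature maps FM4 are generated by convolution, each is a feature map that captures feature amounts in a different way. Note that the convolution operations that the fourth feature map generation unit 36 performs to generate the plurality of fourth feature maps FM4 are the same as those that the second feature amount extraction unit 23 performs when generating the plurality of second feature maps FM2.
Also, as shown for example in FIG. 16, multiplying each fourth feature amount in each fourth feature map FM4 by the corresponding second feature amount in the corresponding second feature map FM2 results in multiplications over combinations of feature amounts with different variations. This realizes a more sophisticated spatial multiplication than <Generation method by multiplication (1)>, shown in FIG. 14, in which a single first feature map FM1 is duplicated and each copy is multiplied into a second feature map FM2.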
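The generation is not limited to this. The third feature maps FM3 may instead be generated, for example, in each saliency block layer L1 using <Generation method by addition (1)>, <Generation method by multiplication (1)>, <Generation method by addition (2)>, or <Generation method by multiplication (2)>.
FIG. 17 is a diagram illustrating how the third feature maps FM3 are generated in each saliency block layer L1 using <Generation method by addition (1)> described above. In each saliency block layer L1, the third feature maps FM3 are generated as illustrated in FIG. 17.
FIG. 18 is a diagram illustrating how the third feature maps FM3 are generated in each saliency block layer L1 using <Generation method by multiplication (1)> described above. In each saliency block layer L1, the third feature maps FM3 are generated as illustrated in FIG. 18.
FIG. 19 is a diagram illustrating how the third feature maps FM3 are generated in each saliency block layer L1 using <Generation method by addition (2)> described above. In each saliency block layer L1, the third feature maps FM3 are generated as illustrated in FIG. 19.
FIG. 20 is a diagram illustrating how the third feature maps FM3 are generated in each saliency block layer L1 using <Generation method by multiplication (2)> described above. In each saliency block layer L1, the third feature maps FM3 are generated as illustrated in FIG. 20.
FIG. 21 illustrates how the third feature maps FM3 are generated in each saliency block layer L1 by <Generation method by addition (3)> described above.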
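Each first feature map FM1, in which the region corresponding to the object to be detected (here, a person) is activated, is appended after the plurality of second feature maps FM2 in the dimension direction.
<Generation method by addition (3)> does not spatially add the first feature amounts to the second feature amounts. Instead, it aims to weight the second feature maps FM2 by increasing the variety of feature maps.
For example, suppose that the first feature map FM1 and the second feature map FM2 are each 500-dimensional feature maps. In that case, with <Generation method by addition (1)> described above, for example, the generated third feature map FM3 is a 500-dimensional feature map, and the number of dimensions does not change. With <Generation method by addition (3)>, in contrast, the generated third feature map FM3 is a 1000-dimensional feature map. That is, the number of feature maps increases in the dimension direction. The generated 1000-dimensional third feature map FM3 is further convolved in the next saliency block layer L1, which produces a third feature map FM3 with an even richer variety of feature amounts.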
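
The dimension counts in the example above follow directly from how each method combines the maps. The following sketch reproduces that arithmetic; the function name `fm3_channel_count` and the method labels "add", "multiply", and "concat" are hypothetical names chosen for this example.

```python
def fm3_channel_count(fm1_channels, fm2_channels, method):
    """Return the channel count of FM3 for a given combination method."""
    if method in ("add", "multiply"):
        # Pixel-wise methods pair each FM1 channel with one FM2 channel,
        # so both inputs must have the same number of channels.
        if fm1_channels != fm2_channels:
            raise ValueError("pixel-wise methods need equal channel counts")
        return fm2_channels
    if method == "concat":
        # Concatenation stacks the FM1 channels after the FM2 channels.
        return fm1_channels + fm2_channels
    raise ValueError(f"unknown method: {method}")


print(fm3_channel_count(500, 500, "add"))     # addition (1)
print(fm3_channel_count(500, 500, "concat"))  # addition (3)
```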
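Specifically, for example, Small is the range containing objects smaller than 32×32 pixels, Medium is the range containing objects larger than 32×32 pixels and smaller than 96×96 pixels, and Large is the range containing objects larger than 96×96 pixels.
The object detection device 200 can acquire feature amounts sufficient for detecting small objects while reducing the amount of computation. Using the object detection device 200 therefore makes it possible both to reduce the amount of computation and to detect small objects.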
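
The size ranges above can be expressed as a small classification function. This is a sketch under two assumptions that the text does not state: object size is measured as bounding-box area in pixels, and objects exactly on a boundary (32×32 or 96×96) fall into the larger category. The name `size_category` is hypothetical.

```python
SMALL_LIMIT = 32 * 32   # area of a 32x32-pixel box
MEDIUM_LIMIT = 96 * 96  # area of a 96x96-pixel box


def size_category(width, height):
    """Classify a bounding box as Small, Medium, or Large by pixel area."""
    area = width * height
    if area < SMALL_LIMIT:
        return "Small"
    if area < MEDIUM_LIMIT:
        return "Medium"
    return "Large"  # assumption: boundary cases go to the larger range


for box in [(20, 20), (50, 50), (100, 100)]:
    print(box, size_category(*box))
```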
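A distance image or a sonar image is obtained from, for example, a distance sensor, a millimeter-wave radar, a sonar sensor, or an infrared sensor. A temperature image is obtained from, for example, a thermal sensor. Because a distance sensor can correctly measure the distance to an object, using a distance image obtained from the distance sensor increases the accuracy of the first feature map, which represents objectness. A millimeter-wave radar can accurately measure the distance to an object even in bad weather. A sonar sensor or an infrared sensor is inexpensive and can measure the position of objects at short range. A thermal sensor is suited to imaging at night.
By generating the first feature map using, for example, a distance image, a sonar image, or a temperature image, the first feature map generation unit 31 can generate a first feature map suited to the features to be extracted, as described above, and can also generate a highly anonymous first feature map from the viewpoint of privacy protection.
FIG. 31 is a diagram showing an example of a heat map, serving as the first feature map, that the first feature map generation unit 31 generated using the temperature images corresponding to the individual captured images. Because regions corresponding to people are activated in a heat map, it is well suited for use as the first feature map when people are to be detected. In addition, a heat map generated from a temperature image is better at detecting people at night than a first feature map generated from a captured image.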
Also, for example, object detection in the object detection unit 24 may be based on EfficientDet (see Reference 3 below).
Mingxing Tan, Ruoming Pang, Quoc V. Le,"EfficientDet: Scalable and Efficient Object Detection"; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 10781-10790
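FIG. 32 is a block diagram showing the main part of an object detection system that includes the object detection device according to Embodiment 2. The object detection system that includes the object detection device according to Embodiment 2 is described with reference to FIG. 32. In FIG. 32, blocks that are the same as those shown in FIG. 1 are given the same reference signs, and their description is omitted.
FIG. 35 is a block diagram showing the main part of an object detection system that includes the object detection device according to Embodiment 3. The object detection system that includes the object detection device according to Embodiment 3 is described with reference to FIG. 35. In FIG. 35, blocks that are the same as those shown in FIG. 1 are given the same reference signs, and their description is omitted.
FIG. 38 is a block diagram showing the main part of a monitoring system that includes the monitoring device according to Embodiment 4. FIG. 39 is a block diagram showing the main parts of the analysis unit and the output control unit in the monitoring device according to Embodiment 4. The monitoring system that includes the monitoring device according to Embodiment 4 is described with reference to FIGS. 38 and 39. In FIG. 38, blocks that are the same as those shown in FIG. 1 are given the same reference signs, and their description is omitted.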
Claims (29)
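- An object detection device comprising:
  an image data acquisition unit that acquires image data representing an image captured by a camera;
  a first feature amount extraction unit that generates a first feature map using the image data;
  a second feature amount extraction unit that generates a second feature map using the image data and generates a third feature map by performing, on the second feature map, an addition or a multiplication using the first feature map, thereby weighting the second feature map; and
  an object detection unit that detects an object in the captured image using the third feature map,
  wherein a first feature amount in the first feature map uses a mid-level feature corresponding to objectness, and
  a second feature amount in the second feature map uses a high-level feature.
- The object detection device according to claim 1, wherein the second feature amount extraction unit performs the weighting by an addition that adds each first feature amount in the first feature map to the corresponding second feature amount in each second feature map.
- The object detection device according to claim 1, wherein the second feature amount extraction unit performs the weighting by a multiplication that multiplies each first feature amount in the first feature map into the corresponding second feature amount in each second feature map.
- The object detection device according to claim 1, wherein the first feature amount extraction unit generates, from the first feature map, a plurality of fourth feature maps each composed of different fourth feature amounts, and
  the second feature amount extraction unit performs the weighting by an addition that adds each fourth feature amount in a fourth feature map to the corresponding second feature amount in the second feature map corresponding to that fourth feature map.
- The object detection device according to claim 1, wherein the first feature amount extraction unit generates, from the first feature map, a plurality of fourth feature maps each composed of different fourth feature amounts, and
  the second feature amount extraction unit performs the weighting by a multiplication that multiplies each fourth feature amount in a fourth feature map into the corresponding second feature amount in the second feature map corresponding to that fourth feature map.
- The object detection device according to claim 1, wherein the second feature amount extraction unit performs the weighting by an addition that appends the first feature map to the second feature map in the dimension direction.
- The object detection device according to claim 1, wherein the first feature amount extraction unit is trainable by unsupervised learning.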
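- The object detection device according to claim 1, wherein the second feature amount extraction unit is trainable by supervised learning.
- The object detection device according to claim 8, wherein the second feature amount extraction unit generates the second feature map using a convolutional neural network.
- The object detection device according to claim 9, wherein the second feature amount extraction unit is trainable by deep learning.
- The object detection device according to claim 7, wherein the first feature map generated by the first feature amount extraction unit is at least one of a saliency map based on a captured image serving as the image data, a depth map based on a distance image or a sonar image serving as the image data, and a heat map based on a thermal image serving as the image data.
- The object detection device according to any one of claims 2 to 5, wherein the second feature amount extraction unit sets an importance in the weighting based on at least one of structural similarity and image similarity correlation.
- The object detection device according to claim 1, wherein the weighting reinforces each second feature amount in each second feature map according to the corresponding objectness.
- The object detection device according to claim 1, wherein the object detection unit detects the object by executing a plurality of convolution operations with mutually different kernel sizes.
- The object detection device according to claim 1, wherein the object detection unit is trainable by supervised learning.
- The object detection device according to claim 15, wherein the object detection unit estimates the position of the object by regression and estimates the type of the object by classification.
- The object detection device according to claim 16, wherein the type of the object includes the traveling direction of the object.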
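- The object detection device according to claim 8, further comprising:
  a time information acquisition unit that acquires time information; and
  a parameter selection unit that selects, from the parameter sets included in a time-specific trained parameter database, the parameter set corresponding to the time indicated by the time information,
  wherein the second feature amount extraction unit generates the second feature map and the third feature map using the trained parameters included in the parameter set selected by the parameter selection unit.
- The object detection device according to claim 8, further comprising:
  a location information acquisition unit that acquires location information; and
  a parameter selection unit that selects, from the parameter sets included in a location-specific trained parameter database, the parameter set corresponding to the location indicated by the location information,
  wherein the second feature amount extraction unit generates the second feature map and the third feature map using the trained parameters included in the parameter set selected by the parameter selection unit.
- A monitoring device comprising:
  the object detection device according to claim 1;
  an analysis unit that analyzes the detection result by the object detection unit; and
  an output control unit that outputs an analysis result signal corresponding to the analysis result by the analysis unit.
- The monitoring device according to claim 20, wherein the analysis unit has at least one of an abnormality determination unit that determines an abnormality degree of the object and a threat determination unit that determines a threat degree of the object.
- The monitoring device according to claim 21, wherein the abnormality determination unit determines the abnormality degree based on the position of the object indicated by the detection result by the object detection unit.
- The monitoring device according to claim 21, wherein the threat determination unit determines the threat degree based on the traveling direction of the object indicated by the detection result by the object detection unit.
- The monitoring device according to claim 21, wherein the threat determination unit determines the threat degree based on the amount of change over time in the size of the object in the captured image.
- The monitoring device according to claim 24, wherein the analysis unit has a time analysis unit that calculates the amount of change over time by temporally analyzing the detection result by the object detection unit.
- The monitoring device according to claim 21, wherein the analysis unit has a spatial analysis unit that generates a risk map by spatially analyzing at least one of the determination result by the abnormality determination unit and the determination result by the threat determination unit.
- The monitoring device according to claim 26, wherein the output control unit outputs the analysis result signal to a display, thereby causing the display to show a risk map image corresponding to the risk map.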
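- A learning device comprising:
  an image data acquisition unit that acquires image data representing a training image;
  a first feature amount extraction unit that generates a first feature map using the image data;
  a second feature amount extraction unit that generates a second feature map using the image data and generates a third feature map by performing, on the second feature map, an addition or a multiplication using the first feature map, thereby weighting the second feature map;
  an object detection unit that detects an object in the training image using the third feature map; and
  a learning unit that trains the second feature amount extraction unit and the object detection unit according to the detection result by the object detection unit,
  wherein a first feature amount in the first feature map uses a mid-level feature corresponding to objectness, and
  a second feature amount in the second feature map uses a high-level feature.
- A model generation method comprising:
  a step in which an image data acquisition unit acquires image data representing a training image;
  a step in which a first feature amount extraction unit generates a first feature map using the image data;
  a step in which a second feature amount extraction unit generates a second feature map using the image data and generates a third feature map by performing, on the second feature map, an operation using the first feature map, thereby weighting the second feature map;
  a step in which an object detection unit detects an object in the training image using the third feature map; and
  a step in which a learning unit trains the second feature amount extraction unit and the object detection unit according to the detection result by the object detection unit, thereby generating a machine learning model that takes the image data as input and outputs the detection result for the object,
  wherein a first feature amount in the first feature map uses a mid-level feature corresponding to objectness, and
  a second feature amount in the second feature map uses a high-level feature.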
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP20966963.9A EP4270301A4 (en) | 2020-12-25 | 2020-12-25 | OBJECT DETECTION DEVICE, MONITORING DEVICE, LEARNING DEVICE AND MODEL GENERATION METHOD |
CN202080108058.1A CN116686001A (zh) | 2020-12-25 | 2020-12-25 | 物体检测装置、监视装置、学习装置以及模型生成方法 |
US18/037,020 US20230410532A1 (en) | 2020-12-25 | 2020-12-25 | Object detection device, monitoring device, training device, and model generation method |
PCT/JP2020/048617 WO2022137476A1 (ja) | 2020-12-25 | 2020-12-25 | 物体検出装置、モニタリング装置、学習装置、及び、モデル生成方法 |
JP2022570922A JP7361949B2 (ja) | 2020-12-25 | 2020-12-25 | 物体検出装置、モニタリング装置、学習装置、及び、モデル生成方法 |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2020/048617 WO2022137476A1 (ja) | 2020-12-25 | 2020-12-25 | 物体検出装置、モニタリング装置、学習装置、及び、モデル生成方法 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022137476A1 true WO2022137476A1 (ja) | 2022-06-30 |
Family
ID=82157437
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2020/048617 WO2022137476A1 (ja) | 2020-12-25 | 2020-12-25 | 物体検出装置、モニタリング装置、学習装置、及び、モデル生成方法 |
Country Status (5)
Country | Link |
---|---|
US (1) | US20230410532A1 (ja) |
EP (1) | EP4270301A4 (ja) |
JP (1) | JP7361949B2 (ja) |
CN (1) | CN116686001A (ja) |
WO (1) | WO2022137476A1 (ja) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2018147431A (ja) * | 2017-03-09 | 2018-09-20 | コニカミノルタ株式会社 | 画像認識装置及び画像認識方法 |
JP2020047270A (ja) * | 2018-09-17 | 2020-03-26 | 株式会社ストラドビジョン | マルチフィーディングを適用した学習方法及び学習装置並びにそれを利用したテスト方法及びテスト装置 |
JP2020113000A (ja) * | 2019-01-10 | 2020-07-27 | 日本電信電話株式会社 | 物体検出認識装置、方法、及びプログラム |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP7031081B2 (ja) | 2019-12-25 | 2022-03-07 | 三菱電機株式会社 | 物体検出装置、モニタリング装置及び学習装置 |
-
2020
- 2020-12-25 JP JP2022570922A patent/JP7361949B2/ja active Active
- 2020-12-25 US US18/037,020 patent/US20230410532A1/en active Pending
- 2020-12-25 EP EP20966963.9A patent/EP4270301A4/en active Pending
- 2020-12-25 WO PCT/JP2020/048617 patent/WO2022137476A1/ja active Application Filing
- 2020-12-25 CN CN202080108058.1A patent/CN116686001A/zh active Pending
Non-Patent Citations (4)
Title |
---|
MINGXING TAN, RUOMING PANGQUOC V. LE: "EfficientDet: Scalable and Efficient Object Detection", PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR, 2020, pages 10781 - 10790 |
MINGXING TANQUOC LE: "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks", PROCEEDINGS OF THE 36TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING, PMLR, vol. 97, 2019, pages 6105 - 6114 |
See also references of EP4270301A4 |
WEI LIUDRAGOMIR ANGUELOVDUMITRU ERHANCHRISTIAN SZEGEDYSCOTT REEDCHENG-YANG FUALEXANDER C. BERG, SSD: SINGLE SHOT MULTIBOX DETECTOR, vol. 5, 29 December 2016 (2016-12-29), Retrieved from the Internet <URL:https://arxiv.org/pdf/1512.02325v5.pdf> |
Also Published As
Publication number | Publication date |
---|---|
JP7361949B2 (ja) | 2023-10-16 |
EP4270301A4 (en) | 2024-01-24 |
US20230410532A1 (en) | 2023-12-21 |
EP4270301A1 (en) | 2023-11-01 |
JPWO2022137476A1 (ja) | 2022-06-30 |
CN116686001A (zh) | 2023-09-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102596388B1 (ko) | 이동체의 이동 속성 획득 방법 및 이를 수행하는 장치 | |
Rani | LittleYOLO-SPP: A delicate real-time vehicle detection algorithm | |
US20220375238A1 (en) | Three dimensional (3d) object detection | |
JP2020052694A (ja) | 物体検出装置、物体検出方法及び物体検出用コンピュータプログラム | |
CN110569792A (zh) | 一种基于卷积神经网络的自动驾驶汽车前方物体检测方法 | |
JP2008535038A (ja) | カメラによりシーンに関して取得された映像中の移動物体を追跡する方法 | |
CN113409361B (zh) | 一种多目标跟踪方法、装置、计算机及存储介质 | |
CN116188999B (zh) | 一种基于可见光和红外图像数据融合的小目标检测方法 | |
CN112348116B (zh) | 利用空间上下文的目标检测方法、装置和计算机设备 | |
Singh et al. | Vehicle detection and accident prediction in sand/dust storms | |
Qiao et al. | Marine vessel re-identification: A large-scale dataset and global-and-local fusion-based discriminative feature learning | |
Rashed et al. | Bev-modnet: Monocular camera based bird's eye view moving object detection for autonomous driving | |
Aditya et al. | Collision Detection: An Improved Deep Learning Approach Using SENet and ResNext | |
JP7031081B2 (ja) | 物体検出装置、モニタリング装置及び学習装置 | |
WO2022137476A1 (ja) | 物体検出装置、モニタリング装置、学習装置、及び、モデル生成方法 | |
Al Mamun et al. | Efficient lane marking detection using deep learning technique with differential and cross-entropy loss. | |
Hafeezallah et al. | Multi-Scale Network with Integrated Attention Unit for Crowd Counting. | |
Tourani et al. | Challenges of video-based vehicle detection and tracking in intelligent transportation systems | |
Zhang et al. | LanePainter: lane marks enhancement via generative adversarial network | |
Schennings | Deep convolutional neural networks for real-time single frame monocular depth estimation | |
SR | OBJECT DETECTION, TRACKING AND BEHAVIOURAL ANALYSIS FOR STATIC AND MOVING BACKGROUND. | |
Athikam et al. | Road Navigator Identification using Deep Learning Techniques | |
Kovačić et al. | Measurement of road traffic parameters based on multi-vehicle tracking | |
KR102454878B1 (ko) | 이동체의 이동 속성 획득 방법 및 이를 수행하는 장치 | |
Pandya et al. | A novel approach for vehicle detection and classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20966963 Country of ref document: EP Kind code of ref document: A1 |
 | ENP | Entry into the national phase | Ref document number: 2022570922 Country of ref document: JP Kind code of ref document: A |
 | WWE | Wipo information: entry into national phase | Ref document number: 202080108058.1 Country of ref document: CN |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | ENP | Entry into the national phase | Ref document number: 2020966963 Country of ref document: EP Effective date: 20230725 |