WO2023190644A1 - Performance indexing device, performance indexing method, and program - Google Patents

Performance indexing device, performance indexing method, and program Download PDF

Info

Publication number
WO2023190644A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
likelihood
detection
image
value
Prior art date
Application number
PCT/JP2023/012736
Other languages
French (fr)
Japanese (ja)
Inventor
洋一 小倉
晋也 松山
健志 緑川
直大 岩橋
肇 片山
Original Assignee
Nuvoton Technology Corporation Japan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nuvoton Technology Corporation Japan
Publication of WO2023190644A1 publication Critical patent/WO2023190644A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present invention relates to a performance indexing device, a performance indexing method, and a program for accurately analyzing the performance of a model that detects an object in an image, and weaknesses in the versatility and robustness of a model learning dictionary, as well as reinforcement policies.
  • AI (artificial intelligence) is modeled on neurons in the human brain, and a wide variety of AI models have been developed to detect objects in images.
  • A certain level of quality is required of augmented training data; if that quality is not met, the augmented data becomes noise and may reduce the quality and efficiency of learning. A method for improving the quality of augmented learning data has therefore been proposed, comprising means for determining editing parameters for each of a plurality of pieces of learning data obtained by editing original data representing a judgment target, means for generating from the original data, based on those parameters, a plurality of pieces of learning data each representing the judgment target, and means for training a model using each of the plurality of pieces of learning data (see Patent Document 1).
  • The present invention has been made in view of the above-mentioned problems, and its purpose is to provide a performance indexing device, a performance indexing method, and a program for accurately analyzing the performance of a model that detects objects in images, as well as weaknesses in the versatility and robustness of a model learning dictionary and policies for reinforcing them.
  • A performance indexing device according to one aspect of the present invention is a performance indexing device for an object detection model, and includes: an image processing means that acquires an image and processes it appropriately; a model preprocessing means that processes the image acquired by the image processing means into a plurality of images according to various processing parameters; an object detection model including a model learning dictionary that infers object positions and likelihoods from the input of the plurality of images processed by the model preprocessing means; a model post-processing means that corrects, based on the inference results of the object detection model, position information including a first detection frame and first likelihood information for each detected object in the plurality of images into position information including a second detection frame and second likelihood information having appropriate values; and a robustness verification means that verifies the robustness of the object detection model based on the position information including the second detection frame and the second likelihood information, which are the output results of the model post-processing means, and the various processing parameters.
  • A performance indexing method according to one aspect of the present invention includes: an image processing step of acquiring an image and processing it appropriately; a model preprocessing step of processing the image acquired in the image processing step into a plurality of images according to various processing parameters; inference by an object detection model including a model learning dictionary that infers object positions and likelihoods from the input of the plurality of images processed in the model preprocessing step; a model post-processing step of correcting, based on the inference results of the object detection model, position information including a first detection frame and first likelihood information for each detected object in the plurality of images into position information including a second detection frame and second likelihood information having appropriate values; and a robustness verification step of verifying the robustness of the object detection model based on the position information including the second detection frame and the second likelihood information, which are the output results of the model post-processing step, and the various processing parameters.
  • a program according to one aspect of the present invention is a program for causing a computer to execute the performance indexing method described above.
  • According to the present invention, a performance indexing device is provided for accurately analyzing the performance of a model that detects objects in images, as well as weaknesses in the versatility and robustness of a model learning dictionary and policies for reinforcing them.
  • FIG. 1 is a diagram showing a performance indexing device for an object detection model according to an embodiment of the present invention.
  • FIG. 2 is a diagram showing the configuration of an artificial neuron model.
  • FIG. 2 is a diagram illustrating the configuration of a YOLO model according to an embodiment.
  • FIG. 3 is a diagram illustrating the operating principle of the YOLO model according to an embodiment.
  • A diagram showing the calculation concept of the IOU value in object detection.
  • FIG. 6 is a diagram showing a flowchart of the individual identification means of the model post-processing means according to the embodiment of the present invention.
  • FIG. 6 is a diagram showing the operation of the individual identification means of the model post-processing means according to the embodiment of the present invention.
  • FIG. 6 is a diagram showing a flowchart of the individual identification means of the model post-processing means according to the embodiment of the present invention.
  • FIG. 6 is a diagram showing the operation of the individual identification means of the model post-processing means according to the embodiment of the present invention.
  • FIG. 1 is a diagram illustrating problems of a conventional object detection model performance indexing device.
  • FIG. 2 is a second diagram illustrating problems with a conventional object detection model performance indexing device.
  • FIG. 6 is a diagram illustrating the operation of the position shifting function of the model preprocessing means according to an embodiment of the invention.
  • FIG. 6 is a diagram illustrating the operation of the resizing function of the model preprocessing means according to the embodiment of the present invention.
  • FIG. 6 is a diagram showing the operation of the probability statistical calculation means of the robustness verification means according to the embodiment of the present invention.
  • FIG. 6 is a diagram showing the operation of the probability statistical calculation means of the robustness verification means according to the embodiment of the present invention.
  • FIG. 6 is a diagram showing the operation of the probability statistical calculation means of the robustness verification means according to the embodiment of the present invention.
  • FIG. 6 is a diagram showing the operation of the tone conversion function of the model preprocessing means according to the embodiment of the present invention.
  • FIG. 6 is a diagram illustrating the operation of the aspect ratio changing function of the model preprocessing means according to the embodiment of the present invention.
  • FIG. 6 is a diagram illustrating the operation of the rotation function of the model preprocessing means according to an embodiment of the invention.
  • FIG. 1 is a diagram showing a performance indexing device for an object detection model according to an embodiment of the present invention.
  • FIG. 2 is a diagram showing a conventional object detection model performance indexing device.
  • FIG. 2 is a diagram illustrating a summary of the object detection model performance indexing device of the present invention.
  • One of the indices indicating the detection reliability of a target object is the confidence score shown in (Equation 1) below (for example, see Non-Patent Document 1). The confidence score is sometimes commonly referred to as the likelihood.
  • Confidence = Pr(Class_i | Object) × Pr(Object) × IOU^truth_pred (Equation 1)
  • Pr(Class_i | Object) indicates the class probability, that is, to which class the Object (target object) belongs, and the sum of all class probabilities is "1".
  • Pr(Object) indicates the probability that an Object is included in a Bounding Box (hereinafter referred to as BBox).
  • IOU^truth_pred is the IOU (Intersection Over Union) value, an index indicating how much two frame regions overlap: the ground truth BBox, which is the correct frame information, and the BBox predicted (inferred) by a model such as YOLO.
  • IOU = Area of Intersection ÷ Area of Union (Equation 2)
  • Area of Union is the area of the union of the two frame regions being compared.
  • Area of Intersection is the area of the common portion of the two frame regions being compared.
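  • As a reference for (Equation 2), a minimal sketch of the IOU computation between two axis-aligned frames is given below. The (cx, cy, w, h) box format and the function name are illustrative assumptions, not taken from the patent.

```python
def iou(box_a, box_b):
    """Compute the IOU (Intersection over Union) of two frames.

    Each box is (cx, cy, w, h): center coordinates, width, height,
    matching the detection-frame format described in this document
    (an assumption for illustration).
    """
    # Convert center/size format to corner coordinates.
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Area of Intersection: overlap of the two frame regions.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih

    # Area of Union = sum of both areas minus the intersection.
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0
```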
  • Other indices include mAP (mean average precision) and AP (average precision), which in object detection are calculated by the following method.
  • For each class to be identified, Precision and Recall values are calculated while sweeping the threshold on the probability that an Object is included in the BBox from the minimum "0" to the maximum "1". Plotting the resulting Precision against Recall as a two-dimensional graph, the area under the curve is calculated as the AP, and the average of the APs calculated over all identification classes is calculated as the mAP.
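  • The AP/mAP procedure described above can be sketched as follows; the input format (per-class lists of Precision/Recall points obtained from the threshold sweep) and the trapezoidal integration convention are assumptions made for illustration.

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve for one class.

    `recalls` and `precisions` are parallel lists obtained by sweeping
    the confidence threshold, as described above.
    """
    # Sort points by recall and accumulate the area with the
    # trapezoidal rule (one common convention; others interpolate).
    pts = sorted(zip(recalls, precisions))
    ap, prev_r, prev_p = 0.0, 0.0, 1.0
    for r, p in pts:
        ap += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return ap

def mean_average_precision(per_class_pr):
    """mAP = mean of the per-class AP values.

    `per_class_pr` maps each class name to a (recalls, precisions) pair.
    """
    aps = [average_precision(r, p) for r, p in per_class_pr.values()]
    return sum(aps) / len(aps)
```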
  • FIG. 17 is a block diagram showing a conventional performance indexing device for analyzing the robustness and reinforcement policy of a model learning dictionary for a model that detects the position of an object in an image and identifies its class.
  • The image processing means 100 that acquires and appropriately processes images includes a lens (for example, a standard zoom, wide-angle zoom, or fisheye lens), an image sensor, which is a device that receives the light emitted from an object and passing through the lens and converts the brightness of the light into electrical information, and an image processing processor equipped with correction functions and a local tone mapping function; it performs image processing that makes the object to be detected easier to see or find while absorbing time-series fluctuation conditions such as the illuminance of the shooting environment.
  • The image generated by the image processing means 100 is input to the image output control means 110 and sent to a display and data storage means 120, such as a monitor, an external memory such as a PC (personal computer), or a cloud server.
  • The model preprocessing means 200 may be configured with an electronic circuit, or may be realized by an image processing processor 290 comprising an affine transformation function 291 and a projective transformation function 292 (libraries) together with a CPU or an arithmetic processor.
  • The image processed by the model preprocessing means 200 is input to the object detection model 300, and by inference (prediction) it is detected where the target object is, and it is identified which class, such as a person or a vehicle, the object corresponds to (class identification).
  • As a result of the inference, position information 301 including zero or more first detection frames (including undetected and falsely detected objects) and first likelihood information 302 are output.
  • the position information 301 including the first detection frame is, for example, information including the center coordinates, horizontal width, and vertical height of the detection frame
  • The first likelihood information 302 is, for example, the likelihood indicating detection accuracy and class identification information.
  • the object detection model 300 includes, for example, a model learning dictionary 320 and a deep neural network (DNN) model 310 using a convolutional neural network (CNN).
  • The DNN model 310 may use, for example, YOLO (see, for example, Non-Patent Document 1) or SSD, which are models with a strong advantage in detection processing speed; Faster R-CNN, EfficientDet, or the like may also be used. When mainly performing class identification without detecting the position of the object, MobileNet, for example, may be used.
  • The model learning dictionary 320 is a collection of weighting coefficient data for the DNN model 310, and is initially trained or retrained by the deep learning means 640.
  • The position information 301 including zero or more first detection frames (including undetected and falsely detected objects) and the first likelihood information 302 are input to the model post-processing means 400, which sorts the position information 301 including the first detection frames based on their mutual IOU values, determines the maximum of the first likelihood information 302, and so on, thereby correcting them into the position information 401 including the second detection frame and the second likelihood information 402 most appropriate for each detected object; these are then sent to the display and data storage means 120, such as a monitor, an external memory such as a PC (personal computer), or a cloud server.
  • the position information 401 including the second detection frame is, for example, information including the center coordinates, horizontal width, and vertical height of the detection frame
  • The second likelihood information 402 is, for example, the likelihood indicating detection accuracy and class identification information.
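  • One common way to realize this kind of correction is greedy non-maximum suppression. A minimal sketch follows, assuming each raw detection is a ((cx, cy, w, h), likelihood) tuple and reusing the iou() helper above; t and u correspond to the likelihood and IOU thresholds called T and U later in this document. The patent describes its individual identification means only at this level, so the details here are assumptions.

```python
def select_second_detections(detections, t=0.25, u=0.5):
    """Correct raw first-detection results into one frame per object.

    `detections` is a list of (box, likelihood) tuples with `box` as
    (cx, cy, w, h). Detections below the likelihood threshold `t` are
    discarded; greedy non-maximum suppression then repeatedly keeps
    the maximum-likelihood frame and removes frames whose IOU with it
    exceeds the threshold `u`.
    """
    remaining = sorted((d for d in detections if d[1] >= t),
                       key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)   # maximum likelihood for this object
        kept.append(best)         # second detection frame + likelihood
        remaining = [d for d in remaining if iou(best[0], d[0]) <= u]
    return kept
```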
  • The series of means that generates the position information 401 including the second detection frame and the second likelihood information 402, namely the image processing means 100, model preprocessing means 200, object detection model 300, and model post-processing means 400, constitutes a first performance indexing device 30 for analyzing the robustness and reinforcement policy of a model learning dictionary for a model that detects the position of an object in an image and identifies its class.
  • learning material data considered appropriate for the purpose of use is extracted from the learning material database storage means 610 in which material data for deep learning such as large-scale open source datasets are stored.
  • As learning material data for the images required by the purpose of use, image data output from the image processing means 100 via the image output control means 110 and stored in the data storage means 120 may also be utilized in some cases.
  • The annotation means 620 adds class identification information and the ground truth BBox, which is the correct answer frame, to the learning material data extracted from the learning material database storage means 610 to create supervised data.
  • the supervised data generated by the annotation means 620 is augmented by the augmentation means 630 as a learning image 631 in order to enhance versatility and robustness.
  • the learning image 631 is input to the deep learning means 640, the weighting coefficient of the DNN model 310 is calculated, and the calculated weighting coefficient is converted into, for example, ONNX format to create the model learning dictionary 320.
  • By reflecting the model learning dictionary 320 in the object detection model 300, it becomes possible to detect the position of an object in an image and identify its class.
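  • As a concrete illustration of converting trained weighting coefficients into ONNX format, a sketch using PyTorch's exporter is shown below; the patent does not specify a training framework, and the toy network, input size, and file name are all assumptions.

```python
import torch
import torch.nn as nn

# Stand-in for the DNN model 310 after deep learning; the real network
# and its trained weighting coefficients (model learning dictionary 320)
# would be used instead of this toy module.
model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU())
model.eval()

# Convert the weights and graph to ONNX format. The single-channel
# 416x416 input size and the file name are illustrative assumptions.
dummy_input = torch.randn(1, 1, 416, 416)
torch.onnx.export(model, dummy_input, "model_learning_dictionary.onnx")
```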
  • Validation material data for verifying detection accuracy, detection performance, versatility, and robustness required for the purpose of use is extracted from the aforementioned learning material database storage means 610.
  • As validation material data for verifying the detection accuracy, detection performance, versatility, and robustness required by the purpose of use, images from a large-scale open-source dataset, or image data output from the image processing means 100 via the image output control means 110 and stored in the data storage means 120, are utilized.
  • The annotation means 620 adds class identification information and the ground truth BBox, which is the correct answer frame, to the validation material data extracted from the learning material database storage means 610 to create validation data 623.
  • The validation data 623 is input to the first mAP calculation means 660, which is capable of inference (prediction) equivalent to that of the object detection model 300. From the inference (prediction) results, the IOU value 653 is calculated by comparing the ground truth BBox, which is the correct answer frame, with the predicted BBox; Precision 654, which indicates the proportion of all prediction results over all the validation data 623 whose IOU value 653 was correctly predicted at or above an arbitrary threshold, and Recall 655, which indicates the proportion of correct answers for which a BBox was predicted with an IOU value 653 at or above an arbitrary threshold, are calculated; and the AP (Average Precision) value 651 for each class and the mAP (mean Average Precision) value 652 averaged over all classes are calculated as indices for comparing the object detection accuracy and performance described above (for example, see Non-Patent Document 2).
  • The first mAP calculation means 660 is equipped with, for example, an open-source inference environment called darknet and an arithmetic processor (including a personal computer or a supercomputer), and has inference (prediction) performance equivalent to that of the object detection model 300. It is further provided with means for calculating the IOU value 653, Precision 654, Recall 655, AP value 651, and mAP value 652 described above.
  • The series of means that generates the IOU value 653, Precision 654, Recall 655, AP value 651, and mAP value 652, namely the learning material database storage means 610, the annotation means 620, and the first mAP calculation means 660, constitutes a second performance indexing device 40 for analyzing the robustness and reinforcement policy of a model learning dictionary for a model that performs the position detection and class identification described above.
  • Versatility and robustness items and various fluctuation conditions for a model that detects objects in images acquired by a camera or the like include: the background (scenery); camera lens specifications; image size; the detection target area and field of view determined by, for example, the height and elevation/depression angle at which the camera is mounted; the dewarp processing method when using a fisheye lens; special illuminance conditions caused by sunlight and lighting, such as crushed shadows, blown highlights, and backlight; weather conditions such as sunny, cloudy, rain, snow, and fog; the position (left, right, top, bottom, and depth), size, brightness level, shape and characteristics including color information, aspect ratio, and rotation angle of the target detection object in the image; the number of target detection objects and their mutual overlap; the type, size, and position of matter adhering to the lens; whether or not the lens has an IR cut filter; the moving speed of the object to be detected; and the moving speed of the camera itself.
  • In a performance indexing method (hereinafter sometimes simply referred to as a method) and a program, when the position and size of the detection target fluctuate over time, variations in specific patterns may occur in the inferred (predicted) position information, including the detection frame, and in the likelihood information, even though the same object is being detected, owing to the configuration conditions of the DNN model and issues caused by its algorithm. This problem is thought to be particularly noticeable when the input image size of the DNN model is reduced because of limitations in the performance of arithmetic processors such as DSPs (digital signal processors) installed in cameras for detecting objects, in order to make those cameras smaller, lower power, and lower cost.
  • In YOLO, which is said to have a high processing speed because it detects the position of an object and identifies its class simultaneously, there may be locations where the likelihood decreases in a characteristic grid pattern depending on the position of the detected object. In the case of YOLO, this occurs because the image is divided into grid cells of arbitrary size and class probabilities are calculated per cell in order to detect the object position and identify the class at the same time, as shown in FIG. 3B; this is considered a potential issue.
  • When performance is indexed using the second performance indexing device 40, method, and program for analyzing the robustness and reinforcement policy of a model learning dictionary for a model that detects the position and class of an object in an image, it is possible to understand the overall, average detection accuracy and detection performance for the validation data selected for verification, but it is not possible to understand in detail the versatility and robustness against various fluctuation conditions.
  • Likewise, when the first performance indexing device 30, method, and program, which do not include the first mAP calculation means 660 of FIG. 17, are used, the conditions that require reinforcement will not be fully understood. Therefore, when a model learning dictionary is trained by deep learning or the like, improvements in versatility and robustness against various fluctuation conditions may not be sufficient.
  • The present invention has been made in view of the above-mentioned problems, and its purpose is to provide a performance indexing device, method, and program for accurately analyzing the performance of a model that detects objects in images, as well as weaknesses in the versatility and robustness of a model learning dictionary and policies for reinforcing them. A further purpose is to provide a performance indexing device, method, and program that ensure detection accuracy and performance even when performance limitations are placed on the installed arithmetic processors, such as DSPs (digital signal processors), in order to reduce the size, power consumption, and cost of cameras for detecting objects.
  • A performance indexing device according to a first aspect of the present invention is a device that performs performance indexing of an object detection model, and comprises: an image processing means for acquiring an image and processing it appropriately; a model preprocessing means for processing the image acquired by the image processing means into a plurality of images according to various processing parameters; an object detection model including a model learning dictionary for inferring object positions and likelihoods (degrees of certainty) for the plurality of images processed by the model preprocessing means; a model post-processing means for correcting, based on the inference results of the object detection model, position information including a first detection frame and first likelihood information for each detected object in the plurality of images into position information including a second detection frame and second likelihood information having appropriate values; and a robustness verification means for verifying the robustness of the object detection model based on the position information including the second detection frame and the second likelihood information, which are the output results of the model post-processing means, and the various processing parameters.
  • A performance indexing device according to a second aspect of the present invention is the performance indexing device according to the first aspect, wherein the model preprocessing means, when processing the plurality of images input to the object detection model, generates enlarged or reduced images using L (arbitrary integer) types of arbitrary magnifications as the various processing parameters.
  • A performance indexing device according to a third aspect of the present invention is the performance indexing device according to the first or second aspect, wherein the model preprocessing means, when processing the plurality of images input to the object detection model, generates a total of N × M × L position-shifted images by shifting the images in S (arbitrary decimal) pixel steps, N (arbitrary integer) times in the horizontal direction and M (arbitrary integer) times in the vertical direction, as the various processing parameters.
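  • As a concrete illustration of this position-shift sweep, a minimal NumPy sketch is given below. Integer pixel steps and simple cropping are simplifying assumptions (the aspect allows decimal steps S, which would require interpolation), and all names are illustrative.

```python
import numpy as np

def position_shift_sweep(image, crop_h, crop_w, s=1, n=8, m=8):
    """Generate N x M position-shifted crops of a base image.

    `image` is a 2-D luminance array. A crop_h x crop_w window is cut
    out at offsets stepped by `s` pixels, `n` times horizontally and
    `m` times vertically, yielding n * m model input images. With L
    resize magnifications applied beforehand, the total is N x M x L.
    """
    shifted = []
    for j in range(m):               # vertical shift index
        for i in range(n):           # horizontal shift index
            y, x = j * s, i * s
            shifted.append(image[y:y + crop_h, x:x + crop_w])
    return shifted

# Example: 64 shifted 416x416 crops from a 480x480 base image.
base = np.zeros((480, 480), dtype=np.uint8)
crops = position_shift_sweep(base, 416, 416, s=4, n=8, m=8)
```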
  • A performance indexing device according to a fourth aspect of the present invention is the performance indexing device according to any one of the first to third aspects, wherein the model preprocessing means, when processing the plurality of images input to the object detection model, further generates images whose brightness levels are changed to arbitrary values using P (arbitrary integer) types of contrast correction curves or gradation conversion curves as the various processing parameters.
  • A performance indexing device according to a fifth aspect of the present invention is the performance indexing device according to any one of the first to fourth aspects, wherein the model preprocessing means, when processing the plurality of images input to the object detection model, further generates images with changed aspect ratios using Q (arbitrary integer) types of aspect ratios as the various processing parameters.
  • A performance indexing device according to a sixth aspect of the present invention is the performance indexing device according to any one of the first to fifth aspects, wherein the model preprocessing means, when processing the plurality of images input to the object detection model, further generates images with changed rotation angles using R (arbitrary integer) types of angles as the various processing parameters.
  • A performance indexing device according to a seventh aspect of the present invention is the performance indexing device according to any one of the first to sixth aspects, wherein the model preprocessing means, when processing the plurality of images input to the object detection model, generates images by pasting the average luminance level of the valid image into blank areas where no valid image exists as a result of the processing.
  • A performance indexing device according to an eighth aspect of the present invention is the performance indexing device according to any one of the first to seventh aspects, wherein the model post-processing means comprises an individual identification means that, for each of one or more detected objects in the output results of the object detection model present in one image, corrects position information including zero or more first detection frames (including undetected and falsely detected objects) and the first likelihood information into position information including the second detection frame with the maximum likelihood and the second likelihood information for each detected object, using an arbitrary threshold T (arbitrary decimal number) for the first likelihood information and an arbitrary threshold U (arbitrary decimal number) for the IOU (Intersection over Union) value, which is an index indicating how much the regions of position information including the first detection frames overlap one another.
  • A performance indexing device according to a ninth aspect of the present invention is the performance indexing device according to any one of the first to eighth aspects, wherein the model post-processing means has a function of correcting position information including the correct detection frame according to the contents of the various processing parameters when position information including a correct detection frame and correct class identification information exist, and comprises an individual identification means that, for each of one or more detected objects in the output results of the object detection model present in one image, corrects position information including zero or more first detection frames (including undetected and falsely detected objects) and the first likelihood information into position information including the second detection frame with the maximum likelihood and the second likelihood information for each detected object, using an arbitrary threshold T (arbitrary decimal number) for the first likelihood information and an arbitrary threshold U (arbitrary decimal number) for the IOU (Intersection over Union) value, which is an index indicating how much the region of the position information including the correct detection frame and the region of the position information including the first detection frame overlap.
  • A performance indexing device according to a tenth aspect of the present invention is the performance indexing device according to the eighth or ninth aspect, wherein the model post-processing means individually links, for each detected object, the various processing parameters used in image processing with the output results of the individual identification means, and outputs them to the robustness verification means.
  • A performance indexing device according to an eleventh aspect of the present invention is the performance indexing device according to any one of the second to tenth aspects citing the second aspect, or the third to tenth aspects citing the third aspect, wherein the robustness verification means comprises a probability statistical calculation means that calculates, for each of the various processing parameters and based on the likelihood in the position information including the second detection frame and the second likelihood information, which are the output results of the model post-processing means, any or all of: a likelihood distribution indicating the variation accompanying the position shift for each detected object; the average likelihood, which is the average value of the valid region of the likelihood; a histogram of the likelihood; the standard deviation of the likelihood, which is the standard deviation of the valid region of the likelihood; the maximum likelihood, which is the maximum value of the valid region of the likelihood; the minimum likelihood, which is the minimum value of the valid region of the likelihood; and the IOU value corresponding to the likelihood.
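  • A minimal sketch of such a probability statistical calculation over a per-object likelihood distribution follows; representing excluded positions as NaN and treating the remaining entries as the valid region is an assumption made for illustration.

```python
import numpy as np

def likelihood_statistics(likelihood_map, bins=20):
    """Statistics over a per-object likelihood distribution.

    `likelihood_map` is an M x N array of likelihoods, one per position
    shift; NaN entries mark positions excluded from the valid region.
    Returns the quantities listed in the eleventh aspect above.
    """
    valid = likelihood_map[~np.isnan(likelihood_map)]
    hist, edges = np.histogram(valid, bins=bins, range=(0.0, 1.0))
    return {
        "average_likelihood": float(valid.mean()),
        "std_likelihood": float(valid.std()),
        "max_likelihood": float(valid.max()),
        "min_likelihood": float(valid.min()),
        "histogram": hist,
        "bin_edges": edges,
    }
```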
  • A performance indexing device according to a twelfth aspect of the present invention is the performance indexing device according to any one of the second to eleventh aspects citing the second aspect, or the third to eleventh aspects citing the third aspect, wherein, based on the IOU value between the position information including the second detection frame, which is the output result of the model post-processing means, and the position information including the correct detection frame, and on the class identification correct answer rate calculated from the class identification information in the second likelihood information and the correct class identification information, the robustness verification means comprises a probability statistical calculation means that calculates, for each of the various processing parameters and for each detected object, any or all of: an IOU distribution and a class identification correct answer rate distribution indicating the variation due to the position shift; the average IOU value and average class identification correct answer rate, which are the average values of the valid regions of the IOU value and the class identification correct answer rate; a histogram of the IOU value and a histogram of the class identification correct answer rate; the standard deviation of the IOU value and the standard deviation of the class identification correct answer rate, which are the standard deviations of those valid regions; the maximum IOU value and maximum class identification correct answer rate, which are the maximum values of those valid regions; and the minimum IOU value and minimum class identification correct answer rate, which are the minimum values of those valid regions.
  • A performance indexing device according to a thirteenth aspect of the present invention is the performance indexing device according to the eleventh aspect, or the twelfth aspect citing the eleventh aspect, wherein the robustness verification means further comprises a learning reinforcement necessary item extraction means that performs, for each of the various processing parameters, any or all of: extraction of positions or regions where the likelihood distribution for each detected object is at or below an arbitrary threshold; extraction of detected objects whose average likelihood is at or below an arbitrary threshold; extraction of detected objects whose standard deviation of the likelihood is at or above an arbitrary threshold; extraction of detected objects whose maximum likelihood is at or below an arbitrary threshold; extraction of detected objects whose minimum likelihood is at or below an arbitrary threshold; and extraction of detected objects whose IOU value is at or below an arbitrary threshold.
  • A performance indexing device according to a fourteenth aspect of the present invention is the performance indexing device according to the twelfth aspect, or the thirteenth aspect citing the twelfth aspect, wherein the robustness verification means further comprises a learning reinforcement necessary item extraction means that performs, for each of the various processing parameters, any or all of: extraction of positions or regions where the IOU distribution for each detected object is at or below an arbitrary threshold; extraction of positions or regions where the class identification correct answer rate distribution is at or below an arbitrary threshold; extraction of detected objects whose average IOU value is at or below an arbitrary threshold; extraction of detected objects whose average class identification correct answer rate is at or below an arbitrary threshold; and extraction of detected objects whose standard deviation of the IOU value is at or above an arbitrary threshold.
  • A performance indexing device according to a fifteenth aspect of the present invention is the performance indexing device according to the fourteenth aspect, wherein the probability statistical calculation means and the learning reinforcement necessary item extraction means of the robustness verification means have a function of excluding, from the targets of probability statistical calculation based on the likelihood, the IOU value, and the class identification correct answer rate, images in which pixels related to the target detection object are missing at an arbitrary rate or higher.
  • A performance indexing device according to a sixteenth aspect of the present invention is the performance indexing device according to any one of the thirteenth to fifteenth aspects, wherein learning images are prepared based on analysis of the output of the probability statistical calculation means and on the results of the learning reinforcement necessary item extraction means, and the model learning dictionary is retrained by a built-in or external dictionary learning means.
  • A performance indexing device according to a seventeenth aspect of the present invention is the performance indexing device according to any one of the first to sixteenth aspects, wherein the object detection model is a neural network including a model learning dictionary created by deep learning.
  • A performance indexing method according to one aspect of the present invention is a method for creating a performance index for an object detection model, and includes: an image processing step of acquiring an image and processing it appropriately; a model preprocessing step of processing the image acquired in the image processing step into a plurality of images according to various processing parameters; inference by an object detection model including a model learning dictionary for inferring object positions and likelihoods (degrees of certainty) for the plurality of images processed in the model preprocessing step; a model post-processing step of correcting, based on the inference results of the object detection model, position information including a first detection frame and first likelihood information for each detected object in the plurality of images into position information including a second detection frame and second likelihood information having appropriate values; and a robustness verification step of verifying the robustness of the object detection model based on that information and the various processing parameters. The method executes each of the means described above as steps.
  • A performance indexing program according to one aspect of the present invention is a program for causing a computer to perform performance indexing of an object detection model, and includes: an image processing step of acquiring an image and processing it appropriately; a model preprocessing step of processing the image acquired in the image processing step into a plurality of images according to various processing parameters; inference by an object detection model including a model learning dictionary that infers object positions and likelihoods (degrees of certainty) for the plurality of images processed in the model preprocessing step; a model post-processing step of correcting, based on the inference results of the object detection model, position information including a first detection frame and first likelihood information for each detected object in the plurality of images into position information including a second detection frame and second likelihood information having appropriate values; and a robustness verification step of verifying the robustness of the object detection model based on the position information including the second detection frame and the second likelihood information, which are the output results of the model post-processing step, and the various processing parameters. The program causes a computer to execute these steps.
  • According to the above configuration, the image from the image processing means is used as a basic image, and the model preprocessing means generates a total of N × M position-shifted images in S (arbitrary decimal) pixel steps, N (arbitrary integer) times in the horizontal direction and M (arbitrary integer) times in the vertical direction, as the various processing parameters. For each of the plurality of images, the model post-processing means corrects the output so that individual identification is possible, and the robustness verification means calculates the likelihood distribution for the position of each detected object.
  • The robustness verification means further calculates any or all of: the likelihood distribution indicating the variation due to the position shift for each detected object; the average likelihood, which is the average value of the valid region of the likelihood; a histogram of the likelihood; the standard deviation of the likelihood, which is the standard deviation of the valid region of the likelihood; the maximum likelihood, which is the maximum value of the valid region of the likelihood; and the minimum likelihood, which is the minimum value of the valid region of the likelihood.
  • When position information including a correct detection frame and correct class identification information exist, the robustness verification means further calculates any or all of: the IOU distribution and the class identification correct answer rate distribution; the average IOU value and the average class identification correct answer rate; histograms of the IOU value and of the class identification correct answer rate; the standard deviations of the IOU value and of the class identification correct answer rate; the maximum IOU value and the maximum class identification correct answer rate; and the minimum IOU value and the minimum class identification correct answer rate. This makes it possible to extract features in which the position information including the detection frame and the class identification information fluctuate owing to fluctuations in the position of the detected object on the screen.
  • Furthermore, because the probability statistical calculation means and the learning reinforcement necessary item extraction means of the robustness verification means have a function of excluding, from the targets of probability statistical calculation based on the likelihood, the IOU value, and the class identification correct answer rate, images in which pixels related to the target detection object are missing at an arbitrary rate or higher, it becomes possible to accurately verify the performance and features of the object detection model, and the versatility and robustness of the model learning dictionary, even when the effective range of the object to be detected is partially missing depending on the position of the object after processing with the various processing parameters of the model preprocessing means. It is therefore possible to improve the DNN model with respect to detected object size and to enhance the versatility and robustness of the model learning dictionary.
  • When the model preprocessing means further generates enlarged or reduced images using L (arbitrary integer) types of arbitrary magnifications as the various processing parameters and then generates the above-mentioned position-shifted images, the robustness verification means equipped with the probability statistical calculation means makes it possible to check, for each of the L types of sizes, the likelihood distribution for the position of each detected object, the average likelihood of the valid region of the likelihood, the histogram of the likelihood, the standard deviation of the likelihood, the maximum likelihood, the minimum likelihood, and the IOU value.
  • When P (arbitrary integer) types of contrast correction curves or gradation conversion curves are used to generate images whose brightness levels are changed to arbitrary values, the robustness verification means equipped with the probability statistical calculation means makes it possible to check, for each of the P types of curves, the likelihood distribution for the position of each detected object, the average likelihood of the valid region of the likelihood, the histogram of the likelihood, the standard deviation of the likelihood, the maximum likelihood, the minimum likelihood, and the IOU value.
  • It also becomes possible to check, for each of the P types of contrast correction curves or gradation conversion curves, the distribution, histogram, standard deviation, maximum value, and minimum value of the IOU value for the position of each detected object, and the distribution, histogram, standard deviation, maximum value, and minimum value of the class identification correct answer rate. It is therefore possible to improve the DNN model, and to enhance the versatility and robustness of the model learning dictionary, with respect to the brightness levels of the detected object and background, which change depending on weather conditions, shooting time, and the illuminance conditions of the shooting environment.
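  • As one concrete way to generate the P brightness-level variants, the sketch below uses gamma curves as the family of gradation conversion curves; the patent leaves the curve shapes arbitrary, so this choice and all names are assumptions.

```python
import numpy as np

def tone_converted_images(image, gammas=(0.5, 0.75, 1.0, 1.5, 2.0)):
    """Generate P images with brightness levels changed by tone curves.

    `image` is an 8-bit luminance array; each gamma value defines one
    gradation conversion curve, so len(gammas) plays the role of P.
    """
    lut_inputs = np.arange(256, dtype=np.float32) / 255.0
    variants = []
    for g in gammas:
        # Build a 256-entry lookup table for this conversion curve
        # and apply it to every pixel at once.
        lut = np.clip(255.0 * lut_inputs ** g, 0, 255).astype(np.uint8)
        variants.append(lut[image])
    return variants
```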
  • When Q (arbitrary integer) types of aspect ratios are used to generate images with changed aspect ratios, the robustness verification means equipped with the probability statistical calculation means makes it possible to check, for each of the Q types of aspect ratios, the likelihood distribution for the position of each detected object, the average likelihood of the valid region of the likelihood, the histogram of the likelihood, the standard deviation of the likelihood, the maximum likelihood, the minimum likelihood, and the IOU value.
  • It also becomes possible to check, for each of the Q types of aspect ratios, the distribution, histogram, standard deviation, maximum value, and minimum value of the IOU value for the position of each detected object, and the distribution, histogram, standard deviation, maximum value, and minimum value of the class identification correct answer rate. It is therefore possible to improve the DNN model for various aspect ratios of the detected object and to enhance the versatility and robustness of the model learning dictionary.
  • When the model preprocessing means uses R (arbitrary integer) types of angles as the various processing parameters to generate images with changed rotation angles, the robustness verification means equipped with the probability statistical calculation means makes it possible to check, for each of the R types of angles, the likelihood distribution for the position of each detected object, the average likelihood of the valid region of the likelihood, the histogram of the likelihood, the standard deviation of the likelihood, the maximum likelihood, the minimum likelihood, and the IOU value.
  • It also becomes possible to check, for each of the R types of angles, the distribution, histogram, standard deviation, maximum value, and minimum value of the IOU value for the position of each detected object, and the distribution, histogram, standard deviation, maximum value, and minimum value of the class identification correct answer rate. It is therefore possible to improve the DNN model for various inclinations of the detected object and to enhance the versatility and robustness of the model learning dictionary.
  • Because the model post-processing means includes a series of means for individually linking each output result and the various processing parameters for each detected object and outputting them to the robustness verification means, it becomes possible to extract features whose likelihood changes due to fluctuations in the position of the detected object in the screen. It is therefore possible to more accurately extract latent problems related to accuracy and performance during inference that the neural network itself, including the DNN model in the object detection model, possesses.
  • By preparing training images based on the analysis results and retraining the model learning dictionary with the built-in or external dictionary learning means, weaknesses and reinforcement policies in the versatility and robustness against various fluctuation conditions caused by the model learning dictionary created by deep learning or the like, covering various processing parameters other than position shift, such as the position of an object in arbitrary ranges of the screen (left, right, top, bottom, and depth), object size, contrast, gradation, aspect ratio, and rotation, can be accurately understood separately from the latent issues of neural networks themselves, including DNN models. Effective learning image data and supervised data can therefore be applied to deep learning and the like, making it possible to enhance the versatility and robustness of the model learning dictionary.
  • When a plurality of images to be input to the object detection model are processed by the model preprocessing means, the average brightness level of the valid image is pasted into blank areas where no valid image exists as a result of the processing.
  • The model post-processing means further includes an individual identification means that, for each detected object present in the image, corrects position information including zero or more first detection frames (including undetected and falsely detected objects) and the first likelihood information into position information including the second detection frame with the maximum likelihood and the second likelihood information, using an arbitrary threshold U (arbitrary decimal number) for the IOU (Intersection over Union) value. Abnormal data can thereby be corrected into appropriate position information including the detection frame and appropriate likelihood information for each detected object, so the likelihood distribution for the position of each detected object and the average likelihood of the valid region of the likelihood can be checked.
  • When position information including a correct detection frame and correct class identification information exist, the model post-processing means has a function of correcting the position information including the correct detection frame according to the contents of the various processing parameters, and includes an individual identification means that, for each detected object present in the image, corrects position information including zero or more first detection frames (including undetected and falsely detected objects) and the first likelihood information into position information including the second detection frame with the maximum likelihood and the second likelihood information, using an arbitrary threshold T (arbitrary decimal number) for the first likelihood information and an arbitrary threshold U (arbitrary decimal number) for the IOU (Intersection over Union) value, which is an index indicating how much the region of the position information including the correct detection frame and the region of the position information including the first detection frame overlap.
  • The distribution, histogram, standard deviation, maximum value, and minimum value of the IOU value for the position of each detected object, and the distribution, histogram, standard deviation, maximum value, and minimum value of the class identification correct answer rate, can thereby be checked for accuracy against the correct answer data. It is therefore possible to more accurately improve the DNN model and strengthen the versatility and robustness of the model learning dictionary.
  • The robustness verification means further performs, for each of the various processing parameters, extraction of positions or regions where the likelihood distribution for each detected object is at or below an arbitrary threshold, extraction of detected objects whose average likelihood is at or below an arbitrary threshold, extraction of detected objects whose standard deviation of the likelihood is at or above an arbitrary threshold, extraction of detected objects whose maximum likelihood is at or below an arbitrary threshold, and extraction of detected objects whose minimum likelihood is at or below an arbitrary threshold.
  • The robustness verification means further extracts, for each of the various processing parameters, positions or regions at or below an arbitrary threshold in the IOU distribution for each detected object, and positions or regions at or below an arbitrary threshold in the class identification correct answer rate distribution for each detected object.
  • FIG. 1 is a block diagram showing a performance indexing device 10 for detecting objects in images according to Embodiment 1 of the present invention.
  • each means, each function, and each process described in Embodiment 1 of the present invention described later may be replaced with a step, and each device may be replaced with a method.
  • each means and each device described in Embodiment 1 of the present invention may be realized as a program executed by a computer.
  • The image processing means 100 that acquires and appropriately processes images includes, as its main components, a lens 101, an image sensor 102, which is a device that receives the light emitted from an object and passing through the lens and converts the brightness of the light into electrical information, and an image processing processor 103 equipped with a black level adjustment function, an HDR (high dynamic range) composition function, a gain adjustment function, an exposure adjustment function, a defective pixel correction function, a shading correction function, a white balance function, a color correction function, a gamma correction function, a local tone mapping function, and the like. Functions other than those described above may also be provided.
  • the lens 101 may be, for example, a standard zoom lens, a wide-angle zoom lens, a fisheye lens, or the like, depending on the purpose of object detection.
  • Time-series fluctuation conditions such as illuminance are detected and controlled by the various functions installed in the image processing processor 103, and image processing is applied to make the object to be detected easier to see or find while suppressing those fluctuations.
  • The image generated by the image processing means 100 is input to the image output control means 110 and sent to a display and data storage means 120, such as a monitor device, an external memory such as a PC (personal computer), or a cloud server.
  • the image output control means 110 may have a function of transmitting image data according to horizontal and vertical synchronization signals of the display and data storage means 120, for example.
  • The image output control means 110 may also have a function of referring to the position information 401 including the second detection frame and the second likelihood information 402, which are the output results of the model post-processing means 400, and superimposing a frame depiction and likelihood information on the output image to mark the detected object. Further, the position information 401 including the second detection frame and the second likelihood information 402 may be transmitted directly to the display and data storage means 120 using a serial communication function, a parallel communication function, or a UART that converts between the two.
  • the image data generated by the image processing means 100 is input to the model preprocessing means 200 and processed into a model input image 210 suitable for input to the object detection model 300.
  • if the object detection model 300 is a model that performs object detection using image data with only brightness levels, the image for object detection generated by the image processing means 100 may be converted into luminance data having only brightness levels. If the object detection model 300 is a model that performs object detection using color image data including color information, the image may be color image data having pixels such as RGB. Here, a case will be explained in which the object detection model 300 is a model that performs object detection using image data of only the brightness level, and the image for object detection generated by the image processing means 100 is converted into luminance data having only brightness levels.
  • the model preprocessing means 200 may be configured with electronic circuits such as adders, subtractors, multipliers, dividers, and comparators, or with an image processing processor 290 comprising a CPU or an arithmetic processor that provides functions (libraries) such as an affine transformation function 291 and a projective transformation function 292 for converting, for example, a fisheye-lens image into an image equivalent to a human visual field. Note that the image processing processor 290 may be replaced by the image processing processor 103 included in the image processing means 100.
  • using the above-mentioned affine transformation function 291, projective transformation function 292, image processing processor 290, or electronic circuits, the model preprocessing means 200 may include some or all of: a function of cutting out a specific area; a position shift function 220 for shifting the cut-out image to an arbitrary position in the horizontal and vertical directions; a resizing function 230 for enlarging or reducing the image to an arbitrary magnification; a rotation function 240 for rotating the image to an arbitrary angle; an aspect ratio change function 250 for arbitrarily changing the ratio between the horizontal and vertical directions; a gradation conversion function 260 for changing the brightness level with an arbitrary curve; a dewarp function 270 for performing distortion correction, cylindrical conversion, and the like; and a margin padding function 280 for padding an area where no valid pixels exist with an arbitrary brightness level.
  • the model preprocessing means 200 uses the image data generated by the image processing means 100 as a reference image and processes it, using various processing parameters 510 according to the purpose of creating a performance index, into a plurality of model input images 210 that are output to the object detection model 300; the usage and operation will be explained in the description of the robustness verification means 500 below.
  • here, a case will be described in which the object detection model 300 is a model that performs object detection using image data of only the brightness level, and the model input image 210 generated by the model preprocessing means 200 is converted into luminance data having only brightness levels.
  • the image processed by the model preprocessing means 200 is input to the object detection model 300, and by inference (prediction) it is detected where the target object is, and it is identified which class, such as person or vehicle, the object corresponds to (class identification).
  • as a result, position information 301 including zero or more first detection frames (including cases of non-detection and false detection) and first likelihood information 302 are output.
  • the position information 301 including the first detection frame is, for example, information including the center coordinates, horizontal width, and vertical height of the detection frame
  • the first likelihood information 302 is, for example, the likelihood indicating detection accuracy and class identification information.
  • the object detection model 300 consists of, for example, a deep neural network (DNN) model 310 that uses a model learning dictionary 320 and a convolutional neural network (CNN), which is modeled on human brain neurons.
  • the DNN model 310 uses, for example, YOLO (see, for example, Non-Patent Document 1) or SSD, which are models with a high advantage in detection processing speed. Alternatively, Faster R-CNN, EfficientDet, or the like may be used. When mainly performing class identification without detecting the position of the object, MobileNet, for example, may be used.
  • FIG. 2 shows a schematic configuration of an artificial neuron model 330 and a neural network 340, which are the basic configuration of the CNN described above.
  • the artificial neuron model 330 receives output signals X0, X1, X2, and so on from one or more preceding neurons, multiplies each by a weight W0, W1, W2, and so on, and generates an output to the next neuron by passing the sum of the multiplication results through an activation function 350.
  • b is a bias (offset).
  • a collection of these many artificial neuron models is a neural network 340.
  • the neural network 340 is composed of an input layer, an intermediate layer, and an output layer, and the output of each artificial neuron model 330 is input to each artificial neuron model 330 at the next stage.
  • the artificial neuron model 330 may be realized by hardware such as an electronic circuit, an arithmetic processor, and a program.
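  • as a reference for the program realization mentioned above, a minimal Python sketch of the artificial neuron model 330 follows; the function name, NumPy usage, and example values are illustrative assumptions, not part of the embodiment.

```python
import numpy as np

def artificial_neuron(x, w, b, activation):
    # Weighted sum of input signals X0, X1, ... with weights W0, W1, ...,
    # plus bias (offset) b, passed through an activation function 350.
    u = np.dot(w, x) + b
    return activation(u)

# Example with three input signals and a ReLU-style activation.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, 0.4])
print(artificial_neuron(x, w, b=0.2, activation=lambda u: max(0.0, u)))
```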
  • the weighting coefficients of each artificial neuron model 330 are calculated as dictionary data using deep learning.
  • the dictionary data, that is, the model learning dictionary 320 shown in FIG. 1, is initially learned or re-learned using deep learning or the like.
  • the activation function 350 needs to be a non-linear transformation, since a composition of linear transformations is itself only a linear transformation.
  • examples of the activation function 350 include a step function that simply outputs "0" or "1", a sigmoid function 351, and ramp functions; since the sigmoid function 351 decreases calculation speed, a ramp function such as ReLU (Rectified Linear Unit) 352 is often used.
  • ReLU 352 is a function whose output value is always 0 when the input value is less than or equal to 0, and whose output value equals the input value when the input value is greater than 0.
  • Leaky ReLU (Leaky Rectified Linear Unit) 353 may also be used; it is a function that multiplies the input value by α (for example, α = 0.01) when the input value is below 0, and outputs the input value unchanged when the input value is above 0.
  • other activation functions 350 include a softmax function used when identifying the class of a detected object; a suitable function is chosen depending on the purpose of use. The softmax function converts a plurality of output values so that their sum becomes 1.0 (100%) and outputs them.
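  • the activation functions above follow directly from their definitions; the following minimal Python sketch (NumPy-based, an illustrative assumption) shows the sigmoid function 351, ReLU 352, Leaky ReLU 353, and the softmax function.

```python
import numpy as np

def sigmoid(u):                    # sigmoid function 351
    return 1.0 / (1.0 + np.exp(-u))

def relu(u):                       # ReLU 352: 0 for u <= 0, u for u > 0
    return np.maximum(0.0, u)

def leaky_relu(u, alpha=0.01):     # Leaky ReLU 353: alpha * u below 0
    return np.where(u > 0, u, alpha * u)

def softmax(u):                    # converts outputs so they sum to 1.0 (100%)
    e = np.exp(u - np.max(u))      # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # class scores -> probabilities
```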
  • FIGS. 3A and 3B show examples of the configuration of a YOLO model 360, which is one of the DNN models 310.
  • the YOLO model 360 shown in FIG. 3A may have, for example, a horizontal pixel Xi and a vertical pixel Yi as the input image size.
  • the basic configuration may be Convolution layers 370 to 387, which compress and extract region-based feature amounts by convolving regions of surrounding pixels through filtering, Pooling layers 390 to 395, which function to absorb positional deviations of the filter shape in the input image, a fully connected layer, and an output layer.
  • upsampling layers 364 and 365 that perform upsampling using deconvolution may also be used.
  • the model input image size, the pixel sizes of the convolution layers, pooling layers, detection layers, upsampling layers, etc., the number and combination of the various layers, and the number and arrangement of detection layers may be increased, decreased, or changed depending on the intended use.
  • the Convolution layers 370 to 387 correspond to models of simple cells that respond to a specific shape or various shapes, and are used to recognize objects with complex shapes.
  • the Pooling layers 390 to 395 correspond to models of complex cells that function to absorb spatial deviations in shape; even when the position of an object of one shape shifts or the shape changes somewhat, they work so that it can be regarded as the same shape.
  • the upsampling layers 364 and 365 perform upsampling so that class classification results for the original image and the results in each layer of the CNN can be used as feature maps through the skip connections shown at 366 and 367 in FIG. 3A, and the detection layers, including the third detection layer 363, enable detailed region identification. Note that the skip connections 367 and 366 connect networks having the same configuration as the convolution layers 373 and 374 after the convolution layers 385 and 381, respectively.
  • a method for calculating a confidence score 317 (corresponding to likelihood), which corresponds to the detection accuracy and confidence of the YOLO model 360 in an embodiment, will be explained with reference to FIG. 3B, in which one person is used as the detection object.
  • first, the image area of the model input image 311 is divided into grid cells of arbitrary size (a 7×7 example is shown in FIG. 3B).
  • a step 312 of estimating a plurality of Bounding Boxes and their Confidence (reliability) 313 (Pr(Object) × IOU) and a calculation step 314 of the conditional class probability (Pr(Class_i | Object)) 315 are processed in parallel; thereafter, both are multiplied when calculating the confidence score 317 in a final detection step 316. Processing speed can therefore be improved because the position of the object is detected and the class is identified simultaneously.
  • the detection frame 318 of the position information including the first detection frame, indicated by a dotted line in the final detection step 316, is the detection frame displayed as the detection result for the person.
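  • as a reference, a minimal sketch of the confidence score 317 calculation described above, with hypothetical values for one Bounding Box; the variable names and numbers are illustrative assumptions.

```python
# Confidence score 317 = Confidence 313 (Pr(Object) x IOU)
#                        x conditional class probability 315 (Pr(Class_i | Object))
confidence_313 = 0.80                                    # Pr(Object) x IOU for one Bounding Box
class_probability_315 = {"person": 0.9, "vehicle": 0.1}  # Pr(Class_i | Object)

confidence_scores_317 = {cls: confidence_313 * p
                         for cls, p in class_probability_315.items()}
print(confidence_scores_317)  # -> person: 0.72, vehicle: 0.08 (approx.)
```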
  • in the model post-processing means 400, the position information 301 including the first detection frames is sorted based on mutual IOU values, the maximum likelihood is determined, and so on, and the output is corrected into position information 401 including the second detection frame considered most appropriate for each detected object and second likelihood information 402.
  • Position information 401 including the second detection frame and second likelihood information 402 are input to the image output control means 110 and the robustness verification means 500.
  • the position information 401 including the second detection frame is, for example, information including the center coordinates, horizontal width, and vertical height of the detection frame
  • the second likelihood information 402 is, for example, the likelihood indicating detection accuracy and class identification information.
  • the IOU value will be explained with reference to FIG. 4.
  • the denominator of the formula representing the IOU value 420 in FIG. 4(a) is the Area of Union 422 in the above-mentioned (Formula 1), which is the area of the union of the two frame areas to be compared.
  • the numerator of the formula representing the IOU value 420 in (a) of FIG. 4 is the Area of Intersection 423 in the above-mentioned (Formula 1), which is the area of the common portion of the two frame regions to be compared.
  • the maximum value is "1.0", indicating that the two frame data completely overlap.
  • in the example of (b) of FIG. 4, even though the Predicted BBox 426 calculated as a result of inference (prediction) is close to the ground truth BBox 425, which is the correct answer frame for the person 424, the IOU value 427 of the two drops to about 0.65.
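  • a minimal Python sketch of the IOU value calculation in (Formula 1) follows, assuming axis-aligned detection frames given as (center x, center y, width, height) in the manner of the position information described above; the function name and example values are illustrative assumptions.

```python
def iou(box_a, box_b):
    """IOU = Area of Intersection / Area of Union for two detection frames.
    Each box is (cx, cy, w, h)."""
    ax0, ay0 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax1, ay1 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx0, by0 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx1, by1 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))   # intersection width
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))   # intersection height
    inter = iw * ih                                # Area of Intersection 423
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter  # Area of Union 422
    return inter / union if union > 0 else 0.0

print(iou((50, 50, 40, 80), (55, 52, 40, 80)))  # partially overlapping frames
```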
  • the model post-processing means 400 may be characterized by having an individual identification means 410 that corrects the output into position information 401 including the maximum-likelihood second detection frame and second likelihood information 402 for each detected object, using an arbitrary threshold value U (an arbitrary decimal number) for the IOU (Intersection over Union) value, which is an index representing the degree of overlap between two frames.
  • in the flowchart of FIG. 5A, position information 301 including zero or more first detection frames (including cases of non-detection and false detection) for each detected object and first likelihood information 302 are input. In the example of FIG. 5B, it is assumed that position information 441, 442, 443, and 444 including four first detection frames output from the object detection model 300 and four pieces of first likelihood information with likelihoods 445, 446, 447, and 448 are input.
  • in comparison step S432, the likelihood in the first likelihood information 302 is compared with the threshold value "T". If the likelihood is less than the threshold value "T", it is determined to be false, and in deletion step S433 the corresponding position information 301 including the first detection frame and the first likelihood information 302 are deleted from the calculation targets; if the likelihood is equal to or higher than the threshold value "T", it is determined to be true, and in mutual IOU value calculation step S434 the IOU values of all mutual combinations of the position information 301 including the first detection frames remaining as calculation targets are calculated.
  • in comparison step S435, all mutual IOU values are compared with the threshold value "U". If a mutual IOU value is less than the threshold value "U" and determined to be false, the detection results are judged to be independent, and in output step S437 the position information 401 including the second detection frame and the second likelihood information 402 are output; if a mutual IOU value is equal to or greater than the threshold value "U", it is determined to be true, the same object is assumed to be detected redundantly, and the process proceeds to the next maximum likelihood determination step S436.
  • in maximum likelihood determination step S436, the candidates other than the one with the maximum likelihood are determined to be false, and in deletion step S433 the corresponding position information 301 including the first detection frame and first likelihood information 302 are deleted from the calculation targets; the candidate with the maximum likelihood is determined to be true and is output in output step S437 as the position information 401 including the second detection frame and the second likelihood information 402.
  • in FIG. 5B, the first likelihood information including likelihood 447 (0.75) and the position information 443 including the first detection frame are deleted from the calculation targets, and the position information 444 including the first detection frame and the first likelihood information including likelihood 448 (0.92), determined to be the maximum likelihood, are output in output step S437 as position information 452 including the second detection frame and second likelihood information including likelihood 454 (0.92).
  • if the mutual IOU value threshold "U" is set low, then when there are multiple detected objects, especially objects close to each other, the detection results are merged more than expected and detection omissions are more likely to occur; conversely, if it is set high, duplicate detection results may remain even though the same object is detected. It is therefore desirable to set it appropriately according to the performance of the object detection model 300.
  • the individual identification means 410 may perform individual identification using a combination of steps other than the flowchart shown in FIG. 5A.
  • for example, the class identification information in the first likelihood information 302 may be used to limit the objects for which mutual IOU values are calculated in mutual IOU value calculation step S434 to the same class, or processing may be added to determine the maximum likelihood within the same class in maximum likelihood determination step S436.
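  • summarizing the flow of FIG. 5A, a minimal Python sketch of the individual identification means 410 follows; it reuses the iou function from the sketch above, and the data layout and names are illustrative assumptions.

```python
def identify_individuals(detections, T, U):
    """Individual identification along FIG. 5A.
    detections: list of {"box": (cx, cy, w, h), "likelihood": float, "cls": str}."""
    # Comparison step S432 / deletion step S433: discard likelihoods below T.
    alive = [d for d in detections if d["likelihood"] >= T]
    alive.sort(key=lambda d: d["likelihood"], reverse=True)
    results = []
    while alive:
        best = alive.pop(0)            # maximum likelihood determination step S436
        results.append(best)           # output step S437
        # Mutual IOU steps S434/S435: drop redundant detections of the same object.
        alive = [d for d in alive if iou(best["box"], d["box"]) < U]
    return results
```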
  • by using the model post-processing means 400 and the individual identification means 410 as shown in FIGS. 5A and 5B, it is possible to eliminate abnormal data and to correct the position information 401 including the second detection frame and the second likelihood information 402 for each detected object into appropriate information.
  • in other words, the model post-processing means 400 may be characterized by having an individual identification means 410 that corrects the position information 301 including zero or more first detection frames (including cases of non-detection and false detection) and the first likelihood information 302 into position information 401 including the maximum-likelihood second detection frame and second likelihood information 402 for each detected object, using an arbitrary threshold value U (an arbitrary decimal number) for the IOU value, which is an index showing the degree of overlap.
  • the annotation means 620 may create supervised data by adding class identification information and a ground truth BBox, which is a correct answer frame, to the image stored in the display and data storage means 120, and outputs position information 621 including the correct detection frame and correct class identification information 622.
  • in the flowchart of FIG. 6A, position information 301 including zero or more first detection frames (including cases of non-detection and false detection) for each detected object and first likelihood information 302 are input, together with position information 621 including the correct detection frame and correct class identification information 622 for each detected object. In the example of FIG. 6B, it is assumed that position information 471, 472, 473, and 474 including four first detection frames output from the object detection model 300 and four pieces of first likelihood information with likelihoods 475, 476, 477, and 478 are input.
  • in comparison step S432, the likelihood in the first likelihood information 302 is compared with the threshold value "T". If the likelihood is less than the threshold value "T", it is determined to be false, and in deletion step S433 the corresponding position information 301 including the first detection frame and the first likelihood information 302 are deleted from the calculation targets; if the likelihood is equal to or higher than the threshold value "T", it is determined to be true, and in correct-frame IOU value calculation step S461 the IOU value of each combination of the remaining position information 301 including the first detection frames with each piece of position information 621 including the correct detection frame is calculated.
  • in FIG. 6B, the position information including a first detection frame whose likelihood is less than the threshold and the corresponding first likelihood information are deleted from the calculation targets. Three calculation candidates remain, and the IOU values of the position information 471, 473, and 474 including the first detection frames are calculated for each of the position information 480 and 481 including the correct detection frames.
  • in comparison step S462, all IOU values are compared with the threshold value "U". If the IOU value with respect to the position information 621 including the correct detection frame is less than the threshold value "U" and determined to be false, the corresponding position information 301 including the first detection frame and first likelihood information 302 are deleted from the calculation targets in deletion step S433; if the IOU value is equal to or greater than the threshold value "U", it is determined to be true and regarded as a detection candidate with a small difference from the correct answer frame, and the process proceeds to the next class identification determination step S463. In FIG. 6B, no candidate is determined to be false, and the three calculation candidates become the determination targets of class identification determination step S463.
  • in class identification determination step S463, the correct class identification information 622 and the class identification information in the first likelihood information 302 are compared. If they are identified as different classes, it is determined to be false, and in deletion step S433 the corresponding position information 301 including the first detection frame and first likelihood information 302 are deleted from the calculation targets; if they are identified as the same class, it is determined to be true and the process proceeds to the next maximum likelihood determination step S436. In FIG. 6B, assuming that all candidates are determined to be "person" as a result of class identification, the three calculation candidates proceed directly to the determination in maximum likelihood determination step S436.
  • in maximum likelihood determination step S436, the candidates other than the one with the maximum likelihood are determined to be false, and in deletion step S433 the corresponding position information 301 including the first detection frame and first likelihood information 302 are deleted from the calculation targets; the candidate with the maximum likelihood is determined to be true, and in output step S464 the position information 401 including the second detection frame, the second likelihood information 402, and the calculated IOU value are output.
  • in FIG. 6B, for one correct detection frame, the maximum likelihood is determined from the two likelihoods 477 (0.75) and 478 (0.92): the first likelihood information including likelihood 477 (0.75) and the position information 473 including the first detection frame are deleted from the calculation targets, and the position information 474 including the first detection frame and the likelihood 478 (0.92), determined to be the maximum likelihood, are output in output step S464 as position information 491 including the second detection frame and second likelihood information including likelihood 493 (0.92); further, the IOU value 495 (0.85) is output in output step S464.
  • for the other correct detection frame, the position information 471 including the first detection frame is output in output step S464 as position information 490 including the second detection frame and second likelihood information including likelihood 492 (0.85); further, the IOU value 494 (0.73) is output in output step S464.
  • in this way, if the threshold value "U" of the IOU value with the correct answer frame is set lower than in the individual identification means 410 described with reference to FIGS. 5A and 5B and more calculation candidates are left, the detection results can be compared directly with the position information 621 including the correct answer frame, so there is the advantage that detection omissions are less likely to occur and the accuracy of the detection results improves. Furthermore, by arbitrarily changing the threshold value "U" and reprocessing, it also becomes possible to understand and verify the accuracy of the detection frames of the position information 301 including the first detection frames calculated by the object detection model 300.
  • by using the model post-processing means 400 and the individual identification means 410 as shown in FIGS. 6A and 6B, it is possible to eliminate abnormal data and to correct the position information 401 including the second detection frame and the second likelihood information 402 for each detected object into appropriate information.
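  • summarizing the flow of FIG. 6A, a minimal Python sketch of the individual identification means 410 with correct answer data follows; it reuses the iou function from the earlier sketch, and the names and returned layout are illustrative assumptions.

```python
def match_to_ground_truth(detections, ground_truths, T, U):
    """Individual identification with correct answer data along FIG. 6A.
    detections: list of {"box", "likelihood", "cls"} (first detection frames).
    ground_truths: list of {"box", "cls"} (position info 621, class info 622)."""
    candidates = [d for d in detections if d["likelihood"] >= T]   # S432/S433
    results = []
    for gt in ground_truths:
        matched = [(d, iou(d["box"], gt["box"])) for d in candidates]
        matched = [(d, v) for d, v in matched if v >= U]           # S461/S462
        matched = [(d, v) for d, v in matched if d["cls"] == gt["cls"]]  # S463
        if matched:                                                # S436/S464
            best, best_iou = max(matched, key=lambda dv: dv[0]["likelihood"])
            results.append({"gt": gt, "det": best, "iou": best_iou})
        else:
            results.append({"gt": gt, "det": None, "iou": 0.0})    # detection omission
    return results
```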
  • the series of means for generating position information 401 including the second detection frame and second likelihood information 402 using the image processing means 100, model preprocessing means 200, object detection model 300, and model post-processing means 400 corresponds to the conventional first performance indexing device 30 shown in FIG. 17 for analyzing the robustness and reinforcement policy of a model learning dictionary for a model that detects the position of an object in an image and identifies its class.
  • a typical example is a one-stage DNN model, which is said to have a high processing speed because it simultaneously detects the position of an object and identifies its class. The case where the YOLO model 360 is applied will be explained using FIGS. 7A and 7B. As shown in FIG. 7A, even though the same person is detected, when the position of the person in the image changes, the respective likelihoods may vary greatly, such as 0.92, 0.39, and 0.89.
  • as shown in FIG. 7B, when the distance between the camera and the person is 1 m in image 204, 2 m in image 205, and 3 m in image 206, the size of the person and the position in the image change, and the position information 211, 212, and 213 including the second detection frame and the likelihoods 217, 218, and 219 in the second likelihood information are calculated. Considering the performance of the original YOLO model, it is known that detection accuracy and performance deteriorate as the person size becomes smaller, that is, as the distance to the person increases. However, whereas the likelihood 217 of image 204 (detected object distance 1 m) is 0.92 and the likelihood 219 in the second likelihood information of image 206 (detected object distance 3 m) is 0.71, an irregular result may be obtained in which the likelihood 218 of image 205 (detected object distance 2 m) is significantly reduced to 0.45.
  • when processing the plurality of model input images 210, the model preprocessing means 200 may include a position shift function 220 that, as various processing parameters 510, applies N (arbitrary integer) position shifts in the horizontal direction and M (arbitrary integer) position shifts in the vertical direction in S (arbitrary decimal) pixel steps to generate a total of N × M position-shifted model input images 221 to 224. Further, it may be provided with a function of cutting out an arbitrary area. Note that the position shift function 220 may be realized by executing the affine transformation function 291 or the projective transformation function 292 in the image processing processor 290.
  • when processing the plurality of model input images 210 to be input to the object detection model, the model preprocessing means 200 may further include, as various processing parameters 510, a resizing function 230 that generates enlarged or reduced images using L (arbitrary integer) types of arbitrary magnification, combined with the position shift function 220 that applies N (arbitrary integer) position shifts in the horizontal direction and M (arbitrary integer) position shifts in the vertical direction, so as to generate a total of N × M × L resized and position-shifted model input images 210, as in the sketch below. Further, it may be provided with a function of cutting out an arbitrary area. Note that the position shift function 220 and the resizing function 230 may be realized by executing the affine transformation function 291 and the projective transformation function 292 in the image processing processor 290.
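  • a minimal Python sketch of such a parameter sweep in the model preprocessing means 200 follows; the use of OpenCV's cv2.warpAffine and the padding level are illustrative assumptions, and the affine transformation combines the resizing function 230 (L magnifications) with the position shift function 220 (N × M shifts in S-pixel steps).

```python
import numpy as np
import cv2  # OpenCV; corresponds to the affine transformation function 291

def generate_model_inputs(reference, S, N, M, scales, pad_level=0):
    """reference: reference image from the image processing means 100 (H x W).
    Yields one model input image per (magnification, horizontal shift,
    vertical shift): L magnifications x N x M shifts = N * M * L images."""
    h, w = reference.shape[:2]
    for scale in scales:                   # resizing function 230 (L types)
        for n in range(N):                 # position shift function 220
            for m in range(M):
                dx, dy = n * S, m * S      # shifts in S-pixel steps
                mat = np.float32([[scale, 0, dx],
                                  [0, scale, dy]])
                # Areas without valid pixels are padded (margin padding 280).
                yield cv2.warpAffine(reference, mat, (w, h), borderValue=pad_level)

# Example: L = 3 magnifications (30% reduced, standard, 30% enlarged),
# N = M = 8 shifts in 2-pixel steps -> 192 model input images.
# inputs = list(generate_model_inputs(ref, S=2, N=8, M=8, scales=(0.7, 1.0, 1.3)))
```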
  • the plurality of model input images 210 processed by the position shift function 220 and resizing function 230 of the model preprocessing means 200 as shown in FIGS. 8 and 9 are processed by the object detection model 300 and the model post-processing means 400 shown in FIG. 1, the position information 401 including the second detection frame and the second likelihood information 402 are calculated for each of the plurality of model input images 210, and the results are input to the robustness verification means 500, which verifies the versatility and robustness of the object detection model 300 for each of the various processing parameters 510.
  • the items and various variable conditions to be verified by the robustness verification means 500 include, for example: the background (scenery); camera lens specifications; the detection target area and field of view, including image size, determined by the height and elevation/depression angle at which the camera is mounted; the dewarping processing method when using a fisheye lens; illuminance changes due to sunlight and lighting; special conditions such as blackout, overexposure, and backlighting; weather conditions such as clear weather, cloudy weather, rain, snow, and fog; the position (left, right, top, bottom, and depth) of the target detection object within the image size; brightness level; shape and characteristics including color information; aspect ratio; rotation angle; the number of target detection objects and their mutual overlap status; the type, size, and attached position of attachments; whether or not the lens has an IR cut; the moving speed of the target detection object; and the moving speed of the camera itself.
  • items and conditions other than those described above may be added.
  • the various processing parameters 510 are set, selected, or determined based on these various conditions and items.
  • Various processing parameters 510 are input to model pre-processing means 200 and model post-processing means 400.
  • the various processing parameters 510 input to the model preprocessing means 200 may be a combination of parameters related to the position shift function 220, for verifying the influence of fluctuations in the object position, parameters related to the resizing function 230, for verifying versatility and robustness with respect to the detection target area including image size and the object size in the field of view determined by conditions such as camera lens specifications and the height and elevation/depression angle at which the camera is mounted, and other parameters described below.
  • the model post-processing means 400 may output to the robustness verification means 500 a detection result 403 (including the position information 401 including the second detection frame, the second likelihood information 402, etc.) in which the various processing parameters 510 used when the model preprocessing means 200 processed the plurality of images and the output results of the individual identification means 410 are individually linked for each detected object.
  • the robustness verification means 500 may be characterized by comprising a probability statistical calculation means 520 that, based on the position information 401 including the second detection frame and the likelihood in the second likelihood information 402 that are the output results of the model post-processing means 400, calculates for each of the various processing parameters 510 any or all of: a likelihood distribution 540 indicating the variation due to the position shift of each detected object; an average likelihood 501, which is the average value of the valid region of the likelihood; a histogram 550 of the likelihood; a standard deviation of likelihood 502, which is the standard deviation of the valid region of the likelihood; a maximum likelihood 503, which is the maximum value of the valid region of the likelihood; a minimum likelihood 504, which is the minimum value of the valid region of the likelihood; and the IOU value 505.
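  • a minimal Python sketch of the probability statistical calculation means 520 for the likelihood follows; the NumPy usage, the NaN marking of the invalid region, and the dictionary keys are illustrative assumptions.

```python
import numpy as np

def likelihood_statistics(likelihood_map):
    """likelihood_map: M x N array with one likelihood per position shift;
    np.nan marks shifts excluded from the valid region."""
    valid = likelihood_map[~np.isnan(likelihood_map)]
    hist, _ = np.histogram(valid, bins=20, range=(0.0, 1.0))
    return {
        "average_likelihood_501": float(valid.mean()),
        "std_likelihood_502": float(valid.std()),
        "max_likelihood_503": float(valid.max()),
        "min_likelihood_504": float(valid.min()),
        "histogram_550": hist / hist.sum(),  # normalized frequency
    }
```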
  • when position information 621 including the correct detection frame and correct class identification information 622 exist for each detected object, the robustness verification means 500 may be characterized by comprising a probability statistical calculation means 520 that calculates, for each of the various processing parameters 510, based on the IOU value between the position information 401 including the second detection frame output from the model post-processing means 400 and the position information 621 including the correct detection frame, and on the class identification accuracy rate calculated from the class identification information in the second likelihood information 402 and the correct class identification information 622, any or all of: an IOU distribution and a class identification accuracy rate distribution showing the variation due to the position shift of each detected object; an average IOU value and an average class identification accuracy rate, which are the average values of the valid regions; a histogram of the IOU value and a histogram of the class identification accuracy rate; a standard deviation of the IOU value and a standard deviation of the class identification accuracy rate; a maximum IOU value and a maximum class identification accuracy rate, which are the maximum values of the valid regions; and a minimum IOU value and a minimum class identification accuracy rate, which are the minimum values of the valid regions.
  • the robustness verification means 500 may further be characterized by having a learning reinforcement necessary item extraction means 530 that includes, for each of the various processing parameters 510, any or all of: extraction of a position or region where the likelihood distribution 540 for each detected object is equal to or less than an arbitrary threshold; extraction of detected objects whose average likelihood 501 is equal to or less than an arbitrary threshold; extraction of detected objects whose standard deviation of likelihood 502 is equal to or greater than an arbitrary threshold; extraction of detected objects whose maximum likelihood 503 is equal to or less than an arbitrary threshold; extraction of detected objects whose minimum likelihood 504 is equal to or less than an arbitrary threshold; and extraction of detected objects whose IOU value 505 is equal to or less than an arbitrary threshold.
  • the robustness verification means 500 may further be characterized by having a learning reinforcement necessary item extraction means 530 that includes, for each of the various processing parameters 510, any or all of: extraction of a position or region that is equal to or less than an arbitrary threshold in the IOU distribution for each detected object; extraction of a position or region that is equal to or less than an arbitrary threshold in the class identification accuracy rate distribution; extraction of detected objects whose average IOU value is equal to or less than an arbitrary threshold; extraction of detected objects whose average class identification accuracy rate is equal to or less than an arbitrary threshold; extraction of detected objects whose standard deviation of the IOU value or standard deviation of the class identification accuracy rate is equal to or greater than an arbitrary threshold; extraction of detected objects whose maximum IOU value or maximum class identification accuracy rate is equal to or less than an arbitrary threshold; and extraction of detected objects whose minimum IOU value or minimum class identification accuracy rate is equal to or less than an arbitrary threshold.
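  • a minimal Python sketch of the learning reinforcement necessary item extraction means 530 follows; the threshold names and the statistics dictionary from the sketch above are illustrative assumptions.

```python
import numpy as np

def extract_reinforcement_items(likelihood_map, stats, th):
    """Flags weak positions and statistical criteria against arbitrary thresholds."""
    # Positions or regions where the likelihood distribution 540 is at or below
    # the threshold (NaN entries are mapped to +inf and never extracted).
    weak_positions = np.argwhere(np.nan_to_num(likelihood_map, nan=np.inf) <= th["map"])
    flags = {
        "low_average_501": stats["average_likelihood_501"] <= th["average"],
        "high_std_502": stats["std_likelihood_502"] >= th["std"],
        "low_max_503": stats["max_likelihood_503"] <= th["max"],
        "low_min_504": stats["min_likelihood_504"] <= th["min"],
    }
    return weak_positions, flags

# Example thresholds: {"map": 0.3, "average": 0.7, "std": 0.15, "max": 0.8, "min": 0.3}
```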
  • when the probability statistical calculation means 520 and the learning reinforcement necessary item extraction means 530 of the robustness verification means 500 perform probability statistical calculations based on the likelihood, IOU value, and class identification accuracy rate, they may further be characterized by having a function of excluding from the calculation targets images in which pixels related to the target detection object are missing at an arbitrary rate.
  • it becomes possible to specify the reinforcement targets of the model learning dictionary 320 based on the detected objects and judgment conditions extracted using the learning reinforcement necessary item extraction means 530, and it is also possible to extract problems of the object detection model 300. Furthermore, the versatility and robustness of the model learning dictionary 320 can be enhanced by inputting this extracted information 531 into the dictionary learning means 600, described later in Embodiment 2, and reflecting it in the selection of learning materials, the augmentation method, and the learning parameters.
  • the configuration that inputs to the robustness verification means 500 the detection result 403, in which the various processing parameters 510, the position information 401 including the second detection frame, and the second likelihood information 402 are individually linked for each detected object, and analyzes the versatility and robustness of the object detection model 300, is an embodiment of the performance indexing device 10 in object detection of the present invention.
  • the performance indexing device for object detection of the present invention may further include the dictionary learning means 600 of Embodiment 2, described later, for generating the model learning dictionary 320, and a second mAP calculation means 650.
  • FIGS. 10 and 11 show the results of analyzing the irregular variation phenomenon in the likelihood and other detection results of the object detection model 300 with respect to the position in the image and the size of the detected object.
  • the analysis results shown in FIGS. 10 and 11 are obtained by setting Xi, the number of pixels in the horizontal direction of the plurality of model input images shown in FIGS. 7A, 7B, 8, and 9, to 128, and Yi, the number of pixels in the vertical direction, to 128. The detection target is one person.
  • the resizing function 230 of the model preprocessing means 200 is used to resize the input into three types (L = 3): a standard size image 232, a 30% reduced image 231, and a 30% enlarged image 233.
  • the analysis results shown in FIGS. 10 and 11 are obtained by inputting these images to the YOLO model 360 (object detection model) shown in FIGS. 3A and 3B, which has 128 input pixels in the horizontal direction and 128 in the vertical direction, and calculating, from the position information 401 including the second detection frame and the second likelihood information 402 for the one person, the likelihood distribution 540 indicating the dispersion, the average likelihood 501, which is the average value of the valid region of the likelihood, the histogram 550 of the likelihood, the standard deviation of likelihood 502, which is the standard deviation of the valid region of the likelihood, the maximum likelihood 503, which is the maximum value of the valid region of the likelihood, and the minimum likelihood 504, which is the minimum value of the valid region of the likelihood.
  • the likelihood distribution 540, average likelihood 501, likelihood histogram 550, standard deviation of likelihood 502, maximum likelihood 503, and minimum likelihood 504 may be expressed as percentages (%), with the maximum likelihood value "1" corresponding to "100%". In this example, the likelihood is expressed in percent (%); note that it is also possible to process the value directly as a decimal without converting it to a percentage.
  • similarly, the class identification distribution and statistical results for the class identification information in the second likelihood information 402 may be calculated.
  • the various processing parameters 511 to 513 shown in FIG. 10 are linked to the position information 401 including the second detection frame and the second likelihood information 402, and may be used when the probability statistical calculation means 520 calculates analysis results for each of the various processing parameters.
  • the likelihood distributions 541, 542, and 543 shown in FIG. 10 are displayed in shades from white (corresponding to 0% likelihood) to black (corresponding to 100% likelihood) according to the gray scale bar 521, representing the level of likelihood (%) with respect to fluctuations in the position (in pixels) of the person on the screen. This corresponds to a mapping of the calculated likelihoods: the stronger the black level, the higher the likelihood, and conversely, the stronger the white level, the lower the likelihood.
  • S, N, and M, which are the various processing parameters 510 of the position shift function 220, may be changed depending on the use and purpose.
  • S, which is the pixel step setting, may be set to different values in the horizontal and vertical directions. Setting S to a small value has the advantage of allowing detailed verification, but the disadvantage of increasing calculation processing time.
  • the processing parameters for position shifting N times in the horizontal direction and M times in the vertical direction are preferably set to appropriate values that allow verification of positional fluctuations, depending on the structure of the object detection model 300.
  • the likelihood histogram 551 shown in FIG. 11 is obtained by normalizing the frequency of the likelihood (%) calculated by the probability statistical calculation means 520 for the likelihood distribution 541 shown in FIG. 10 (so that the total becomes 1.0). The statistical result 561 displays the average likelihood (%), standard deviation of the likelihood (%), maximum likelihood (%), and minimum likelihood (%) for the likelihood distribution 541. The likelihood 571 of the conventional method corresponds to the likelihood calculated by the conventional first performance indexing device 30 described above, that is, the pinpoint likelihood calculated for the model input image 231 serving as the reference image for the position shift. Similarly, the likelihood histograms 552 and 553, statistical results 562 and 563, and likelihoods 572 and 573 of the conventional method shown in FIG. 11 correspond to the likelihood distributions 542 and 543 shown in FIG. 10, respectively.
  • the average likelihood (%) in the statistical results 561, 562, and 563 is an index for verifying average detection accuracy and detection performance against fluctuations in position within the screen; the higher the value, the higher the performance of the object detection model 300 including the model learning dictionary 320 can be considered to be. The standard deviation (%) of the likelihood is an index of the dispersion of the likelihood against fluctuations in position within the screen; the smaller it is, the higher the stability of the object detection model 300 including the model learning dictionary 320 can be considered to be.
  • if the standard deviation (%) of the likelihood is large, it is possible that either there is a problem with the object detection model 300 itself or the learning of the model learning dictionary 320 for the detected object position on the screen is insufficient. By checking the likelihood distributions 541, 542, and 543 explained with FIG. 10, it is possible to verify which factor is stronger. Furthermore, by verifying the maximum likelihood (%) and the minimum likelihood (%), it is also possible to determine whether the dispersion of the likelihood is close to a normal distribution. The higher the maximum likelihood (%) and the minimum likelihood (%), the higher the performance of the object detection model 300 including the model learning dictionary 320 can be considered to be; conversely, if they become extremely low, either there is a problem with the object detection model 300 itself or the learning of the model learning dictionary 320 for the detected object position on the screen is insufficient.
  • this example shows the case where the detection target is one person; when there are multiple detection targets or multiple objects of classes other than person, the likelihood distribution and its statistical results, the IOU distribution and its statistical results, and the class identification distribution and its statistical results may be calculated for each detection target.
  • using FIGS. 10 and 11, which show the verification results calculated by the performance indexing device 10 in object detection according to Embodiment 1 of the present invention for the model inputs resized into three types, in which one person exists as the detected object, an example of the verification method used when performing verification, issue analysis, and factor analysis is shown below.
  • the verification method described in this example assumes a case where the YOLO model 360 is implemented as an electronic circuit in order to miniaturize, save power, and reduce the cost of object detection cameras, and where the input image size of the YOLO model 360 must be made smaller than the originally recommended size due to limitations in circuit area or power consumption, memory capacity, or the performance of arithmetic processors such as the installed DSP (digital signal processor). The phenomenon described here does not always occur with the various recommended variations of the YOLO model 360.
  • checking the likelihood distributions 541, 542, and 543, it can be confirmed that there are regions where the gray or white level is strong, that is, where the likelihood is low, in a particular grid-like pattern. Therefore, as explained with FIG. 7A, even when the same object is detected, if the position of the detected object in the image fluctuates, a phenomenon occurs in which the likelihood, which is one of the detection results, varies greatly; this pattern is considered to be a cause.
  • the specific grid pattern seen in the likelihood distributions 541 and 542 is characterized by a pattern of about 8 pixels square, and that seen in the likelihood distribution 543 by a pattern of about 16 pixels square. This is considered to be related to the grid cells into which the YOLO model 360 divides the region, which can be set to any size, in order to detect the position of the object and identify the class (classification) at the same time, that is, in order to calculate the conditional class probability Pr(Class_i | Object) 315.
  • the likelihood (%) 572 of the conventional method for the standard size is 49.27%, which is much lower than the likelihood (%) 571 of the conventional method for the size reduced by 30%. Simply checking this result may therefore lead to the erroneous conclusion that the learning of the model learning dictionary 320 for a person of the standard size is insufficient, and unnecessary additional learning may be performed. Conversely, the likelihood (%) 571 of the conventional method for the size reduced by 30% is 70.12%, which would be considered a passing score, so additional learning might not be performed in the first place, leaving the enhancement of the versatility and robustness of the model learning dictionary 320 insufficient.
  • the likelihood histograms 551, 552, and 553 indicate at what level the likelihoods of the likelihood distributions 541, 542, and 543 in FIG. 10 exist. Performance can be considered better when the occurrence frequency is concentrated at the right end, where the likelihood is high, and more stable the less variation there is. Checking the likelihood histograms 551, 552, and 553, it can be seen that, unlike the likelihoods (%) 571, 572, and 573 of the conventional method, the likelihoods (%) are distributed in descending order of person size.
  • checking the statistical results 561, 562, and 563, which are the results of statistical analysis of the likelihood distributions 541, 542, and 543 shown in FIG. 10 and the likelihood histograms 551, 552, and 553 shown in FIG. 11, it can be seen that the average likelihood (%) increases with person size in the order 60.85% ⇒ 71.82% ⇒ 89.98%, approaching the originally expected ideal. It was thus confirmed that the likelihood (%) results 571, 572, and 573 of the conventional method suffer from the problem that the detection results fluctuate depending on the position of the person in the image relative to the specific grid pattern.
  • suppose, for example, that the development goal of the model learning dictionary 320 was to achieve an average likelihood (%) of 70% or more. Setting the average likelihood (%) threshold to 70%, in the case of the size reduced by 30%, the likelihood (%) result of the conventional method (70.12%) appeared to achieve the goal only by chance; the average likelihood of 60.85% falls more than 9% short of the threshold, so it becomes possible to find out that reinforcement through additional learning is necessary for the person reduced by 30%.
  • the maximum likelihood (%) and minimum likelihood (%) can also be used as materials for various judgments. For example, if the minimum likelihood threshold is set to 30%, then for the standard-size person and the person reduced by 30%, whose minimum likelihoods fall to 30% or less, there is a risk of the object becoming undetectable if it stays at the corresponding position; such latent issues and problems can thus be extracted in advance.
  • next, a model input image 526 of 128 pixels in the horizontal direction and 128 pixels in the vertical direction, in which a person is located far away (at the top of the screen), is used as the reference image, and the results of calculating the likelihood distribution 544 by the position shift of the model preprocessing means 200 and by the robustness verification means 500, including the probability statistical calculation means 520 and the learning reinforcement necessary item extraction means 530, are shown.
  • the likelihood distribution 544 is displayed in shades from white (equivalent to 0% likelihood) to black (equivalent to 100% likelihood) according to the gray scale bar 521, representing the level of likelihood (%) with respect to fluctuations in the position (in pixels) of the person on the screen.
  • it can be seen that the upper side of the likelihood distribution 544, that is, the area 527 surrounded by the dotted line, is an area where the white level is stronger and the likelihood is lower than in other areas. The area 527 surrounded by the dotted line can be considered to correspond to cases where the person exists in the area 528 surrounded by the dotted line, which extends to the lower right of the center of the person in the model input image 526.
  • the specific grid pattern described above can also be observed, but since the area 527 surrounded by the dotted line has a particularly low likelihood, the learning reinforcement necessary item extraction means 530 can extract it as a concentrated low-likelihood region. It can thus be confirmed that object detection ability is low when the person in the model input image is located in the area 528 surrounded by the dotted line, and it can be recognized that the model learning dictionary 320 needs to be strengthened there. This leads to efficient reinforcement of the model learning dictionary 320 by the dictionary learning means 600 of Embodiment 2, described later.
  • the verification method in this example shows the case where the detection target is one person; when there are multiple detection targets or objects of classes other than person, the reinforcement targets of the model learning dictionary 320 may be specified based on the detected objects and judgment conditions extracted using the learning reinforcement necessary item extraction means 530 from the likelihood distribution and its statistical results, the IOU distribution and its statistical results, and the class identification distribution and its statistical results for each detection target.
  • problems for the object detection model 300 may be extracted.
  • the versatility and robustness of the model learning dictionary 320 may be enhanced by a dictionary learning means 600, which will be described later, with reference to the extracted information 531.
  • the object detection model 300 may be another one-stage DNN model such as SSD.
  • the present invention may be applied to a two-stage DNN model such as EfficientDet, which processes object position detection and class identification in two stages.
  • it may be applied to object detection models and machine learning models that do not use neural networks.
  • by using the performance indexing device 10 in object detection of the present invention, composed of the image processing means 100, model preprocessing means 200, object detection model 300, model post-processing means 400, and robustness verification means 500 described so far in Embodiment 1, the following usefulness and effects can be expected.
  • by checking the likelihood distribution 540 for the position of each detected object with respect to the plurality of model input images 210 processed by the position shift function 220 of the model preprocessing means 200, it becomes possible to extract features whose likelihoods fluctuate with fluctuations in the position of the detected object on the screen due to potential problems of the neural network itself, including the DNN model in the object detection model. This makes it possible to accurately identify issues related to accuracy and performance during inference.
  • since methods and measures for solving problems can thereby be formulated effectively, the detection accuracy and detection performance of the object detection model can be improved.
  • the robustness verification means 500 further comprises the probability statistical calculation means 520, which calculates the likelihood distribution 540 indicating the dispersion due to the position shift of each detected object, the average likelihood 501, which is the average value of the valid region of the likelihood, the histogram 550 of the likelihood, the standard deviation of likelihood 502, which is the standard deviation of the valid region of the likelihood, the maximum likelihood 503, which is the maximum value of the valid region of the likelihood, and the minimum likelihood 504, which is the minimum value of the valid region of the likelihood.
  • this makes it possible to extract features whose likelihoods fluctuate due to fluctuations in the detected object position on the screen, and to extract issues of the object detection model 300 quantitatively. Since methods and measures for solving problems can thus be formulated more effectively, the detection accuracy and detection performance of the object detection model 300 can be further improved. Furthermore, when combined with various processing parameters 510 other than the position shift, it becomes possible to accurately understand weaknesses and reinforcement policies in versatility and robustness against various fluctuation conditions caused by the model learning dictionary 320 created by deep learning and the like, as well as problems that may exist within the neural network itself, including the DNN model. Therefore, effective learning image data and supervised data can be applied in deep learning and the like, making it possible to enhance the versatility and robustness of the model learning dictionary 320.
  • the robustness verification means 500 further comprises the probability statistical calculation means 520, which calculates, for each detected object, the IOU distribution and class identification accuracy rate distribution showing the variations due to position shifts, the average IOU value, the average class identification accuracy rate, the histogram of IOU values, the histogram of the class identification accuracy rate, the standard deviation of the IOU value, and the standard deviation of the class identification accuracy rate. This makes it possible to accurately understand weaknesses and reinforcement policies in versatility and robustness against various fluctuation conditions caused by the model learning dictionary 320 created by deep learning and the like, as well as problems that may exist within the neural network itself, including the DNN model. Therefore, more effective learning image data and supervised data can be applied in deep learning and the like, making it possible to enhance the versatility and robustness of the model learning dictionary.
  • The model preprocessing means 200 may further generate enlarged or reduced images using L (arbitrary integer) types of arbitrary magnifications as various processing parameters 510, and then apply the above-mentioned position shift.
  • The robustness verification means 500, including the probability statistical calculation means 520, can thereby check, for each of the L sizes, the likelihood distribution 540 for the position of each detected object, the average likelihood 501 over the valid region of the likelihood, the likelihood histogram 550, the standard deviation of likelihood 502, the maximum likelihood 503, the minimum likelihood 504, and the IOU value 505.
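A minimal OpenCV sketch of generating the L × N × M model inputs described above (resize to each of the L magnifications, then sweep N × M position shifts in S-pixel steps); the function name, the 416 × 416 model input size, and the parameter values are illustrative assumptions:

```python
import cv2
import numpy as np

def resized_shifted_inputs(image, scales, n, m, s, size=(416, 416)):
    """Yield one model input per (scale, horizontal shift, vertical shift).

    image:  source frame from the image processing means (H x W x 3).
    scales: L arbitrary magnifications, e.g. [0.5, 1.0, 2.0].
    n, m:   number of horizontal / vertical shift steps.
    s:      shift step in pixels (may be fractional).
    """
    for scale in scales:
        scaled = cv2.resize(image, None, fx=scale, fy=scale,
                            interpolation=cv2.INTER_LINEAR)
        for j in range(m):
            for i in range(n):
                # Translate by (i*s, j*s) with an affine warp, cropping or
                # padding to the model input size.
                t = np.float32([[1, 0, i * s], [0, 1, j * s]])
                yield scale, i * s, j * s, cv2.warpAffine(scaled, t, size)
```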
  • The model post-processing means 400 further includes an individual identification means 410, which eliminates abnormal data and corrects the position information including the detection frame and the likelihood information for each detected object into suitable information. This makes it possible to calculate more accurately the likelihood distribution 540 for the position of each detected object, the average likelihood 501 over the valid region of the likelihood, the likelihood histogram 550, the standard deviation of likelihood 502, the maximum likelihood 503, the minimum likelihood 504, and the IOU value 505, and likewise to check more accurately the distribution, histogram, standard deviation, maximum value, and minimum value of the IOU value for the position of each detected object and of the class identification accuracy rate. Therefore, the DNN model can be improved and the versatility and robustness of the model learning dictionary 320 enhanced more accurately.
  • In addition, since the model post-processing means uses the individual identification means 410 to eliminate abnormal data and to correct the position information including the detection frame and the likelihood information into the optimal information for each detected object, the likelihood distribution 540 for the position of each detected object, the average likelihood 501 over the valid region of the likelihood, the likelihood histogram 550, the standard deviation of likelihood 502, the maximum likelihood 503, the minimum likelihood 504, and the IOU value 505 can be calculated accurately by comparison with the correct answer data. The distribution, histogram, standard deviation, maximum value, and minimum value of the IOU value for each detected object position and of the class identification accuracy rate can likewise be checked for accuracy against the correct answer data. Therefore, the DNN model can be improved and the versatility and robustness of the model learning dictionary 320 enhanced more accurately.
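The patent does not fix a concrete algorithm for the individual identification means 410; a common way to realize this kind of correction is likelihood-ranked, IOU-based suppression of duplicate and abnormal frames. The sketch below is written under that assumption, with illustrative thresholds:

```python
def iou(a, b):
    """IOU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def identify_individuals(detections, iou_threshold=0.5, min_likelihood=0.1):
    """Keep one best frame per object from the first detection frames.

    detections: list of (box, likelihood) pairs; returns the corrected
                second detection frames with their likelihoods.
    """
    kept = []
    for box, lh in sorted(detections, key=lambda d: d[1], reverse=True):
        if lh < min_likelihood:      # discard abnormal / noise data
            continue
        if all(iou(box, k[0]) < iou_threshold for k in kept):
            kept.append((box, lh))   # highest-likelihood frame for this object
    return kept
```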
  • The model post-processing means 400 may further associate each output result with the various processing parameters 510 for each detected object and output the results to the robustness verification means.
  • For each of the various processing parameters 510, the robustness verification means 500 further has a learning reinforcement necessary item extraction means 530 that includes any or all of the following: extraction of positions or regions where the likelihood distribution 540 for each detected object is at or below an arbitrary threshold; extraction of detected objects whose average likelihood 501 is at or below an arbitrary threshold; extraction of detected objects whose standard deviation of likelihood 502 is at or above an arbitrary threshold; extraction of detected objects whose maximum likelihood 503 is at or below an arbitrary threshold; extraction of detected objects whose minimum likelihood 504 is at or below an arbitrary threshold; and extraction of detected objects whose IOU value 505 is at or below an arbitrary threshold.
  • For each of the various processing parameters 510, the robustness verification means 500 may further have a learning reinforcement necessary item extraction means 530 that includes either or both of the following: extraction of positions or regions where the IOU distribution for each detected object is at or below an arbitrary threshold, and extraction of detected objects whose class identification accuracy rate is at or below an arbitrary threshold. By having this extraction means operate on the position information including the detection frame and the class identification information, weaknesses in versatility and robustness against various fluctuation conditions caused by the model learning dictionary 320 created by deep learning or the like, along with reinforcement policies, can be separated from the potential problems of the neural network itself, including the DNN model, and understood more accurately. Therefore, effective learning image data and supervised data can be applied through deep learning or the like, making it possible to enhance the versatility and robustness of the model learning dictionary 320.
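As one concrete illustration of this threshold-based extraction, the sketch below flags, per processing parameter, the detected objects whose statistics suggest weak robustness; the dictionary keys and threshold names are assumptions for the example, not terms from the patent:

```python
def extract_reinforcement_items(results, th):
    """Flag (parameter, reasons) pairs that need learning reinforcement.

    results: one dict per (detected object, processing parameter), e.g.
             {"param": "...", "avg": 0.7, "std": 0.2, "max": 0.9,
              "min": 0.1, "iou": 0.4, "class_acc": 0.8}
    th:      thresholds, same keys as above.
    """
    flagged = []
    for r in results:
        reasons = []
        if r["avg"] <= th["avg"]:
            reasons.append("low average likelihood")        # cf. 501
        if r["std"] >= th["std"]:
            reasons.append("unstable likelihood")           # cf. 502
        if r["max"] <= th["max"]:
            reasons.append("low maximum likelihood")        # cf. 503
        if r["min"] <= th["min"]:
            reasons.append("low minimum likelihood")        # cf. 504
        if r["iou"] <= th["iou"]:
            reasons.append("low IOU value")                 # cf. 505
        if r["class_acc"] <= th["class_acc"]:
            reasons.append("low class identification rate")
        if reasons:
            flagged.append((r["param"], reasons))
    return flagged
```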
  • Furthermore, the probability statistical calculation means 520 and the learning reinforcement necessary item extraction means 530 of the robustness verification means 500 perform probability statistical calculations based on the likelihood, IOU value, and class identification accuracy rate. Even if the effective range of the object to be detected is partially missing, depending on the position of the object in the image serving as the reference for verification or its position after processing with the various processing parameters 510 of the model preprocessing means 200, the performance and characteristics of the object detection model 300 and the versatility and robustness of the model learning dictionary 320 can be verified accurately. Therefore, the DNN model can be improved with respect to detected object size, and the versatility and robustness of the model learning dictionary 320 can be enhanced.
  • When processing the plurality of model input images 210 input to the object detection model 300, the model preprocessing means 200 may further be characterized by generating images whose brightness level is changed to arbitrary values using P (arbitrary integer) types of contrast correction curves or gradation conversion curves as various processing parameters 510. After the gradation change, a position shift function 220 applies N (arbitrary integer) horizontal and M (arbitrary integer) vertical position shifts in S (arbitrary decimal) pixel steps, generating a total of N × M × P gradation-converted and position-shifted model input images 210. A function of cutting out an arbitrary area may also be provided. Note that when changing the brightness level using a contrast correction curve or a gradation conversion curve, the function may be implemented by being executed by the image processing processor 290.
  • For example, a case is shown in which three types of gradation-converted images are generated (P = 3), including a low-brightness-level image 261 whose brightness level has been lowered, and a high-brightness-level image 263 whose brightness level has been raised by applying a gradation conversion curve 266 that simulates high-illuminance clear weather, backlighting, overexposure, or a shooting studio illuminated with strong light. N × M position-shifted images are then generated in S pixel steps for each, so that a total of 3 × N × M model input images 210 are generated and processed.
  • As shown in FIGS. 8 and 13, the plurality of model input images 210 processed by the position shift function 220 and gradation conversion function 260 of the model preprocessing means 200 are input to the object detection model 300, and the model post-processing means 400 calculates the position information 401 including the second detection frame and the second likelihood information 402 for each of the plurality of model input images 210. The results are then input, together with the various processing parameters 510, to the robustness verification means 500, which verifies the versatility and robustness of the object detection model 300 for each of the P (arbitrary integer) types of contrast correction curves or gradation conversion curves, as explained in FIGS. 10 and 11.
  • The probability statistical calculation means 520 described above may calculate a likelihood distribution 540 indicating the variation due to the position shift of one person, an average likelihood 501 (the average value over the valid region of the likelihood), a likelihood histogram 550, a standard deviation of likelihood 502 (the standard deviation over the valid region), a maximum likelihood 503 (the maximum over the valid region), and a minimum likelihood 504 (the minimum over the valid region).
  • the IOU value 505 may be calculated.
  • The distribution, histogram, standard deviation, maximum value, and minimum value of the IOU value for each detected object position, and of the class identification accuracy rate, may also be calculated for each of the P (arbitrary integer) types of contrast correction curves or gradation conversion curves.
  • the learning reinforcement necessary item extraction means 530 of the robustness verification means 500 described above may be provided.
  • By equipping the model preprocessing means 200 with the gradation conversion function 260, it becomes possible to improve the DNN model with respect to the brightness levels of the detected object and background, which change depending on weather conditions, shooting time, and the illuminance conditions of the shooting environment, and to enhance the versatility and robustness of the model learning dictionary.
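For illustration, brightness-level variants such as the low- and high-brightness images above can be produced with a lookup table. The gamma-style curve below is only a stand-in for the arbitrary contrast correction or gradation conversion curves of the embodiment:

```python
import numpy as np
import cv2

def tone_converted_variants(image, gammas=(0.5, 1.0, 2.0)):
    """Generate P gradation-converted images from an 8-bit image.

    gammas: exponents of a gamma-style curve; g < 1 raises and g > 1
            lowers the brightness level (P = 3 in this example).
    """
    variants = []
    for g in gammas:
        lut = np.array([((v / 255.0) ** g) * 255.0 for v in range(256)],
                       dtype=np.float64).astype(np.uint8)
        variants.append(cv2.LUT(image, lut))
    return variants
```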
  • When processing the plurality of model input images 210 input to the object detection model 300, the model preprocessing means 200 may further be characterized by generating images whose aspect ratio is changed using Q (arbitrary integer) types of aspect ratios as various processing parameters 510. After the aspect ratio change, a position shift function 220 applies N (arbitrary integer) horizontal and M (arbitrary integer) vertical position shifts in S (arbitrary decimal) pixel steps, generating a total of N × M × Q aspect-ratio-changed and position-shifted model input images 210. A function of cutting out an arbitrary area may also be provided. Note that when changing the aspect ratio using the Q types of aspect ratios, the function may be realized by executing the affine transformation function 291 or the projective transformation function 292 in the image processing processor 290.
  • N × M position-shifted images are generated in S pixel steps for each aspect ratio; in the illustrated case of three aspect ratios, a total of 3 × N × M model input images 210 are generated and processed.
  • As shown in FIGS. 8 and 14, the plurality of model input images 210 processed by the position shift function 220 and aspect ratio change function 250 of the model preprocessing means 200 are input to the object detection model 300, and the model post-processing means 400 calculates the position information 401 including the second detection frame and the second likelihood information 402 for each of the plurality of model input images 210. The results are then input, together with the various processing parameters 510, to the robustness verification means 500, which verifies the versatility and robustness of the object detection model 300. The probability statistical calculation means 520, as explained in FIGS. 10 and 11, may then calculate a likelihood distribution 540 showing the variation due to the position shift of one person, an average likelihood 501 (the average value over the valid region of the likelihood), a likelihood histogram 550, a standard deviation of likelihood 502 (the standard deviation over the valid region), a maximum likelihood 503 (the maximum over the valid region), and a minimum likelihood 504 (the minimum over the valid region).
  • the IOU value 505 may be calculated.
  • The distribution, histogram, standard deviation, maximum value, and minimum value of the IOU value for the position of each detected object for each of the Q (arbitrary integer) types of aspect ratios, and the distribution, histogram, standard deviation, maximum value, and minimum value of the class identification accuracy rate, may be calculated.
  • the learning reinforcement necessary item extraction means 530 of the robustness verification means 500 described above may be provided.
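A minimal sketch of generating the Q aspect-ratio variants by rescaling only the horizontal axis, which is one simple way an affine transformation could realize the change; the ratios are illustrative:

```python
import cv2

def aspect_ratio_variants(image, ratios=(0.8, 1.0, 1.25)):
    """Generate Q images with altered horizontal/vertical ratio.

    ratios: horizontal scale factors applied with the height kept fixed.
    """
    h, w = image.shape[:2]
    return [cv2.resize(image, (int(round(w * r)), h),
                       interpolation=cv2.INTER_LINEAR) for r in ratios]
```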
  • When processing the plurality of model input images 210 input to the object detection model 300, the model preprocessing means 200 may further be characterized by generating images whose rotation angle is changed using R (arbitrary integer) types of angles as various processing parameters 510. After the rotation angle change, a position shift function 220 applies N (arbitrary integer) horizontal and M (arbitrary integer) vertical position shifts in S (arbitrary decimal) pixel steps, generating a total of N × M × R rotation-angle-changed and position-shifted model input images 210. A function of cutting out an arbitrary area may also be provided. Note that when changing the rotation angle using the R types of angles, the function may be realized by executing the affine transformation function 291 or the projective transformation function 292 in the image processing processor 290.
  • N × M position-shifted images are generated in S pixel steps for each angle; in the illustrated case of three angles, a total of 3 × N × M model input images 210 are generated and processed.
  • As shown in FIGS. 8 and 15, the plurality of model input images 210 processed by the position shift function 220 and rotation function 240 of the model preprocessing means 200 are input to the object detection model 300, and the model post-processing means 400 calculates the position information 401 including the second detection frame and the second likelihood information 402 for each of the plurality of model input images 210. The results are then input, together with the various processing parameters 510, to the robustness verification means 500, which verifies the versatility and robustness of the object detection model 300. The probability statistical calculation means 520, as explained in FIGS. 10 and 11, may then calculate a likelihood distribution 540 showing the variation due to the position shift of one person, an average likelihood 501 (the average value over the valid region of the likelihood), a likelihood histogram 550, a standard deviation of likelihood 502 (the standard deviation over the valid region), a maximum likelihood 503 (the maximum over the valid region), and a minimum likelihood 504 (the minimum over the valid region).
  • the IOU value 505 may be calculated.
  • The distribution, histogram, standard deviation, maximum value, and minimum value of the IOU value for the position of each detected object for each of the R (arbitrary integer) types of angles, and the distribution, histogram, standard deviation, maximum value, and minimum value of the class identification accuracy rate, may be calculated.
  • the learning reinforcement necessary item extraction means 530 of the robustness verification means 500 described above may be provided.
  • By equipping the model preprocessing means 200 with the rotation function 240, it becomes possible to improve the DNN model with respect to various rotation angles of the detected object and to enhance the versatility and robustness of the model learning dictionary.
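A minimal sketch of the R rotation-angle variants using an affine warp about the image center; the angles are illustrative, and the blank corners produced here are exactly what the margin padding function 280 described below is meant to fill:

```python
import cv2

def rotated_variants(image, angles=(-10.0, 0.0, 10.0)):
    """Generate R images rotated about the image center (R = 3 here)."""
    h, w = image.shape[:2]
    out = []
    for angle in angles:
        m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
        out.append(cv2.warpAffine(image, m, (w, h)))  # corners become blank
    return out
```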
  • When processing the plurality of model input images 210 input to the object detection model 300, the model preprocessing means 200 may further include a margin padding function 280 that, as shown in 281 to 288 of FIGS. 8, 9, 14, and 15, calculates the average brightness level of the valid image and pastes it into the blank space where no valid image exists as a result of the position shift, resize, aspect ratio change, or rotation processing, so as to generate a uniform image.
  • the blank space may be interpolated using the effective image area existing in the output image of the image processing means 100.
  • the blank space may be filled with images that do not affect learning or inference.
  • Since the model preprocessing means 200 is equipped with the margin padding function 280, the influence of features including the margins on the inference accuracy of the object detection model 300 can be reduced. This makes it possible to calculate more accurately the likelihood distribution 540 for the position of each detected object, the average likelihood 501 over the valid region of the likelihood, the likelihood histogram 550, the standard deviation of likelihood 502, the maximum likelihood 503, the minimum likelihood 504, and the IOU value 505, and to check more accurately the distribution, histogram, standard deviation, maximum value, and minimum value of the IOU value for the position of each detected object and of the class identification accuracy rate. Therefore, the DNN model can be improved and the versatility and robustness of the model learning dictionary enhanced more accurately.
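A minimal numpy sketch of the average-brightness variant of the margin padding function 280; the valid-pixel mask is assumed to be known from the geometric processing that produced the blanks:

```python
import numpy as np

def pad_margins_with_mean(image, valid_mask):
    """Fill blank margins with the average brightness of the valid pixels.

    image:      H x W x C array after shift/resize/aspect/rotation processing.
    valid_mask: H x W boolean array, True where a valid image exists.
    """
    out = image.copy()
    mean_level = image[valid_mask].mean(axis=0)      # per-channel average
    out[~valid_mask] = mean_level.astype(image.dtype)
    return out
```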
  • The various processing parameters 510 used for processing by the model preprocessing means 200 include the resizing function 230, rotation function 240, aspect ratio change function 250, and gradation conversion function 260, each involving the position shift function 220 described above, and a plurality of these processes may also be performed in combination. Further, the probability statistical calculation means 520 and the learning reinforcement necessary item extraction means 530 of the robustness verification means 500 may be used as a method for analyzing the interdependence of the plurality of various processing parameters 510. Furthermore, although detailed explanation was omitted in Embodiment 1, by also using processing such as the dewarp function 270, it becomes possible to improve the DNN model with respect to distortion of the detected object and background, and to enhance the versatility and robustness of the model learning dictionary.
  • By evaluating and verifying the detection performance, detection accuracy, and the versatility and robustness of the object detection model 300 and model learning dictionary 320, including their variations and imperfections, improving the object detection model 300 based on those results, and repeatedly performing deep learning using the dictionary learning means 600 described later in Embodiment 2 to solve problems and strengthen the model, it becomes possible to realize object detection that is highly versatile and robust even under various conditions.
  • FIG. 16 is a block diagram showing a performance indexing device 20 for object detection in an image according to Embodiment 2 of the present invention.
  • Each means, function, process, and step, and each device, method, program, and the like for realizing them are the same as those in Embodiment 1, so their explanation is omitted in the text of Embodiment 2.
  • each means, each function, each process, each step, each apparatus, each method, each program, etc. of the other embodiments described in Embodiment 1 may be used and implemented.
  • each means, each function, and each process described in Embodiment 2 of the present invention described later may be replaced with a step, and each device may be replaced with a method.
  • each means and each device described in Embodiment 2 of the present invention may be realized by a program operated by a computer.
  • Next, the dictionary learning means 600, which performs the deep learning for creating the model learning dictionary 320, one of the components of the object detection model 300, will be described.
  • learning material data that is considered appropriate for the purpose of use is extracted from the learning material database storage means 610 in which material data (image data) for deep learning is stored.
  • The material data for learning stored in the learning material database storage means 610 may utilize a large-scale open-source dataset such as COCO (Common Object in Context) or the Pascal VOC Dataset.
  • Alternatively, image data that the image processing means 100 has displayed via the image output control means 110 according to the purpose of use and that has been stored in the display and data storage means 120 may be utilized.
  • Next, the annotation means 620 adds class identification information and a groundtruth BBox, which is a correct answer frame, to the learning material data extracted from the learning material database storage means 610 to create supervised data.
  • open source datasets such as COCO and Pascal VOC Dataset may be used directly as supervised data without using the annotation means 620 if the data has already been annotated.
  • The supervised data is augmented by the Augment means 630 into learning images 631 to enhance versatility and robustness. The Augment means 630 is equipped with, for example, a means for shifting an image to an arbitrary position in the horizontal and vertical directions, a means for enlarging or reducing an image to an arbitrary magnification, a means for rotating an image to an arbitrary angle, a means for changing the aspect ratio, and a dewarping means for performing distortion correction, cylindrical conversion, and the like; the images are inflated by combining the various means depending on the purpose of use, as sketched below.
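A minimal sketch of such inflation, composing a random shift, magnification, and rotation into one affine warp per copy; the parameter ranges are assumptions for the example, and in practice the groundtruth BBox must be transformed with the same matrix so that the supervised data stays consistent:

```python
import random
import cv2

def augment_once(image, rng):
    """One randomly parameterized shift + scale + rotation of an image."""
    h, w = image.shape[:2]
    angle = rng.uniform(-15.0, 15.0)                  # rotation means
    scale = rng.uniform(0.8, 1.2)                     # enlarge/reduce means
    tx, ty = rng.uniform(-8, 8), rng.uniform(-8, 8)   # position shift means
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, scale)
    m[0, 2] += tx
    m[1, 2] += ty
    return cv2.warpAffine(image, m, (w, h))

def inflate_dataset(images, copies, seed=0):
    """Inflate the supervised images into `copies` variants each."""
    rng = random.Random(seed)
    return [augment_once(im, rng) for im in images for _ in range(copies)]
```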
  • The learning images 631 inflated by the Augment means 630 are input to the deep learning means 640 to calculate the weighting coefficients of the DNN model 310, and the calculated weighting coefficients are converted into, for example, the ONNX format to create the model learning dictionary 320. Note that the model learning dictionary 320 may be created by converting into a format other than the ONNX format.
  • For example, the deep learning means 640 is realized by an open-source learning environment called darknet and an arithmetic processor (including a personal computer or a supercomputer).
  • darknet has learning parameters called hyperparameters; appropriate hyperparameters can be set depending on the usage and purpose, and versatility and robustness can also be strengthened in conjunction with the Augment means 630.
  • the model learning dictionary 320 created by the deep learning means 640 may be configured by an electronic circuit.
  • a learning environment configured using a programming language may be used depending on the DNN model 310 to be applied.
  • Validation material data for verifying detection accuracy, detection performance, versatility, and robustness required for the purpose of use is extracted from the aforementioned learning material database storage means 610.
  • The image data for validation stored in the learning material database storage means 610 may utilize a large-scale open-source validation image dataset such as COCO (Common Object in Context) or the Pascal VOC Dataset.
  • Alternatively, image data in which images for verifying the detection accuracy, detection performance, versatility, and robustness necessary for the purpose of use have been displayed from the image processing means 100 via the image output control means 110 and stored in the display and data storage means 120 may be used.
  • Next, the annotation means 620 adds class identification information and a groundtruth BBox, which is a correct answer frame, to the validation material data extracted from the learning material database storage means 610 to create validation data 623.
  • open source datasets such as COCO and Pascal VOC Dataset may be used directly as validation data 623 without using the annotation means 620 if the data has already been annotated.
  • The validation data 623 is input to a second mAP calculation means 650 equipped with inference (prediction) capability equivalent to that of the object detection model 300, the model post-processing means 400, and the individual identification means 410 described in Embodiment 1. The second mAP calculation means 650 may calculate, over all prediction results for all validation data 623: the IOU value 653 obtained by comparing the groundtruth BBox, which is the correct answer frame, with the Predicted BBox calculated as a result of inference (prediction); Precision 654, which indicates the proportion of predictions whose IOU value 653 was correctly at or above an arbitrary threshold; Recall 655, which indicates the proportion of the actual correct results for which a BBox close to the correct result was predicted with an IOU value 653 at or above an arbitrary threshold; the AP (Average Precision) value 651 for each class, as the aforementioned index for comparing the accuracy and performance of object detection; and the mAP (mean Average Precision) value 652 averaged over all classes.
  • For example, the second mAP calculation means 650 is realized by an open-source inference environment called darknet and an arithmetic processor (such as a personal computer or supercomputer); it is desirable that it have the same inference (prediction) performance as the object detection model 300.
  • Calculation means for the IOU value 653, Precision 654, Recall 655, AP value 651, and mAP value 652 may be provided.
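The sketch below shows a simplified, non-interpolated AP/mAP computation consistent with the description above. It assumes the matching of each Predicted BBox (sorted by confidence) against unmatched groundtruth BBoxes at an IOU threshold has already been done; real evaluations usually add interpolation of the precision-recall curve:

```python
import numpy as np

def average_precision(matches, num_gt):
    """AP 651 for one class from confidence-sorted predictions.

    matches: booleans, True where a prediction matched a groundtruth BBox
             with IOU 653 at or above the threshold.
    num_gt:  number of groundtruth BBoxes for the class.
    """
    matches = np.asarray(matches, dtype=bool)
    tp = np.cumsum(matches)
    fp = np.cumsum(~matches)
    precision = tp / (tp + fp)        # cf. Precision 654
    recall = tp / num_gt              # cf. Recall 655
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):   # stepwise area under the PR curve
        ap += p * (r - prev_r)
        prev_r = r
    return ap

def mean_average_precision(ap_per_class):
    """mAP 652: the AP 651 values averaged over all classes."""
    return float(np.mean(ap_per_class))
```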
  • By providing the second mAP calculation means of Embodiment 2 with the individual identification means 410 of the model post-processing means 400 as described in FIGS. 6A and 6B of Embodiment 1, abnormal data can be excluded and the position information including the detection frame and the likelihood information can be corrected into the optimal information for each detected object. It therefore becomes possible to accurately calculate, by comparison with the correct answer data, the likelihood distribution 540 for the position of each detected object, the average likelihood 501 over the valid region of the likelihood, the likelihood histogram 550, the standard deviation of likelihood 502, the maximum likelihood 503, the minimum likelihood 504, and the IOU value 505. Therefore, the DNN model can be improved and the versatility and robustness of the model learning dictionary enhanced more accurately.
  • For each of the various processing parameters 510, the robustness verification means 500 extracts positions or regions where the likelihood distribution for each detected object is at or below an arbitrary threshold and positions or regions where the average likelihood 501 is at or below an arbitrary threshold, and applies arbitrary thresholds to the distribution, histogram, standard deviation, maximum value, and minimum value of the IOU value for the position of each detected object and of the class identification accuracy rate. Learning images are then prepared based on the results of the learning reinforcement necessary item extraction means 530, and the device may be characterized by re-learning with the built-in or external dictionary learning means 600.
  • By combining position shifts over an arbitrary range near the detected object position with various processing parameters 510 other than position shift, such as the left, right, top, bottom, and depth position of the object in the screen, object size, contrast, gradation, aspect ratio, and rotation, weaknesses in versatility and robustness against various fluctuating conditions caused by the model learning dictionary 320 created by deep learning or the like, along with reinforcement policies, can be separated from the potential issues of the neural network itself, including the DNN model, and understood accurately. Therefore, effective learning image data and supervised data can be applied through deep learning or the like, making it possible to enhance the versatility and robustness of the model learning dictionary 320.
  • probability statistical calculation means 520 calculates the distribution, histogram, standard deviation, maximum value, and minimum value of IOU values for the position of each detected object, and the distribution, histogram, standard deviation, maximum value, and minimum value of the class identification accuracy rate.
  • By evaluating and verifying the detection performance, detection accuracy, and the versatility and robustness of the object detection model 300 and model learning dictionary 320, including their variations and imperfections, the object detection model 300 is improved and the dictionary learning means 600 repeatedly performs deep learning to solve problems and strengthen the model, making it possible to realize object detection with higher detection capability and with high versatility and robustness under various fluctuating conditions.
  • FIG. 18 is a diagram illustrating a summary of the object detection model performance indexing device of the present invention.
  • The performance indexing device, performance indexing method, and program for the object detection model of the present invention operate on image data generated by an image processing means that acquires an image including a detection target and processes it appropriately, and the robustness verification means calculates, for each of the various processing parameters, performance indicators such as the average likelihood and the standard deviation of the likelihood with respect to object position fluctuation. Furthermore, based on the results of the performance indexing, the dictionary learning means performs robustness reinforcement of the model learning dictionary.
  • Each component is configured with dedicated hardware, but it may also be realized by executing a software program suitable for the component.
  • Each component may be realized by a program execution unit such as a CPU or a processor reading and executing a software program recorded on a recording medium such as a hard disk or a semiconductor memory.
  • The software that realizes the performance indexing device and the like of the above embodiments is the following program: namely, a program that causes a computer to execute the performance indexing method described above.
  • the present invention is useful in the technical field of identifying the position and class of an object in an image using an object detection model. Among these, it is particularly useful in the technical field of reducing the size, power consumption, and cost of cameras and the like for detecting objects.
  • Reference signs: 20 Second performance indexing device; 100 Image processing means; 101 Lens; 102 Image sensor; 103, 290 Image processing processor; 110 Image output control means; 120 Display and data storage means; 200 Model preprocessing means; 201, 202, 203, 204, 205, 206, 210, 221, 222, 223, 224, 231, 232, 233, 241, 242, 243, 251, 252, 253, 261, 262, 263, 311, 440, 470, 526 Model input images; 207, 208, 209, 211, 212, 213, 401, 451, 452, 490, 491 Position information including second detection frame; 214, 215, 216, 217, 218, 219, 453, 454, 492, 493 Likelihood in second likelihood information; 220 Position shift function; 230 Resize function; 240 Rotation function; 250 Aspect ratio change function; 260 Tone conversion function; 264, 265, 266 Tone conversion curve; 270 Dewarp function; 280 Margin padding function; 281, 282, 283, 284, 285, 286, 287


Abstract

This performance indexing device comprises: an image processing means that acquires and processes an image; a model preprocessing means that processes an acquired image into a plurality of images in accordance with various processing parameters; an object detection model that includes a model learning dictionary for inferring an object position and a likelihood with respect to the input of the processed plurality of images; a model post-processing means that, for each detection object of the plurality of images, corrects position information that includes a first detection frame, and first likelihood information, so as to obtain position information that includes a second detection frame, and second likelihood information, on the basis of an inference result from the object detection model; and a robustness verification means for verifying the robustness of the object detection model, on the basis of the various processing parameters, and the position information that includes the second detection frame and the second likelihood information that are the output from the model post-processing means.

Description

Performance indexing device, performance indexing method, and program
JP 2021-111228 A
However, with conventional performance indexing devices, performance indexing methods, and programs for object detection models, improvement of versatility and robustness against various fluctuation conditions was sometimes insufficient when a model learning dictionary was trained by deep learning or the like. The present invention has been made in view of the above problems, and aims to provide a performance indexing device, a performance indexing method, and a program for accurately analyzing the performance of a model that detects objects in images, as well as weaknesses in the versatility and robustness of a model learning dictionary and policies for reinforcing them.
A performance indexing device according to one aspect of the present invention is a performance indexing device for an object detection model, comprising: an image processing means that acquires an image and processes it appropriately; a model preprocessing means that processes the image acquired by the image processing means into a plurality of images according to various processing parameters; an object detection model including a model learning dictionary that infers object positions and likelihoods from the input of the plurality of images processed by the model preprocessing means; a model post-processing means that, based on the inference results of the object detection model, corrects position information including a first detection frame and first likelihood information for each detected object in the plurality of images into position information including a second detection frame and second likelihood information having appropriate values; and a robustness verification means that verifies the robustness of the object detection model based on the position information including the second detection frame and the second likelihood information output from the model post-processing means and the various processing parameters.
A performance indexing method according to one aspect of the present invention includes: an image processing step of acquiring and appropriately processing an image; a model preprocessing step of processing the image acquired in the image processing step into a plurality of images according to various processing parameters; an object detection step of inferring object positions and likelihoods from the input of the plurality of images processed in the model preprocessing step, using an object detection model including a model learning dictionary; a model post-processing step of correcting, based on the inference results of the object detection model, position information including a first detection frame and first likelihood information for each detected object in the plurality of images into position information including a second detection frame and second likelihood information having appropriate values; and a robustness verification step of verifying the robustness of the object detection model based on the position information including the second detection frame and the second likelihood information output in the model post-processing step and the various processing parameters.
A program according to one aspect of the present invention is a program for causing a computer to execute the performance indexing method described above.
Note that these comprehensive or specific aspects may be realized by a system, a device, an integrated circuit, a computer program, or a recording medium such as a computer-readable CD-ROM, or by any combination of a system, a device, an integrated circuit, a computer program, and a recording medium.
According to the present invention, a performance indexing device, a performance indexing method, and a program are provided for accurately analyzing the performance of a model that detects objects in images, as well as weaknesses in the versatility and robustness of a model learning dictionary and policies for reinforcing them.
[Brief description of the drawings: a performance indexing device for an object detection model according to an embodiment of the present invention; the configuration of an artificial neuron model; the configuration and operating principle of a YOLO model according to an embodiment; the concept of calculating the IOU value in object detection; flowcharts and operation diagrams of the individual identification means of the model post-processing means; first and second diagrams of the problems of a conventional object detection model performance indexing device; the operation of the position shift, resize, gradation conversion, aspect ratio change, and rotation functions of the model preprocessing means; the operation of the probability statistical calculation means of the robustness verification means; a conventional object detection model performance indexing device; and a summary of the object detection model performance indexing device of the present invention.]
(Knowledge underlying the disclosure)

In recent years, edge AI and cloud AI equipped with AI functions have spread rapidly. AI (artificial intelligence) models the neurons of the human brain, and a wide variety of models have been developed to detect objects in images. In the human analogy, it is common to detect where a target object is from visual information (an image) and to perform class identification, that is, to identify which class the object belongs to, such as person or vehicle. Object detection models often use a CNN (Convolutional Neural Network), a convolutional neural network. In recent years, after preparing a large amount of supervised data in which class labels and correct answer frame information, the groundtruth Bounding Box (hereinafter ground truth BBox), are attached to image data, end-to-end learning by deep learning has become mainstream: for example, using gradient descent, binary cross entropy is used as the error function for the problem of classifying object versus background, the L1 norm (absolute error) is used as the error function for the regression problem of the deviation from the ground truth BBox, and all of these error functions are minimized to learn the CNN weighting coefficient information (model learning dictionary). Models such as Faster R-CNN, EfficientDet, SSD, and YOLO (You Only Look Once) (see, for example, Non-Patent Document 1) are increasingly used for object position detection and class identification.
As a means of checking the performance of an object detection model, one index of the detection reliability of a target object is, in the case of YOLO, one of the object detection models mentioned above, the confidence score shown in (Formula 1) below (see, for example, Non-Patent Document 1). The confidence score is also commonly referred to as the likelihood.
Confidence score (likelihood) = Pr(Class_i | Object) × Pr(Object) × IOU^truth_pred (Formula 1)
Here, Pr(Class_i | Object) is the class probability that the Object (target object) belongs to class i, and the class probabilities sum to 1 over all classes. Pr(Object) is the probability that an Object is contained in the Bounding Box (hereinafter BBox). IOU^truth_pred is an index of how much two frame regions overlap: the ground truth BBox, which is the correct answer frame information, and the BBox predicted (inferred) by a model such as YOLO. It is calculated as the IOU (Intersection Over Union) value shown in (Formula 2) below.
IOU = Area of Intersection ÷ Area of Union (Formula 2)
Here, Area of Union is the area of the union of the two frame regions being compared, and Area of Intersection is the area of their common portion.
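As a quick numeric check of (Formula 2), this mirrors the iou() helper used in the individual identification sketch earlier in this document:

```python
def iou(a, b):
    """(Formula 2): Area of Intersection / Area of Union for two boxes
    given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# Intersection = 5 * 10 = 50, union = 100 + 100 - 50 = 150:
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # -> 0.3333...
```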
For example, when inference is performed with, e.g., YOLO on an image captured by a camera, using a model learning dictionary trained by deep learning, no supervised data such as the groundtruth BBox (correct answer frame) exists, so the result calculated with IOU^truth_pred set to 1 is also sometimes called the likelihood (confidence score). Using this likelihood, it is possible to index the detection accuracy and detection performance for a detection target in, for example, an image captured by a camera. Furthermore, by creating supervised data in which the groundtruth BBox (correct answer frame) is attached to the captured images, the original confidence score (likelihood) and the IOU value can also be calculated, so the detection accuracy and detection performance of the object detection model, including the model learning dictionary, for detection targets in the image can be indexed.
In addition, mAP (mean Average Precision) and AP (Average Precision) are often used as indices for comparing the accuracy and performance of object detection (see, for example, Non-Patent Document 2).
mAP and AP in object detection are calculated by the following method.
Validation data is prepared by attaching a groundtruth BBox, which is the correct answer frame, to the target detected objects in multiple images, and the IOU value is calculated by comparison with the Predicted BBox calculated as a result of inference (prediction) by the object detection model. Precision, which indicates the proportion of all prediction results for the validation data in which the IOU was correctly predicted at or above an arbitrary threshold, and Recall, which indicates the proportion of the actual correct results for which a BBox close to the correct result was predicted with an IOU at or above an arbitrary threshold, are then calculated. The AP is calculated, for each class to be identified, as the total area under the two-dimensional Precision-Recall graph as the aforementioned probability that the Object is contained in the BBox runs from its minimum of 0 to its maximum of 1, and the mAP is calculated by averaging the APs calculated for all identification classes. In addition to indexing the average detection accuracy and detection performance of an object detection model, including its model learning dictionary, for detection targets in images, these are often also used as performance indices for various kinds of robustness, although this depends on how the validation data is selected.
FIG. 17 is a block diagram showing a conventional performance indexing device for analyzing the robustness and reinforcement policy of a model learning dictionary for a model that detects the position of an object in an image and identifies its class.
The image processing means 100, which acquires and appropriately processes images, has a lens (for example, standard zoom, wide-angle zoom, or fisheye), an image sensor, which is a device that receives the light emitted from an object through the lens and converts the brightness of the light into electrical information, and an image processing processor equipped with a black level adjustment function, HDR (high dynamic range) synthesis function, gain adjustment function, exposure adjustment function, defective pixel correction function, shading correction function, white balance function, color correction function, gamma correction function, local tone mapping function, and the like; it applies image processing that makes the object to be detected easy to see or find while absorbing time-series fluctuation conditions such as illuminance in the shooting environment.
The image generated by the image processing means 100 is input to the image output control means 110 and sent to the display and data storage means 120, such as a monitor, an external memory of a PC (personal computer), or a cloud server.
Meanwhile, in order to perform object detection with the object detection model 300, the image data generated by the image processing means 100 is input to the model preprocessing means 200 and processed into an image suitable for input to the object detection model 300. The model preprocessing means 200 may be configured with an electronic circuit, or may be realized by an image processing processor 290 composed of an affine transformation function 291, a projective transformation function 292 (libraries), and a CPU or arithmetic processor.
The image processed by the model preprocessing means 200 is input to the object detection model 300, which detects, by inference (prediction), where the target object is and identifies which class the object falls under, such as person or vehicle (class identification). As a result, for each detected object present in one image, the object detection model 300 outputs zero or more pieces of position information 301 including a first detection frame, covering non-detections and false detections, together with first likelihood information 302. Here, the position information 301 including the first detection frame is, for example, information including the center coordinates, horizontal width, and vertical height of the detection frame, and the first likelihood information 302 is, for example, the likelihood indicating detection accuracy and class identification information.
The object detection model 300 comprises, for example, a model learning dictionary 320 and a deep neural network (DNN) model 310 using a convolutional neural network (CNN). As the DNN model 310, YOLO (see, for example, Non-Patent Document 1) or SSD, models with a strong advantage in detection processing speed, may be used. When detection accuracy is prioritized, Faster R-CNN or EfficientDet, for example, may be used instead. When class identification is the main task and object position detection is not performed, MobileNet, for example, may be used. The model learning dictionary 320 is a collection of weight coefficient data for the DNN model 310 and, in the case of the DNN model 310, is initially trained or retrained by the deep learning means 640.
The position information 301 including zero or more first detection frames, which may include non-detections and false detections, and the first likelihood information 302, output from the object detection model 300 for each detected object present in one image, are input to the model post-processing means 400. There they are corrected into the position information 401 including a second detection frame and the second likelihood information 402 considered most appropriate for each detected object, by means such as selection based on the mutual IOU values of the position information 301 including the first detection frames and determination of the maximum of the first likelihood information 302, and are then transmitted to the display and data storage means 120, such as a monitor, an external memory of a PC (personal computer), or a cloud server. Here, the position information 401 including the second detection frame is, for example, information including the center coordinates, horizontal width, and vertical height of the detection frame, and the second likelihood information 402 is, for example, a likelihood indicating detection accuracy together with class identification information.
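The selection by mutual IOU values and maximum likelihood determination corresponds to what is commonly implemented as non-maximum suppression. The following is a minimal sketch under that assumption, with boxes given as (cx, cy, w, h) tuples; the function names and the threshold value are illustrative.

```python
def iou(a, b):
    """IOU of two boxes given as (cx, cy, w, h)."""
    ax0, ay0, ax1, ay1 = a[0]-a[2]/2, a[1]-a[3]/2, a[0]+a[2]/2, a[1]+a[3]/2
    bx0, by0, bx1, by1 = b[0]-b[2]/2, b[1]-b[3]/2, b[0]+b[2]/2, b[1]+b[3]/2
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def select_second_frames(boxes, scores, iou_thresh=0.5):
    """Keep the maximum-likelihood box per object; suppress overlapping ones."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in kept):
            kept.append(i)
    return kept  # indices of the surviving "second detection frames"
```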
The series of means that generates the position information 401 including the second detection frame and the second likelihood information 402 by the image processing means 100, the model preprocessing means 200, the object detection model 300, and the model post-processing means 400 constitutes the first performance indexing device 30 for analyzing the robustness of, and reinforcement policies for, a model learning dictionary of a model that detects the positions of objects in an image and identifies their classes.
Next, an example of deep learning for creating the model learning dictionary 320 will be described.
First, learning material data considered appropriate for the intended use is extracted from the learning material database storage means 610, which stores material data for deep learning such as large-scale open-source datasets. As the material data for learning, images required for the intended use may also be utilized, for example image data generated by the image processing means 100 and stored in the display and data storage means 120 via the image output control means 110.
Next, the annotation means 620 adds class identification information and a ground truth BBox, which is the correct answer frame, to the learning material data extracted from the learning material database storage means 610 to create supervised data.
Next, the supervised data generated by the annotation means 620 is augmented by the Augment means 630 into learning images 631 in order to enhance versatility and robustness.
Next, the learning images 631 are input to the deep learning means 640, the weight coefficients of the DNN model 310 are calculated, and the calculated weight coefficients are converted into, for example, the ONNX format to create the model learning dictionary 320. By reflecting the model learning dictionary 320 in the object detection model 300, it becomes possible to detect the positions of objects in an image and to identify their classes.
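As an illustration of this last step, exporting trained weight coefficients to the ONNX format might look as follows, assuming a PyTorch training environment (this disclosure does not specify a framework); the stand-in network and the input size are hypothetical.

```python
import torch
import torch.nn as nn

# A stand-in network; in practice this would be DNN model 310 (e.g. YOLO)
# whose trained weight coefficients form the model learning dictionary 320.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 8, 1))
model.eval()

dummy = torch.zeros(1, 3, 416, 416)  # assumed input size, for illustration only
torch.onnx.export(model, dummy, "model_learning_dictionary.onnx",
                  input_names=["image"], output_names=["predictions"])
```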
Next, an example of the second performance indexing device 40 for analyzing the robustness of, and reinforcement policies for, a model learning dictionary of a model that detects the positions of objects in an image and identifies their classes will be described.
Validation material data for verifying the detection accuracy, detection performance, versatility, and robustness required for the intended use is extracted from the aforementioned learning material database storage means 610. As the validation material data, images for this verification may be taken, for example, from large-scale open-source datasets, or from image data generated by the image processing means 100 and stored in the display and data storage means 120 via the image output control means 110.
Next, the annotation means 620 adds class identification information and a ground truth BBox, which is the correct answer frame, to the validation material data extracted from the learning material database storage means 610 to create validation data 623.
Next, the validation data 623 is input to the first mAP calculation means 660, which is capable of inference (prediction) equivalent to that of the object detection model 300, and the following are calculated (see, for example, Non-Patent Document 2): an IOU value 653 obtained by comparing the ground truth BBox, which is the correct answer frame, with the Predicted BBox calculated as the result of inference (prediction); a Precision 654 indicating the proportion, among all prediction results for all of the validation data 623, of predictions whose IOU value 653 correctly reached or exceeded an arbitrary threshold; a Recall 655 indicating the proportion, among the actual correct results, of BBoxes that could be predicted at positions close to the correct results with an IOU value 653 at or above an arbitrary threshold; an AP (Average Precision) value 651 for each class, as an index for comparing the object detection accuracy and performance described above; and an mAP (mean Average Precision) value 652 averaged over all classes. Here, when YOLO is applied to the DNN model 310, for example, the first mAP calculation means 660 comprises an open-source inference environment called darknet and an arithmetic processor (including a personal computer or a supercomputer), and desirably has inference (prediction) performance equivalent to that of the object detection model 300. It further comprises means for calculating the aforementioned IOU value 653, Precision 654, Recall 655, AP value 651, and mAP value 652.
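As a rough sketch of how Precision 654 and Recall 655 relate to an IOU threshold, the following reuses the iou helper from the earlier sketch; the greedy one-to-one matching is a simplifying assumption and not necessarily the exact procedure of this disclosure.

```python
def precision_recall(preds, gts, iou_thresh=0.5):
    """preds/gts: lists of (cx, cy, w, h) boxes for one class and one image set.

    Greedy one-to-one matching; each ground truth may be matched at most once.
    Precision = TP / all predictions, Recall = TP / all ground truths.
    """
    matched = set()
    tp = 0
    for p in preds:
        best_j, best_iou = None, iou_thresh
        for j, g in enumerate(gts):
            if j not in matched and iou(p, g) >= best_iou:
                best_j, best_iou = j, iou(p, g)
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return precision, recall
```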
The series of means that generates the IOU value 653, Precision 654, Recall 655, AP value 651, and mAP value 652 by the learning material database storage means 610, the annotation means 620, and the first mAP calculation means 660 constitutes the second performance indexing device 40 for analyzing the robustness of, and reinforcement policies for, a model learning dictionary of a model that detects the positions of objects in an image and identifies their classes.
Meanwhile, when a model learning dictionary of an object detection model is initially trained or retrained using image data augmented (extended data) by Augmentation and learning parameters, if the extended data does not meet the quality required of learning data, it becomes noise and can degrade the quality and efficiency of learning. A method has therefore been proposed for improving the quality of extended data for learning, comprising: means for determining, for each piece of original data representing a determination target, editing parameters for a plurality of pieces of learning data obtained by editing that original data; means for generating, from the original data based on the parameters, a plurality of pieces of learning data each representing the determination target; and means for training a model using each of the plurality of pieces of learning data (see, for example, Patent Document 1).
However, with conventional performance indexing devices, methods, and programs for object detection models, it has been difficult to accurately separate the issues concerning inference accuracy and performance that the neural network itself, including the DNN model within the object detection model, potentially has, from the weaknesses in versatility and robustness against various variation conditions, and the corresponding reinforcement policies, that originate in a model learning dictionary created by deep learning or the like. This becomes a cause of neural network performance failures and of inadequate enhancement of the versatility and robustness of the model learning dictionary.
The items of versatility and robustness and the various variation conditions for a model that detects objects in images acquired by a camera or the like include, for example: the background (scenery); the camera lens specifications; the mounting height and elevation/depression angle of the camera; the detection target area and field-of-view range, including the image size; the dewarp processing method when a fisheye lens is used; special conditions such as illuminance changes depending on sunlight or lighting, crushed blacks, blown-out highlights, and backlight; weather conditions such as sun, cloud, rain, snow, and fog; the position of the target detection object in the image (left/right, up/down, and depth), its size, its shape and features including luminance level and color information, its aspect ratio, and its rotation angle; the number of target detection objects and their state of mutual overlap; the type, size, and attachment position of accessories; the presence or absence of an IR cut filter on the lens; the moving speed of the target detection object; and the moving speed of the camera itself.
As a first problem, consider indexing the performance of an object detection model using the first performance indexing device 30 shown in FIG. 17, in which the image processing means 100, the model preprocessing means 200, the object detection model 300, and the model post-processing means 400 generate the position information 401 including the second detection frame and the second likelihood information 402, or the corresponding performance indexing method (hereinafter sometimes simply referred to as a method) or program. When the position and size of the detection target fluctuate over time in the image, the inferred (predicted) position information including the detection frame and the likelihood information may vary in characteristic patterns, owing to problems arising from the configuration conditions and algorithm of the DNN model, even though the same object is being detected. This phenomenon is considered to appear particularly prominently when the image size input to the DNN model is reduced, for example because of limitations in the performance of an on-board arithmetic processor such as a DSP (digital signal processor) imposed when making cameras for object detection smaller, more power-efficient, and lower-cost. For example, when a one-stage DNN model typified by YOLO, which is regarded as having a strong advantage in processing speed because it detects object positions and identifies classes simultaneously, is used, inferring on multiple images created by shifting the position of an object in the image horizontally and vertically in steps of a few pixels and examining the likelihood distribution with respect to the position of the detected object, as in FIG. 10, can reveal locations where the likelihood drops in a characteristic grid-like pattern depending on the position of the object. In the case of YOLO, for example, this is considered a potential problem that arises because, as shown in FIG. 3B, the region is divided into grid cells of an arbitrary size and class probabilities are computed per cell in order to detect object positions and identify classes simultaneously. On the other hand, when a two-stage DNN model typified by EfficientDet, which processes object position detection and class identification in two separate stages, is used, such problems are often less likely to occur than with the one-stage DNN model described above, but the detection speed decreases, so application may be difficult depending on the intended use.
For this reason, with a performance indexing device, method, or program that infers (predicts) position information including the detection frame and likelihood information only on the original image (pinpoint), accurate detection accuracy and performance cannot be grasped, so it has sometimes been impossible to extract the potential problems of the neural network itself or to formulate means of solving them. Furthermore, the understanding of the weaknesses of the model learning dictionary, which is one of the components of the object detection model, and of the conditions requiring reinforcement becomes insufficient. As a result, when the model learning dictionary is trained by deep learning or the like, the improvement in versatility and in robustness against various variation conditions has sometimes been insufficient.
As a second problem, consider indexing the performance of an object detection model using the second performance indexing device 40 shown in FIG. 17, in which the learning material database storage means 610, the annotation means 620, and the first mAP calculation means 660 generate the IOU value 653, Precision 654, Recall 655, AP value 651, and mAP value 652 for a model that detects the positions of objects in an image and identifies their classes, or the corresponding method or program. The overall and average detection accuracy and detection performance for the validation data selected for verification can be grasped, but versatility and robustness against various variation conditions cannot be grasped in detail. In addition, when the first performance indexing device 30, method, and program described above are applied to the first mAP calculation means 660 in FIG. 17 in order to separately analyze the versatility and robustness of the model learning dictionary with respect to the various variation conditions of each piece of validation data, accurate detection accuracy and performance may not be grasped, as described in the first problem above, so the understanding of the weaknesses of the model learning dictionary and of the conditions requiring reinforcement becomes insufficient. As a result, when the model learning dictionary is trained by deep learning or the like, the improvement in versatility and in robustness against various variation conditions has sometimes been insufficient.
The present invention has been made in view of the above problems, and its object is to provide a performance indexing device, method, and program for accurately analyzing the performance of a model that detects objects in images, the weaknesses in the versatility and robustness of its model learning dictionary, and the corresponding reinforcement policies. A further object is to provide a performance indexing device, method, and program for ensuring object detection accuracy and performance even when limitations are placed on the performance of an on-board arithmetic processor such as a DSP (digital signal processor) in order to make cameras for object detection smaller, more power-efficient, and lower-cost.
(Summary of the Disclosure)

A performance indexing device according to a first aspect of the present invention is a device that performs performance indexing for an object detection model, comprising: an image processing means that acquires an image and processes it appropriately; a model preprocessing means that processes the image acquired by the image processing means into a plurality of images according to various processing parameters; an object detection model, including a model learning dictionary, that infers object positions and likelihoods (degrees of certainty) for the plurality of images processed by the model preprocessing means; a model post-processing means that, based on the inference results of the object detection model, corrects position information including a first detection frame and first likelihood information for each detected object in the plurality of images into position information including a second detection frame and second likelihood information having appropriate values; and a robustness verification means that verifies the robustness of the object detection model based on the position information including the second detection frame and the second likelihood information, which are the output results of the model post-processing means, and on the various processing parameters.
A performance indexing device according to a second aspect, corresponding to an embodiment, is the performance indexing device according to the first aspect, wherein, when processing the plurality of images to be input to the object detection model, the model preprocessing means uses, as the various processing parameters, position shifts of N (an arbitrary integer) steps in the horizontal direction and M (an arbitrary integer) steps in the vertical direction, in increments of S (an arbitrary decimal) pixels, to generate a total of N×M position-shifted images.
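A minimal sketch of this N×M position-shift generation, assuming OpenCV affine translation; filling the margins with the mean luminance of the valid image anticipates the seventh aspect below, and the function name is illustrative.

```python
import cv2
import numpy as np

def position_shifted_images(img: np.ndarray, s: float, n: int, m: int):
    """Yield N x M position-shifted copies of a 3-channel image, in steps
    of S pixels horizontally and vertically.

    Margins where no valid image exists are filled with the per-channel
    mean luminance of the valid image.
    """
    h, w = img.shape[:2]
    fill = tuple(float(v) for v in img.mean(axis=(0, 1)))  # per-channel mean
    for j in range(m):
        for i in range(n):
            t = np.float32([[1, 0, i * s], [0, 1, j * s]])  # translation matrix
            yield cv2.warpAffine(img, t, (w, h),
                                 borderMode=cv2.BORDER_CONSTANT,
                                 borderValue=fill)
```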
A performance indexing device according to a third aspect, corresponding to an embodiment, is the performance indexing device according to the first or second aspect, wherein, when processing the plurality of images to be input to the object detection model, the model preprocessing means further uses, as the various processing parameters, L (an arbitrary integer) kinds of arbitrary magnifications to generate enlarged or reduced images, and then applies to each such image position shifts of N (an arbitrary integer) steps in the horizontal direction and M (an arbitrary integer) steps in the vertical direction, in increments of S (an arbitrary decimal) pixels, to generate a total of N×M×L position-shifted images.
A performance indexing device according to a fourth aspect, corresponding to an embodiment, is the performance indexing device according to any one of the first to third aspects, wherein, when processing the plurality of images to be input to the object detection model, the model preprocessing means further uses, as the various processing parameters, P (an arbitrary integer) kinds of contrast correction curves or gradation conversion curves to generate images whose luminance levels are changed to arbitrary values.
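One common way to realize such a gradation conversion curve is an 8-bit lookup table; the sketch below uses a gamma curve as an assumed example of the P kinds of curves, with a placeholder image for the usage line.

```python
import numpy as np

def apply_tone_curve(img: np.ndarray, gamma: float) -> np.ndarray:
    """Apply one gradation conversion curve to an 8-bit image via a lookup
    table; a gamma curve stands in here for the P kinds of curves."""
    lut = ((np.arange(256) / 255.0) ** gamma * 255.0 + 0.5).astype(np.uint8)
    return lut[img]  # fancy indexing maps every pixel through the curve

# e.g. P = 3 luminance variants of one preprocessed image
img = np.full((416, 416, 3), 128, dtype=np.uint8)  # placeholder image
variants = [apply_tone_curve(img, g) for g in (0.5, 1.0, 2.0)]
```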
A performance indexing device according to a fifth aspect, corresponding to an embodiment, is the performance indexing device according to any one of the first to fourth aspects, wherein, when processing the plurality of images to be input to the object detection model, the model preprocessing means further uses, as the various processing parameters, Q (an arbitrary integer) kinds of aspect ratios to generate images whose aspect ratios are changed.
A performance indexing device according to a sixth aspect, corresponding to an embodiment, is the performance indexing device according to any one of the first to fifth aspects, wherein, when processing the plurality of images to be input to the object detection model, the model preprocessing means further uses, as the various processing parameters, R (an arbitrary integer) kinds of angles to generate images whose rotation angles are changed.
A performance indexing device according to a seventh aspect, corresponding to an embodiment, is the performance indexing device according to any one of the first to sixth aspects, wherein, when processing the plurality of images to be input to the object detection model, the model preprocessing means generates each image by pasting the average luminance level of the valid image into the margin areas, produced by the processing, in which no valid image exists.
A performance indexing device according to an eighth aspect, corresponding to an embodiment, is the performance indexing device according to any one of the first to seventh aspects, wherein the model post-processing means comprises an individual identification means that, for the position information including zero or more first detection frames, which may include non-detections and false detections, and the first likelihood information output by the object detection model for each of the one or more detected objects present in one of the plurality of images, performs correction into the position information including the second detection frame and the second likelihood information of maximum likelihood for each detected object, using an arbitrary threshold T (an arbitrary decimal) for the first likelihood information and an arbitrary threshold U (an arbitrary decimal) for the IOU (Intersection over Union) value, an index indicating how much the regions of the position information including the first detection frames overlap one another.
A performance indexing device according to a ninth aspect, corresponding to an embodiment, is the performance indexing device according to any one of the first to eighth aspects, wherein the model post-processing means has a function of, when position information including a correct detection frame and class identification information exist for each detected object, correcting the position information including the correct detection frame according to the contents of the various processing parameters, and comprises an individual identification means that, for the position information including zero or more first detection frames, which may include non-detections and false detections, and the first likelihood information output by the object detection model for each of the one or more detected objects present in one of the plurality of images, performs correction into the position information including the second detection frame and the second likelihood information of maximum likelihood for each detected object, using an arbitrary threshold T (an arbitrary decimal) for the first likelihood information and an arbitrary threshold U (an arbitrary decimal) for the IOU (Intersection over Union) value, an index indicating how much the region of the position information including the correct detection frame and the region of the position information including the first detection frame overlap.
A performance indexing device according to a tenth aspect, corresponding to an embodiment, is the performance indexing device according to the eighth or ninth aspect, wherein the model post-processing means individually links, for each detected object, the various processing parameters used for processing the plurality of images by the model preprocessing means with the output results of the individual identification means, and outputs them to the robustness verification means.
A performance indexing device according to an eleventh aspect, corresponding to an embodiment, is the performance indexing device according to any one of the second to tenth aspects citing the second aspect, or any one of the third to tenth aspects citing the third aspect, wherein the robustness verification means comprises a probability statistics calculation means that, based on the likelihoods in the position information including the second detection frame and the second likelihood information, which are the output results of the model post-processing means, calculates, for each of the various processing parameters, any or all of: a likelihood distribution indicating the variation accompanying the position shifts for each detected object; an average likelihood, which is the average value over the valid region of the likelihood; a histogram of the likelihood; a standard deviation of the likelihood, which is the standard deviation over the valid region of the likelihood; a maximum likelihood, which is the maximum value over the valid region of the likelihood; a minimum likelihood, which is the minimum value over the valid region of the likelihood; and an IOU value corresponding to the likelihood.
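A minimal sketch of such a probability statistics calculation over the per-shift likelihoods of one detected object, assuming NaN marks positions excluded from the valid region (for example, the object partly out of frame); the function name and bin count are illustrative.

```python
import numpy as np

def likelihood_statistics(likelihoods: np.ndarray, bins: int = 20):
    """Statistics over the valid region of an N x M array of second
    likelihoods for one detected object; NaN entries are excluded.

    Assumes at least one valid (non-NaN) entry exists.
    """
    valid = likelihoods[~np.isnan(likelihoods)]
    hist, edges = np.histogram(valid, bins=bins, range=(0.0, 1.0))
    return {
        "mean": float(valid.mean()),   # average likelihood
        "std": float(valid.std()),     # standard deviation of likelihood
        "max": float(valid.max()),     # maximum likelihood
        "min": float(valid.min()),     # minimum likelihood
        "histogram": hist,
        "bin_edges": edges,
    }
```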
A performance indexing device according to a twelfth aspect, corresponding to an embodiment, is the performance indexing device according to any one of the second to eleventh aspects citing the second aspect, or any one of the third to eleventh aspects citing the third aspect, wherein, when position information including a correct detection frame and correct class identification information exist for each detected object, the robustness verification means comprises the probability statistics calculation means that, based on the IOU value between the position information including the second detection frame, which is an output result of the model post-processing means, and the position information including the correct detection frame, and on the class identification accuracy rate calculated from the class identification information in the second likelihood information and the correct class identification information, calculates, for each of the various processing parameters, any or all of: an IOU distribution and a class identification accuracy rate distribution indicating the variation, accompanying the position shifts, of the IOU value and the class identification accuracy rate for each detected object; an average IOU value and an average class identification accuracy rate, which are the average values over the valid regions of the IOU value and the class identification accuracy rate; a histogram of the IOU value and a histogram of the class identification accuracy rate; a standard deviation of the IOU value and a standard deviation of the class identification accuracy rate, which are the standard deviations over the valid regions of the IOU value and the class identification accuracy rate; a maximum IOU value and a maximum class identification accuracy rate, which are the maximum values over those valid regions; and a minimum IOU value and a minimum class identification accuracy rate, which are the minimum values over those valid regions.
A performance indexing device according to a thirteenth aspect, corresponding to an embodiment, is the performance indexing device according to the eleventh or twelfth aspect citing the eleventh aspect, wherein the robustness verification means further comprises a learning reinforcement item extraction means provided with any or all of the following, for each of the various processing parameters: extraction of positions or regions in the likelihood distribution for each detected object that fall at or below an arbitrary threshold; extraction of detected objects whose average likelihood is at or below an arbitrary threshold; extraction of detected objects whose standard deviation of likelihood is at or above an arbitrary threshold; extraction of detected objects whose maximum likelihood is at or below an arbitrary threshold; extraction of detected objects whose minimum likelihood is at or below an arbitrary threshold; and extraction of detected objects whose IOU value is at or below an arbitrary threshold.
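As a sketch of the first of these extractions, the following picks out the shift positions in the N×M likelihood distribution that fall at or below a threshold, again assuming NaN marks excluded positions; the function name is illustrative.

```python
import numpy as np

def weak_positions(likelihood_map: np.ndarray, threshold: float):
    """Shift positions in the N x M likelihood distribution at or below
    the threshold; NaN entries compare False and are thus excluded."""
    with np.errstate(invalid="ignore"):  # silence NaN-comparison warnings
        ys, xs = np.where(likelihood_map <= threshold)
    return list(zip(xs.tolist(), ys.tolist()))  # (horizontal, vertical) indices
```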
A performance indexing device according to a fourteenth aspect, corresponding to an embodiment, is the performance indexing device according to the twelfth or thirteenth aspect citing the twelfth aspect, wherein the robustness verification means further comprises a learning reinforcement item extraction means provided with any or all of the following, for each of the various processing parameters: extraction of positions or regions in the IOU distribution for each detected object that fall at or below an arbitrary threshold; extraction of positions or regions in the class identification accuracy rate distribution that fall at or below an arbitrary threshold; extraction of detected objects whose average IOU value is at or below an arbitrary threshold; extraction of detected objects whose average class identification accuracy rate is at or below an arbitrary threshold; extraction of detected objects whose standard deviation of IOU value is at or above an arbitrary threshold; extraction of detected objects whose standard deviation of class identification accuracy rate is at or above an arbitrary threshold; extraction of detected objects whose maximum IOU value is at or below an arbitrary threshold; extraction of detected objects whose maximum class identification accuracy rate is at or below an arbitrary threshold; extraction of detected objects whose minimum IOU value is at or below an arbitrary threshold; and extraction of detected objects whose minimum class identification accuracy rate is at or below an arbitrary threshold.
A performance indexing device according to a fifteenth aspect, corresponding to an embodiment, is the performance indexing device according to the fourteenth aspect, wherein the probability statistics calculation means and the learning reinforcement item extraction means of the robustness verification means are provided with a function of excluding from the calculation targets, in probability statistics calculations based on the likelihood, the IOU value, and the class identification accuracy rate, images in which an arbitrary proportion of the pixels related to the target detected object are missing.
A performance indexing device according to a sixteenth aspect, corresponding to an embodiment, is the performance indexing device according to any one of the thirteenth to fifteenth aspects, wherein, when it is determined from analysis based on the output of the probability statistics calculation means that the performance of the model learning dictionary is insufficient, learning images are prepared based on the results of the learning reinforcement item extraction means, and the model learning dictionary is retrained by an internal or external dictionary learning means.
A performance indexing device according to a seventeenth aspect, corresponding to an embodiment, is the performance indexing device according to any one of the first to sixteenth aspects, wherein the object detection model is a neural network including a model learning dictionary created by deep learning.
A performance indexing method according to an eighteenth aspect of the present invention is a method for performing performance indexing for an object detection model, comprising: an image processing step of acquiring an image and processing it appropriately; a model preprocessing step of processing the image acquired in the image processing step into a plurality of images according to various processing parameters; an object detection model, including a model learning dictionary, that infers object positions and likelihoods (degrees of certainty) for the plurality of images processed in the model preprocessing step; a model post-processing step of correcting, based on the inference results of the object detection model, position information including a first detection frame and first likelihood information for each detected object in the plurality of images into position information including a second detection frame and second likelihood information having appropriate values; and a robustness verification step of verifying the robustness of the object detection model based on the position information including the second detection frame and the second likelihood information, which are the output results of the model post-processing step, and on the various processing parameters, the method executing each of the aforementioned means as a step.
A performance indexing program according to a nineteenth aspect of the present invention is a program for causing a computer to execute performance indexing for an object detection model, comprising: an image processing step of acquiring an image and processing it appropriately; a model preprocessing step of processing the image acquired in the image processing step into a plurality of images according to various processing parameters; an object detection model, including a model learning dictionary, that infers object positions and likelihoods (degrees of certainty) for the plurality of images processed in the model preprocessing step; a model post-processing step of correcting, based on the inference results of the object detection model, position information including a first detection frame and first likelihood information for each detected object in the plurality of images into position information including a second detection frame and second likelihood information having appropriate values; and a robustness verification step of verifying the robustness of the object detection model based on the position information including the second detection frame and the second likelihood information, which are the output results of the model post-processing step, and on the various processing parameters, the program causing a computer to execute each of the aforementioned means and steps so as to make them function.
Note that these comprehensive or specific aspects may be realized by a system, a device, an integrated circuit, a computer program, or a computer-readable recording medium such as a CD-ROM, or by any combination of a system, a device, an integrated circuit, a computer program, and a recording medium.
According to the present invention, when performing performance indexing for an object detection model, the image from the image processing means is used as a base image, and the model preprocessing means generates a total of N×M position-shifted images using, as the various processing parameters, position shifts of N (an arbitrary integer) steps in the horizontal direction and M (an arbitrary integer) steps in the vertical direction in increments of S (an arbitrary decimal) pixels. After the object detection model infers position information including a detection frame and likelihood information for each of the plurality of images, the model post-processing means corrects them so that individual identification is possible, and the robustness verification means checks the likelihood distribution with respect to the position of each detected object. This makes it possible to extract the characteristic whereby the likelihood fluctuates with variations in the position of the detected object on the screen owing to the potential problems of the object detection model, and thus to accurately extract the issues concerning inference accuracy and performance that the neural network itself, including the DNN model within the object detection model, potentially has. Furthermore, since techniques and schemes for solving these issues can be formulated effectively, the detection accuracy and detection performance of the object detection model can be improved.
According to the present invention, furthermore, the robustness verification means comprises the probability statistics calculation means that calculates any or all of the likelihood distribution indicating the variation accompanying the position shifts for each detected object, the average likelihood, which is the average value over the valid region of the likelihood, the histogram of the likelihood, the standard deviation of the likelihood, which is the standard deviation over the valid region of the likelihood, the maximum likelihood, which is the maximum value over the valid region of the likelihood, the minimum likelihood, which is the minimum value over the valid region of the likelihood, and the IOU value corresponding to the likelihood. This makes it possible to extract the characteristic whereby the likelihood fluctuates with variations in the position of the detected object on the screen owing to the potential problems of the object detection model. Therefore, the issues concerning inference accuracy and performance that the neural network itself, including the DNN model within the object detection model, potentially has can be extracted more accurately. Furthermore, since techniques and schemes for solving these issues can be formulated more effectively, the detection accuracy and detection performance of the object detection model can be further improved. In addition, when this is combined with the various processing parameters other than position shift, the weaknesses in versatility and robustness against various variation conditions originating in a model learning dictionary created by deep learning or the like, and the corresponding reinforcement policies, can be separated from the potential problems of the neural network itself, including the DNN model, and grasped accurately. Therefore, effective learning image data and supervised data can be applied in deep learning or the like, making it possible to enhance the versatility and robustness of the model learning dictionary.
According to the present invention, when position information including a correct detection frame and correct class identification information exist, the robustness verification means further comprises the probability statistics calculation means that calculates any or all of the IOU distribution and the class identification accuracy rate distribution indicating the variation accompanying the position shifts for each detected object, the average IOU value and the average class identification accuracy rate, the histogram of the IOU value and the histogram of the class identification accuracy rate, the standard deviation of the IOU value and the standard deviation of the class identification accuracy rate, the maximum IOU value and the maximum class identification accuracy rate, and the minimum IOU value and the minimum class identification accuracy rate. This makes it possible to extract the characteristic whereby the position information including the detection frame and the class identification information fluctuate with variations in the position of the detected object on the screen owing to the potential problems of the object detection model. Therefore, the issues concerning inference accuracy and performance that the neural network itself, including the DNN model within the object detection model, potentially has can be extracted more accurately. Furthermore, since techniques and schemes for solving these issues can be formulated more effectively, the detection accuracy and detection performance of the object detection model can be further improved. In addition, when this is combined with the various processing parameters other than position shift, the weaknesses in versatility and robustness against various variation conditions originating in a model learning dictionary created by deep learning or the like, and the corresponding reinforcement policies, can be separated from the potential problems of the neural network itself, including the DNN model, and grasped accurately. Therefore, effective learning image data and supervised data can be applied in deep learning or the like, making it possible to enhance the versatility and robustness of the model learning dictionary.
According to the present invention, furthermore, the probability statistics calculation means and the learning reinforcement item extraction means of the robustness verification means are provided with a function of excluding from the calculation targets, in probability statistics calculations based on the likelihood, the IOU value, and the class identification accuracy rate, images in which an arbitrary proportion of the pixels related to the target detected object are missing. This makes it possible to verify the performance and characteristics of the object detection model and the versatility and robustness of the model learning dictionary accurately even in cases where part of the valid range of the detection target object is lost depending on the position of the object in the image serving as the reference for verification, or on the position of the object after processing with the various processing parameters of the model preprocessing means. Therefore, it becomes possible to improve the DNN model with respect to detected object size and to enhance the versatility and robustness of the model learning dictionary.
According to the present invention, furthermore, the model preprocessing means generates, as the various processing parameters, images enlarged or reduced using L (an arbitrary integer) kinds of arbitrary magnifications and then generates the aforementioned position-shifted images, so that the robustness verification means comprising the probability statistics calculation means can check, for each of the L kinds of sizes, the likelihood distribution with respect to the position of each detected object, together with the average likelihood over the valid region of the likelihood, the histogram of the likelihood, the standard deviation of the likelihood, the maximum likelihood, the minimum likelihood, and the IOU value. Furthermore, for each of the L kinds of sizes, the distribution, histogram, standard deviation, maximum value, and minimum value of the IOU value with respect to the position of each detected object, and the distribution, histogram, standard deviation, maximum value, and minimum value of the class identification accuracy rate, can be checked. Therefore, it becomes possible to improve the DNN model with respect to detected object size and to enhance the versatility and robustness of the model learning dictionary.
Furthermore, the model preprocessing means generates, as the various processing parameters, images whose luminance levels are changed to arbitrary values using P (an arbitrary integer) kinds of contrast correction curves or gradation conversion curves, so that the robustness verification means comprising the probability statistics calculation means can check, for each of the P kinds of contrast correction curves or gradation conversion curves, the likelihood distribution with respect to the position of each detected object, together with the average likelihood over the valid region of the likelihood, the histogram of the likelihood, the standard deviation of the likelihood, the maximum likelihood, the minimum likelihood, and the IOU value. Furthermore, for each of the P kinds of contrast correction curves or gradation conversion curves, the distribution, histogram, standard deviation, maximum value, and minimum value of the IOU value with respect to the position of each detected object, and the distribution, histogram, standard deviation, maximum value, and minimum value of the class identification accuracy rate, can be checked. Therefore, it becomes possible to improve the DNN model and to enhance the versatility and robustness of the model learning dictionary with respect to the luminance levels of the detected object and the background, which change with weather conditions, the time of shooting, and the illuminance conditions of the shooting environment.
Furthermore, the model preprocessing means generates, as the various processing parameters, images whose aspect ratios are changed using Q (an arbitrary integer) kinds of aspect ratios, so that the robustness verification means comprising the probability statistics calculation means can check, for each of the Q kinds of aspect ratios, the likelihood distribution with respect to the position of each detected object, together with the average likelihood over the valid region of the likelihood, the histogram of the likelihood, the standard deviation of the likelihood, the maximum likelihood, the minimum likelihood, and the IOU value. Furthermore, for each of the Q kinds of aspect ratios, the distribution, histogram, standard deviation, maximum value, and minimum value of the IOU value with respect to the position of each detected object, and the distribution, histogram, standard deviation, maximum value, and minimum value of the class identification accuracy rate, can be checked. Therefore, it becomes possible to improve the DNN model for the various aspect ratios of detected objects and to enhance the versatility and robustness of the model learning dictionary.
Furthermore, the model preprocessing means generates images with changed rotation angles using R (an arbitrary integer) kinds of angles as one of the various processing parameters. The robustness verification means equipped with the probability statistics calculation means can thereby confirm, for each of the R angles, the likelihood distribution with respect to the position of each detected object, the average likelihood over the effective likelihood region, the likelihood histogram, the standard deviation of the likelihood, the maximum likelihood, the minimum likelihood, and the IOU value. It can further confirm, for each of the R angles, the distribution, histogram, standard deviation, maximum, and minimum of the IOU values with respect to the position of each detected object, and the distribution, histogram, standard deviation, maximum, and minimum of the class identification accuracy rate. It therefore becomes possible to improve the DNN model for the various inclinations of detected objects and to strengthen the versatility and robustness of the model learning dictionary.
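As a concrete illustration of the statistics listed above, the following is a minimal Python sketch of the per-parameter reduction the robustness verification means is described as performing; the function name, the return layout, and the bin count are illustrative assumptions, not part of the claimed means.

```python
import numpy as np

def likelihood_statistics(likelihoods, bins=10):
    """Summary statistics for one processing-parameter setting (e.g. one of
    the L sizes, P curves, Q aspect ratios, or R angles): average likelihood,
    histogram, standard deviation, maximum, and minimum.
    likelihoods: 1-D array of per-position likelihoods for one detected object."""
    arr = np.asarray(likelihoods, dtype=float)
    hist, edges = np.histogram(arr, bins=bins, range=(0.0, 1.0))
    return {
        "mean": arr.mean(),
        "std": arr.std(),
        "max": arr.max(),
        "min": arr.min(),
        "histogram": hist,
        "bin_edges": edges,
    }

# The same reduction can be applied to IOU values or to class identification
# accuracy rates, once per parameter setting and detected object.
```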
Furthermore, the model post-processing means, through a series of means that individually associates each output result and the various processing parameters with each detected object and outputs them to the robustness verification means, makes it possible to extract, for each processing parameter, the characteristic that the likelihood fluctuates with fluctuation of the detected object's position in the screen because of latent issues of the object detection model. It therefore becomes possible to extract more accurately the latent issues concerning inference accuracy and performance of the neural network itself, including the DNN model, within the object detection model.
Furthermore, by preparing training images based on the results of the robustness verification means equipped with the learning reinforcement item extraction means and retraining the model learning dictionary with the built-in or external dictionary learning means, when position shifts within an arbitrary range near the detected object are combined with the other various processing parameters (position of the object in the screen, such as left/right, up/down, and depth; object size; contrast; gradation; aspect ratio; rotation; and so on), the weaknesses in versatility and in robustness against various fluctuation conditions that are attributable to the model learning dictionary created by deep learning or the like, and the policies for reinforcing them, can be grasped accurately, separated from the latent issues of the neural network itself including the DNN model. Effective training image data and supervised data can therefore be applied by deep learning or the like, making it possible to strengthen the versatility and robustness of the model learning dictionary.
According to the present invention, when the model preprocessing means processes the plurality of images to be input to the object detection model, it generates each image by pasting the average brightness level of the valid image into the margin areas, produced by the processing, in which no valid image exists. This reduces the influence that the feature quantities of the margin areas have on the inference accuracy of the object detection model, so the likelihood distribution with respect to the position of each detected object, the average likelihood over the effective likelihood region, the likelihood histogram, the standard deviation of the likelihood, the maximum likelihood, the minimum likelihood, and the IOU value can be calculated more accurately. The distribution, histogram, standard deviation, maximum, and minimum of the IOU values with respect to the position of each detected object, and the distribution, histogram, standard deviation, maximum, and minimum of the class identification accuracy rate, can also be confirmed more accurately. It therefore becomes possible to improve the DNN model and to strengthen the versatility and robustness of the model learning dictionary more accurately.
According to the present invention, the model post-processing means further has an individual identification means that corrects, for each detected object present in an image, the position information including zero or more first detection frames, including non-detections and false detections, and the first likelihood information into position information including the maximum-likelihood second detection frame and second likelihood information for each detected object, using an arbitrary threshold T (an arbitrary decimal) for the first likelihood information and an arbitrary threshold U (an arbitrary decimal) for the IOU (Intersection over Union) value, an index representing how much the regions of the position information including the mutual first detection frames overlap. Since abnormal data can thereby be eliminated and the position information including the detection frame and the likelihood information can be corrected into suitable information for each detected object, the likelihood distribution with respect to the position of each detected object, the average likelihood over the effective likelihood region, the likelihood histogram, the standard deviation of the likelihood, the maximum likelihood, the minimum likelihood, and the IOU value can be calculated more accurately. The distribution, histogram, standard deviation, maximum, and minimum of the IOU values with respect to the position of each detected object, and the distribution, histogram, standard deviation, maximum, and minimum of the class identification accuracy rate, can also be confirmed more accurately. It therefore becomes possible to improve the DNN model and to strengthen the versatility and robustness of the model learning dictionary more accurately.
According to the present invention, when position information including the correct detection frame and class identification information exist for each detected object, the model post-processing means further has a function of correcting the position information including the correct detection frame according to the contents of the various processing parameters, and has an individual identification means that corrects, for each detected object present in an image, the position information including zero or more first detection frames, including non-detections and false detections, and the first likelihood information into position information including the maximum-likelihood second detection frame and second likelihood information for each detected object, using an arbitrary threshold T (an arbitrary decimal) for the first likelihood information and an arbitrary threshold U (an arbitrary decimal) for the IOU (Intersection over Union) value, an index representing how much the region of the position information including the correct detection frame and the region of the position information including the first detection frame overlap. Since abnormal data can thereby be eliminated and the position information including the detection frame and the likelihood information can be corrected into optimal information for each detected object, the likelihood distribution with respect to the position of each detected object, the average likelihood over the effective likelihood region, the likelihood histogram, the standard deviation of the likelihood, the maximum likelihood, the minimum likelihood, and the IOU value can be calculated accurately by comparison with the correct data. The distribution, histogram, standard deviation, maximum, and minimum of the IOU values with respect to the position of each detected object, and the distribution, histogram, standard deviation, maximum, and minimum of the class identification accuracy rate, can also be confirmed accurately by comparison with the correct data. It therefore becomes possible to improve the DNN model and to strengthen the versatility and robustness of the model learning dictionary more accurately.
Furthermore, the IOU value, Precision, Recall, AP value, and mAP value, which are indices of overall and average inference accuracy and performance using validation data, can be calculated more accurately, improving the accuracy of indexing of the object detection model 300 and the model learning dictionary 320 as a whole.
According to the present invention, the robustness verification means further has a learning reinforcement item extraction means comprising any or all of, for each of the various processing parameters: extraction of positions or regions where the likelihood distribution for each detected object falls to or below an arbitrary threshold; extraction of detected objects whose average likelihood is at or below an arbitrary threshold; extraction of detected objects whose likelihood standard deviation is at or above an arbitrary threshold; extraction of detected objects whose maximum likelihood is at or below an arbitrary threshold; and extraction of detected objects whose minimum likelihood is at or below an arbitrary threshold. The weaknesses in versatility and in robustness against various fluctuation conditions that are attributable to the model learning dictionary created by deep learning or the like, and the policies for reinforcing them, can thereby be grasped more accurately, separated from the latent issues of the neural network itself including the DNN model. Effective training image data and supervised data can therefore be applied by deep learning or the like, making it possible to strengthen the versatility and robustness of the model learning dictionary.
According to the present invention, the robustness verification means further has a learning reinforcement item extraction means comprising any or all of, for each of the various processing parameters: extraction of positions or regions where the IOU distribution for each detected object falls to or below an arbitrary threshold; extraction of positions or regions where the class identification accuracy rate distribution for each detected object falls to or below an arbitrary threshold; extraction of detected objects whose average IOU value is at or below an arbitrary threshold; extraction of detected objects whose average class identification accuracy rate is at or below an arbitrary threshold; extraction of detected objects whose IOU standard deviation is at or above an arbitrary threshold; extraction of detected objects whose class identification accuracy rate standard deviation is at or above an arbitrary threshold; extraction of detected objects whose maximum IOU value is at or below an arbitrary threshold; extraction of detected objects whose minimum IOU value is at or below an arbitrary threshold; and extraction of detected objects whose minimum class identification accuracy rate is at or below an arbitrary threshold. Based on the position information including the detection frames and the class identification information, the weaknesses in versatility and in robustness against various fluctuation conditions that are attributable to the model learning dictionary created by deep learning or the like, and the policies for reinforcing them, can thereby be grasped more accurately, separated from the latent issues of the neural network itself including the DNN model. Effective training image data and supervised data can therefore be applied by deep learning or the like, making it possible to strengthen the versatility and robustness of the model learning dictionary.
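As a rough illustration of the extraction described above, the following hedged Python sketch flags detected objects whose likelihood statistics cross thresholds; the function name, the threshold values, and the data layout (a mapping from object identifier to the statistics dictionary of the earlier sketch) are assumptions for illustration only.

```python
def extract_reinforcement_items(stats_by_object, mean_th=0.6, std_th=0.2,
                                max_th=0.7, min_th=0.3):
    """Flag detected objects whose likelihood statistics suggest the model
    learning dictionary needs reinforcement: average at or below a threshold,
    standard deviation at or above a threshold, maximum at or below a
    threshold, or minimum at or below a threshold."""
    flagged = {}
    for obj_id, s in stats_by_object.items():
        reasons = []
        if s["mean"] <= mean_th:
            reasons.append("low average likelihood")
        if s["std"] >= std_th:
            reasons.append("unstable likelihood (high standard deviation)")
        if s["max"] <= max_th:
            reasons.append("low maximum likelihood")
        if s["min"] <= min_th:
            reasons.append("low minimum likelihood")
        if reasons:
            flagged[obj_id] = reasons
    return flagged
```

The same pattern extends directly to the IOU and class identification accuracy rate statistics listed above.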
Embodiments of the present invention will be described below with reference to the drawings. Similar components are given the same reference numerals, and repeated descriptions of them are omitted.
(Embodiment 1)
FIG. 1 is a block diagram showing a performance indexing device 10 for object detection in images according to Embodiment 1 of the present invention.
Note that each means, each function, and each process described in Embodiment 1 below may each be replaced with a step, and each device with a method. Each means and each device described in Embodiment 1 may also be realized as a program executed by a computer.
The image processing means 100, which acquires and appropriately processes images, mainly comprises a lens 101; an image sensor 102, a device that receives the light emitted from an object through the lens and converts the brightness of the light into electrical information; and an image processing processor 103 equipped with a black level adjustment function, an HDR (high dynamic range) composition function, a gain adjustment function, an exposure adjustment function, a defective pixel correction function, a shading correction function, a white balance function, a color correction function, a gamma correction function, a local tone mapping function, and the like. Functions other than those described above may also be provided. The lens 101 may be, for example, a standard zoom lens, a wide-angle zoom lens, or a fisheye lens, depending on the object detection application. Within the environment in which the detection target is photographed, time-series fluctuation conditions such as illuminance are detected and controlled by the various functions installed in the image processing processor 103, and image processing is applied that makes the object to be detected easier to see or easier to find while suppressing the fluctuations.
The image generated by the image processing means 100 is input to the image output control means 110 and transmitted to the display and data storage means 120, such as a monitor device, external memory of a PC (personal computer) or the like, or a cloud server. The image output control means 110 may, for example, have a function of transmitting image data in accordance with the horizontal and vertical synchronization signals of the display and data storage means 120. The image output control means 110 may also have a function of referring to the position information 401 including the second detection frame and the second likelihood information 402, which are output results of the model post-processing means 400, and superimposing frame renderings and likelihood information on the output image so as to mark the detected objects. Alternatively, the position information 401 including the second detection frame and the second likelihood information 402 may be transmitted directly to the display and data storage means 120 by a serial communication function, a parallel communication function, or a UART that converts between the two.
Meanwhile, to perform object detection with the object detection model 300, the image data generated by the image processing means 100 is input to the model preprocessing means 200 and processed into a model input image 210 suitable for input to the object detection model 300. Here, if the object detection model 300 is a model that performs object detection using image data containing only brightness levels, the image for object detection generated by the image processing means 100 may be one converted into luminance data having only brightness levels; if the object detection model 300 is a model that performs object detection using color image data containing color information, the image for object detection generated by the image processing means 100 may be color image data having pixels such as RGB. In Embodiment 1, as an example, the case is described in which the object detection model 300 is a model that performs object detection using image data of only brightness levels, and the image for object detection generated by the image processing means 100 is converted into luminance data having only brightness levels.
The model preprocessing means 200 may be configured with electronic circuits such as adders, subtractors, multipliers, dividers, and comparators, or may be realized by functions (libraries) such as an affine transformation function 291 and a projective transformation function 292, a distortion correction table 293 for converting an image taken with a fisheye lens into the equivalent of the human visual field, and an image processing processor 290 comprising a CPU or an arithmetic processor. The image processing processor 290 may be substituted by the image processing processor 103 of the image processing means 100. Using the above-described affine transformation function 291, projective transformation function 292, image processing processor 290, or electronic circuits, the model preprocessing means 200 may include some or all of: a function of cutting out a specific region; a position shift function 220 for shifting the image to an arbitrary position in the horizontal and vertical directions when cutting out a specific region; a resize function 230 for enlarging or reducing to an arbitrary magnification; a rotation function 240 for rotating the image to an arbitrary angle; an aspect ratio change function 250 for arbitrarily deforming the ratio between the horizontal and vertical directions; a gradation conversion function 260 for changing the brightness level along an arbitrary curve; a dewarp function 270 for performing distortion correction, cylindrical conversion, and the like; and a margin padding function 280 for padding regions in which no valid pixels exist with an arbitrary brightness level. For performance indexing of the object detection model 300, the model preprocessing means 200 takes the image data generated by the image processing means 100 as a reference image, processes it into a plurality of model input images 210 with various processing parameters 510 according to the purpose of the performance indexing, and outputs them to the object detection model 300; its usage and operation are explained in the description of the robustness verification means 500 below. In Embodiment 1, as an example, the case is described in which the object detection model 300 performs object detection using image data of only brightness levels, and the model input image 210 for object detection generated by the model preprocessing means 200 is converted into luminance data having only brightness levels.
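As an illustration of the position shift, resize, and margin padding functions described above, the following is a minimal sketch assuming OpenCV and 2-D grayscale luminance images; the function name, the parameter choices, and the shift grid in the usage note are illustrative assumptions, not part of the claimed means.

```python
import cv2
import numpy as np

def make_shifted_inputs(reference_img, shifts, scale=1.0):
    """Generate position-shifted (and optionally resized) model input images.

    Margins created by the shift contain no valid pixels and are padded with
    the mean luminance of the valid image, as described for the margin
    padding function. reference_img: 2-D uint8 luminance image;
    shifts: list of (dx, dy) pixel offsets."""
    h, w = reference_img.shape
    mean_level = float(reference_img.mean())  # padding level for empty margins
    inputs = []
    for dx, dy in shifts:
        # 2x3 affine matrix combining isotropic scaling and a translation.
        m = np.float32([[scale, 0, dx], [0, scale, dy]])
        shifted = cv2.warpAffine(
            reference_img, m, (w, h),
            flags=cv2.INTER_LINEAR,
            borderMode=cv2.BORDER_CONSTANT,
            borderValue=mean_level)
        inputs.append(shifted)
    return inputs

# Example: a 3x3 grid of one-pixel shifts around the reference position.
# shifts = [(dx, dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
```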
The image processed by the model preprocessing means 200 is input to the object detection model 300, and inference (prediction) detects where the target object is located and identifies which class the object corresponds to, such as person or vehicle (class identification). As a result, the object detection model 300 outputs, for each detected object present in one image, position information 301 including zero or more first detection frames, including non-detections and false detections, and first likelihood information 302. Here, the position information 301 including the first detection frame is, for example, information including the center coordinates, horizontal width, and vertical height of the detection frame, and the first likelihood information 302 is, for example, a likelihood indicating detection accuracy and class identification information.
The object detection model 300 comprises, for example, a model learning dictionary 320 and a deep neural network (DNN) model 310 using a convolutional neural network (CNN), an AI (artificial intelligence) modeled on the neurons of the human brain. The DNN model 310 uses, for example, YOLO (see, e.g., Non-Patent Document 1) or SSD, models with a strong advantage in detection processing speed. When detection accuracy is prioritized, Faster R-CNN, EfficientDet, or the like may be used instead. When class identification is performed mainly, without detecting the position of the object, MobileNet or the like may be used.
FIG. 2 shows a schematic configuration of the artificial neuron model 330 and the neural network 340 that form the basic structure of the CNN described above. As shown in FIG. 2 and (Equation 3), the artificial neuron model 330 receives the output signals of one or more neurons X0, X1, ..., Xm, and generates an output to the next neuron by passing the sum of the products with the respective weight coefficients W0, W1, ..., Wm through an activation function 350. b is a bias (offset).
$$y = f\left(\sum_{i=0}^{m} W_i X_i + b\right) \qquad \text{(Equation 3)}$$
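The computation of (Equation 3) can be illustrated with a short Python sketch; the function name and the example values are illustrative only.

```python
import numpy as np

def artificial_neuron(x, w, b, activation):
    """Forward computation of (Equation 3): the weighted sum of the input
    signals X0..Xm and the weight coefficients W0..Wm, plus the bias b,
    passed through the activation function f."""
    return activation(np.dot(np.asarray(w), np.asarray(x)) + b)

# Example with a ReLU-style activation:
# y = artificial_neuron([0.5, -1.0, 2.0], [0.3, 0.8, -0.1], 0.1,
#                       lambda v: max(0.0, v))
```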
A collection of many of these artificial neuron models is the neural network 340. The neural network 340 comprises an input layer, intermediate layers, and an output layer, and the output of each artificial neuron model 330 is input to each artificial neuron model 330 of the next stage. The artificial neuron model 330 may be realized by hardware such as electronic circuits, or by an arithmetic processor and a program. For example, the weight coefficients of each artificial neuron model 330 are calculated as dictionary data by deep learning. The dictionary data, that is, the model learning dictionary 320 shown in FIG. 1, is a collection of the weight coefficient data of the DNN model 310 configured by the neural network 340; in the case of the DNN model 310, it is initially trained or retrained by the dictionary learning means 600 described later.
Next, the activation function 350 will be described. It is known that the activation function 350 must be a nonlinear transformation, since repeating linear transformations still yields only a linear transformation. As the activation function 350, a step function that simply discriminates between "0" and "1", a sigmoid function 351, or a ramp function is used; however, because the sigmoid function 351 increases circuit scale and slows computation depending on the capability of the arithmetic processor, ramp functions such as ReLU (Rectified Linear Unit) 352 have often been used in recent years. ReLU 352 is a function whose output is always 0 when the input is 0 or less, and equal to the input when the input is greater than 0; compared with the sigmoid function 351, vanishing gradients are less likely to occur even when the layers of the neural network 340 become deep, and its simple formula gives it an advantage in processing speed. Leaky ReLU (Leaky Rectified Linear Unit) 353, a derivative of ReLU 352, is also increasingly used because it offers better accuracy than ReLU 352. Leaky ReLU 353 is a function whose output is the input multiplied by α (for example, 0.01 as a base value) when the input is below 0, and equal to the input when the input is above 0. Other activation functions 350 include the softmax function used for class identification of detected objects; a suitable function is chosen according to the application. The softmax function converts multiple output values so that their sum is 1.0 (100%).
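A minimal numpy sketch of the activation functions described above follows; the function names are illustrative, and the α default of 0.01 follows the base value given in the text.

```python
import numpy as np

def relu(x):
    # 0 for inputs <= 0, identity for inputs > 0
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # alpha * x for inputs below 0, identity above 0
    return np.where(x < 0, alpha * x, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Shift by the maximum for numerical stability; outputs sum to 1.0
    e = np.exp(x - np.max(x))
    return e / e.sum()
```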
FIGS. 3A and 3B show an example of the configuration of the YOLO model 360, one of the DNN models 310. The YOLO model 360 shown in FIG. 3A may take, for example, horizontal pixels Xi and vertical pixels Yi as the input image size. Its basic configuration may be Convolution layers 370 to 387, which can compress and extract region-based feature quantities by convolving regions of surrounding pixels through filtering; Pooling layers 390 to 395, which function to absorb positional deviations of the filter shapes in the input image; a fully connected layer; and an output layer. It may also comprise, for example, a first detection layer 361, a second detection layer 362, and a third detection layer 363 for detecting object positions and performing class classification (identification), and Upsampling layers 364 and 365 for upsampling the class classification results using deconvolution. The pixel sizes of the model input image, the Convolution layers, the Pooling layers, the detection layers, the Upsampling layers, and so on, the number and combination of the various layers, and the number and arrangement of the detection layers may be increased, decreased, or changed according to the application.
The Convolution layers 370 to 387 correspond to models of simple cells that respond to a specific shape or various shapes, and are used to recognize objects with complex shapes.
The Pooling layers 390 to 395, on the other hand, correspond to models of complex cells that act to absorb spatial deviations of shapes; they work so that what would otherwise be regarded as a different shape when the position of an object of a given shape shifts can be regarded as the same shape. Combining the Convolution layers 370 to 387 with the Pooling layers 390 to 395 provides robustness against the movement and change of detection objects with various complex shapes, making it possible to improve the accuracy of object detection.
The Upsampling layers 364 and 365 perform class classification on the original image and, by using the results at each layer of the CNN as feature maps through the skip connections shown at 366 and 367 in FIG. 3A, enable finer region identification by, for example, the second detection layer 362 and the third detection layer 363. The skip connections 367 and 366 join networks of the same configuration as the Convolution layers 373 and 374 after the Convolution layers 385 and 381, respectively.
Next, a method of calculating the Confidence score 317 (corresponding to a likelihood), which corresponds to the detection accuracy and certainty of the YOLO model 360 in one embodiment, will be described with reference to FIG. 3B, taking one person as the detection object. When a one-stage DNN model typified by YOLO, which is considered highly advantageous in processing speed, is used, the image region of the model input image 311 is divided into grid cells of arbitrary size (FIG. 3B shows a 7x7 example) in order to detect object positions and identify classes simultaneously. A step 312 of estimating multiple Bounding Boxes and Confidence 313 (Pr(Object) × IOU), and a step 314 of calculating Pr(Class_i|Object) 315, the conditional class probabilities, for each grid cell, are processed in parallel. The two are then multiplied when the Confidence score 317 is calculated in the final detection step 316. Detecting object positions and identifying classes simultaneously in this way makes it possible to improve processing speed. The detection frame 318 of the position information including the first detection frame, shown by the dotted line in the final detection step 316, is the detection frame displayed as the detection result for the person.
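The combination of the two parallel estimates into the Confidence score can be illustrated as follows; the function name and the example numbers are illustrative, not taken from the figures.

```python
import numpy as np

def confidence_scores(pr_object, iou_pred_truth, pr_class_given_object):
    """Class-specific confidence scores as described for YOLO:
    Confidence = Pr(Object) * IOU (313), multiplied by the conditional
    class probabilities Pr(Class_i | Object) (315).
    pr_object, iou_pred_truth: scalars for one predicted box;
    pr_class_given_object: 1-D array of per-class probabilities."""
    box_confidence = pr_object * iou_pred_truth              # Confidence 313
    return box_confidence * np.asarray(pr_class_given_object)  # score 317

# Example: a box with Pr(Object)=0.9 and IOU=0.8, with class probabilities
# [0.7, 0.2, 0.1], yields per-class scores [0.504, 0.144, 0.072].
```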
For each detected object present in one image output from the object detection model 300 shown in FIG. 1, the position information 301 including zero or more first detection frames, including non-detections and false detections, and the first likelihood information 302 are input to the model post-processing means 400 and then corrected into position information 401 including the second detection frame and second likelihood information 402 considered most appropriate for each detected object, by selection based on the mutual IOU values of the position information 301 including the first detection frames, maximum determination of the first likelihood information 302, and the like. The position information 401 including the second detection frame and the second likelihood information 402 are input to the image output control means 110 and the robustness verification means 500. Here, the position information 401 including the second detection frame is, for example, information including the center coordinates, horizontal width, and vertical height of the detection frame, and the second likelihood information 402 is, for example, a likelihood indicating detection accuracy and class identification information.
The IOU value will be described with reference to FIG. 4. The denominator of the expression representing the IOU value 420 in FIG. 4(a) is the Area of Union 422 in (Equation 1) described above, the area of the union of the two frame regions being compared. The numerator of the expression representing the IOU value 420 in FIG. 4(a) is the Area of Intersection 423 in (Equation 1) described above, the area of the intersection of the two frame regions being compared. The maximum is "1.0", indicating that the two frames completely overlap. The larger the IOU value 420 between the position information 301 including the first detection frame, which is the output result of the object detection model 300, and the position information 621 including the correct detection frame described later, the better the object detection. As an example, when detecting a person, if the ground truth BBox 425, the correct frame for the person 424, and the Predicted BBox 426 calculated as a result of inference (prediction) deviate by about 11% in each of the horizontal and vertical directions as shown in FIG. 4(b), the IOU value 427 between the two drops to about 0.65. As can be seen from this point, IOU is often used as one of the indices for sensitively verifying the accuracy and performance of object detection.
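A minimal Python sketch of the IOU computation described above follows, assuming axis-aligned boxes given in (x_min, y_min, x_max, y_max) form; the function name is illustrative.

```python
def iou(box_a, box_b):
    """IOU of two axis-aligned boxes:
    Area of Intersection divided by Area of Union."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    iw = max(0.0, ix_max - ix_min)
    ih = max(0.0, iy_max - iy_min)
    intersection = iw * ih                      # Area of Intersection 423
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection      # Area of Union 422
    return intersection / union if union > 0 else 0.0

# Two equal boxes shifted by about 11% of their width and height relative
# to each other give an IOU of roughly 0.65, as in the example of FIG. 4(b):
# iou((0, 0, 100, 100), (11, 11, 111, 111)) is about 0.656.
```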
The model post-processing means 400 shown in FIG. 1 may be characterized by having an individual identification means 410 that corrects, for each of the one or more detected objects in the output results of the object detection model 300, the position information 301 including zero or more first detection frames, including non-detections and false detections, and the first likelihood information 302 into position information 401 including the maximum-likelihood second detection frame and second likelihood information 402 for each detected object, using an arbitrary threshold T (an arbitrary decimal) for the first likelihood information 302 and an arbitrary threshold U (an arbitrary decimal) for the IOU (Intersection over Union) value, an index representing how much the regions of the position information 301 including the mutual first detection frames overlap.
For example, an example of the processing of the individual identification means 410 of the model post-processing means 400 will be described using the flowchart of FIG. 5A and a model input image 440 in which two persons, as detected objects, are close together front to back as shown in FIG. 5B.
First, as shown in FIG. 5A, in input step S430, the position information 301 including zero or more first detection frames, including non-detections and false detections, and the first likelihood information 302 are input for each detected object. In this example, as shown in FIG. 5B, position information 441, 442, 443, and 444 including four first detection frames output from the object detection model 300 and likelihoods 445, 446, 447, and 448 from within the four pieces of first likelihood information are input.
Next, in setting step S431, the IOU threshold "U" and the likelihood threshold "T" are set. This example shows the case where "U" = 0.7 and "T" = 0.5 are set as the thresholds.
Next, in comparison step S432, the likelihood in the first likelihood information 302 is compared with the threshold "T". If the likelihood is less than the threshold "T" and judged false, in deletion step S433 the corresponding position information 301 including the first detection frame and first likelihood information 302 are removed from the calculation targets; if the likelihood is at least the threshold "T", the result is judged true, and in mutual IOU value calculation step S434 the IOU values of all pairwise combinations of the position information 301 including the first detection frames subject to calculation are calculated. In FIG. 5B, the likelihood 446 is 0.33, below the threshold "T" = 0.5, so the position information 442 including the first detection frame falsely detected so as to encompass both persons and the first likelihood information including the likelihood 446 are removed from the calculation targets. Three calculation candidates remain, and the IOU values of the pairwise combinations of the position information 441, 443, and 444 including the respective first detection frames are calculated.
Next, in comparison step S435, all the mutual IOU values are compared with the threshold "U". If a mutual IOU value is less than the threshold "U" and judged false, the results are judged to be independent detection results and are output in output step S437 as position information 401 including the second detection frame and second likelihood information 402; if the mutual IOU value is at least the threshold "U" and judged true, the detections are regarded as duplicate detections of the same detected object, and the process proceeds to the next maximum likelihood determination step S436. In FIG. 5B, the mutual IOU values between the position information 441 including the first detection frame and the other two are below the threshold "U" = 0.7, so the position information 441 including the first detection frame and the first likelihood information including the likelihood 445 (0.85) are output as independent detection information, in output step S437, as position information 451 including the second detection frame and second likelihood information including the likelihood 453 (0.85). Meanwhile, the position information 443 and 444 including the first detection frames are close to each other and their mutual IOU value is judged to be "U" = 0.7 or more, so the process proceeds to the next maximum likelihood determination step S436.
Finally, in maximum likelihood determination step S436, all the applicable candidates other than the one with the maximum likelihood are judged false, and in deletion step S433 the corresponding position information 301 including the first detection frame and first likelihood information 302 are removed from the calculation targets; the applicable candidate with the maximum likelihood is judged true and may be output in output step S437 as position information 401 including the second detection frame and second likelihood information 402. In FIG. 5B, as a result of maximum likelihood determination between the likelihood 447 (0.75) and the likelihood 448 (0.92), the first likelihood information including the likelihood 447 (0.75) and the position information 443 including the first detection frame are removed from the calculation targets, and the first likelihood information including the likelihood 448 (0.92) determined to be the maximum likelihood and the position information 444 including the first detection frame are output in output step S437 as position information 452 including the second detection frame and second likelihood information including the likelihood 454 (0.92).
Note that the higher the likelihood threshold "T", the higher the reliability of the detected information; on the other hand, detection failures may occur, so it is desirable to set it appropriately according to the performance of the object detection model 300.
Note that if the mutual IOU threshold "U" is set low, when there are multiple detected objects, the detection results of multiple detected objects, especially objects in close proximity to each other, are merged more than expected, making detection omissions more likely. On the other hand, if it is set high, duplicate detection results may remain even though the same object is being detected. It is therefore desirable to set it appropriately according to the performance of the object detection model 300.
Note that the individual identification means 410 may perform individual identification with a combination of steps other than the flowchart shown in FIG. 5A. For example, using the class identification information in the first likelihood information 302, processing may be added that limits the targets for which mutual IOU values are calculated in mutual IOU value calculation step S434 to the same class, or that determines the maximum likelihood within the same class in maximum likelihood determination step S436.
Having the model post-processing means 400 and the individual identification means 410 as shown in FIGS. 5A and 5B makes it possible to eliminate abnormal data and to correct the position information 401 including the second detection frame and the second likelihood information 402 into suitable information for each detected object.
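A hedged sketch of the flow of FIG. 5A follows, reusing the iou() helper sketched earlier; processing candidates in descending likelihood order is an assumption about ordering that the flowchart leaves open, and the function name and data layout are illustrative.

```python
def identify_individuals(detections, t=0.5, u=0.7):
    """Individual identification in the spirit of FIG. 5A.

    detections: list of (box, likelihood) with box = (x_min, y_min, x_max, y_max).
    Step S432: discard candidates whose likelihood is below the threshold T.
    Steps S434-S436: candidates whose mutual IOU is at least U are treated as
    duplicate detections, and only the maximum-likelihood one is kept."""
    # S432/S433: likelihood thresholding
    candidates = [d for d in detections if d[1] >= t]
    # Visit in descending likelihood so each kept box is its group's maximum
    candidates.sort(key=lambda d: d[1], reverse=True)
    results = []
    for box, score in candidates:
        # S435: a candidate overlapping a kept box with IOU >= U is a duplicate
        if all(iou(box, kept_box) < u for kept_box, _ in results):
            results.append((box, score))  # S437: output as second detection frame
    return results
```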
The model post-processing means 400 shown in FIG. 1 may also be characterized as follows. When position information 621 including the correct detection frame and correct class identification information 622 exist for each detected object, provided by the annotation means 620 or by open-source datasets on which annotation has already been performed, such as the COCO or Pascal VOC Dataset, the means has a function of correcting the position information 621 including the correct detection frame according to the contents of the various processing parameters 510 using an affine transformation function, a projective transformation function, an arithmetic processor, and the like; and it has an individual identification means 410 that corrects, for each of the one or more detected objects in the output results of the object detection model 300, the position information 301 including zero or more first detection frames, including non-detections and false detections, and the first likelihood information 302 into position information 401 including the maximum-likelihood second detection frame and second likelihood information 402 for each detected object, using an arbitrary threshold T (an arbitrary decimal) for the first likelihood information 302 and an arbitrary threshold U (an arbitrary decimal) for the IOU value, an index representing how much the region of the position information 621 including the correct detection frame and the region of the position information including the first detection frame overlap.
The annotation means 620 may, for example, create supervised data by adding class identification information and a ground truth BBox, the correct frame, to the images stored in the display and data storage means 120.
For example, an example of the processing of the individual identification means 410 of the model post-processing means 400 when position information 621 including the correct detection frame and correct class identification information 622 exist will be described using the flowchart of FIG. 6A and a model input image 470 in which two persons, as detected objects, are close together front to back as shown in FIG. 6B.
First, as shown in FIG. 6A, in input step S430, the position information 301 including zero or more first detection frames, including non-detections and false detections, and the first likelihood information 302 are input for each detected object. In input step S460, the position information 621 including the correct detection frame for each detected object and the correct class identification information 622 are also input. In this example, as shown by the dotted lines in FIG. 6B, position information 471, 472, 473, and 474 including four first detection frames output from the object detection model 300 and likelihoods 475, 476, 477, and 478 from within the four pieces of first likelihood information are input. Also, as shown by the solid lines in FIG. 6B, position information 480 and 481 including the two correct detection frames output from the annotation means 620 and class identification information 482 and 483 indicating "person" as the two correct answers are input.
Next, in setting step S431, the IOU threshold "U" and the likelihood threshold "T" are set. This example shows the case where "U" = 0.5 and "T" = 0.5 are set as the thresholds.
Next, in comparison step S432, the likelihood in the first likelihood information 302 is compared with the threshold "T". If the likelihood is less than the threshold "T" and judged false, in deletion step S433 the corresponding position information 301 including the first detection frame and first likelihood information 302 are removed from the calculation targets; if the likelihood is at least the threshold "T", the result is judged true, and in correct-frame IOU value calculation step S461 the IOU values are calculated for each piece of position information 621 including the correct detection frame against all combinations of the position information 301 including the first detection frames subject to calculation. In FIG. 6B, the likelihood 476 is 0.33, below the threshold "T" = 0.5, so the position information 472 including the first detection frame falsely detected so as to encompass both persons and the first likelihood information including the likelihood 476 are removed from the calculation targets. Three calculation candidates remain, and the IOU values of the position information 471, 473, and 474 including the first detection frames are calculated for each of the position information 480 and 481 including the correct detection frames.
Next, in comparison step S462, all the IOU values are compared with the threshold "U". If an IOU value against the position information 621 including the correct detection frame is less than the threshold "U" and judged false, the candidate is determined to deviate greatly from the correct frame, and in deletion step S433 the corresponding position information 301 including the first detection frame and first likelihood information 302 are removed from the calculation targets; if the IOU value is at least the threshold "U", the candidate is judged true, regarded as a detection candidate with a small difference from the correct frame, and the process proceeds to the next class identification determination step S463. In FIG. 6B, no candidate is judged false, and the three calculation candidates proceed unchanged to the determination of class identification determination step S463.
Next, in class identification determination step S463, the correct class identification information 622 is compared with the class identification information in the first likelihood information 302. If a candidate is identified as a different class, it is judged false, and in deletion step S433 the corresponding position information 301 including the first detection frame and first likelihood information 302 are removed from the calculation targets; if identified as the same class, it is judged true, and the process proceeds to the next maximum likelihood determination step S436. In FIG. 6B, all the candidates are assumed to have been determined to be "person" as a result of class identification, so the three calculation candidates proceed unchanged to the determination of maximum likelihood determination step S436.
 Finally, in maximum likelihood judgment step S436, every applicable candidate except the one with the maximum likelihood is judged false, and in deletion step S433 its position information 301 including the first detection frame and its first likelihood information 302 are removed from the calculation targets; the candidate with the maximum likelihood is judged true, and in output step S464 the position information 401 including the second detection frame, the second likelihood information 402, and the calculated IOU value may be output. In FIG. 6B, for the detection result corresponding to the position information 481 including the correct detection frame and the correct class identification information 483, the maximum likelihood judgment is performed between the two likelihoods 477 (0.75) and 478 (0.92). As a result, the first likelihood information including the likelihood 477 (0.75) and the position information 473 including the first detection frame are removed from the calculation targets, while the first likelihood information including the likelihood 478 (0.92), judged to be the maximum, and the position information 474 including the first detection frame are output in step S464 as the position information 491 including the second detection frame and the second likelihood information including the likelihood 493 (0.92); in addition, the IOU value 495 (0.85) is output in step S464. Likewise, for the detection result corresponding to the position information 480 including the correct detection frame and the correct class identification information 482, the first likelihood information including the likelihood 475 (0.85), judged to be the maximum, and the position information 471 including the first detection frame are output in step S464 as the position information 490 including the second detection frame and the second likelihood information including the likelihood 492 (0.85); in addition, the IOU value 494 (0.73) is output in step S464.
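 To make the flow of steps S432, S461, S462, S463, S436, and S464 concrete, the following is a minimal sketch in Python under stated assumptions: detection frames are (x1, y1, x2, y2) tuples, and all names and sample values (loosely following the FIG. 6B walk-through) are illustrative rather than taken from the patent itself.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple          # (x1, y1, x2, y2), an assumed frame representation
    likelihood: float   # confidence score from the model
    cls: str            # class label

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def match_to_ground_truth(detections, gt_box, gt_cls, t=0.5, u=0.5):
    """Return the surviving (detection, iou) pair for one correct frame, or None."""
    # Step S432/S433: drop candidates whose likelihood is below threshold T.
    kept = [d for d in detections if d.likelihood >= t]
    # Step S461: calculate the IOU against the correct frame for each remaining candidate.
    scored = [(d, iou(d.box, gt_box)) for d in kept]
    # Step S462: drop candidates whose IOU is below threshold U.
    scored = [(d, v) for d, v in scored if v >= u]
    # Step S463: drop candidates identified as a different class.
    scored = [(d, v) for d, v in scored if d.cls == gt_cls]
    # Step S436/S464: keep and output only the maximum-likelihood candidate.
    return max(scored, key=lambda dv: dv[0].likelihood) if scored else None

# Sample values loosely following FIG. 6B (likelihoods 0.85, 0.33, 0.75, 0.92).
dets = [
    Detection((10, 10, 40, 80), 0.85, "person"),
    Detection((5, 5, 90, 90), 0.33, "person"),    # spurious frame enclosing two people
    Detection((55, 12, 85, 82), 0.75, "person"),
    Detection((52, 10, 84, 80), 0.92, "person"),
]
print(match_to_ground_truth(dets, (50, 10, 85, 80), "person"))
```

 Running the sketch keeps the 0.92 candidate for this correct frame and reports its IOU, mirroring how the 0.75 candidate is deleted in the FIG. 6B example.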
 Note that the higher the likelihood threshold T, the more reliable the detected information becomes; on the other hand, detections may fail entirely, so T should be set appropriately according to the performance of the object detection model 300.
 Note that even if the threshold U for the IOU value against the correct frame is set lower than in the individual identification means 410 described with reference to FIGS. 5A and 5B, leaving more calculation candidates, each candidate can be compared directly with the position information 621 including the correct detection frame; this has the advantage that missed detections are less likely and the accuracy of the detection results improves. Furthermore, by processing while varying the threshold U arbitrarily, it is also possible to grasp and verify the accuracy of the detection frames in the position information 301 including the first detection frames calculated by the object detection model 300. Consequently, the learning conditions needed to improve detection-frame accuracy can also be extracted, so the dictionary learning means 600 described later in Embodiment 2 can more precisely strengthen the versatility and robustness of the model learning dictionary 320 with respect to position information including detection frames.
 By providing the model post-processing means 400 and the individual identification means 410 shown in FIGS. 6A and 6B, abnormal data can be eliminated and, for each detected object, the position information 401 including the second detection frame and the second likelihood information 402 can be corrected into suitable information.
 The series of means that generates the position information 401 including the second detection frame and the second likelihood information 402 from the image processing means 100, the model pre-processing means 200, the object detection model 300, and the model post-processing means 400 constitutes the conventional first performance indexing device 30 shown in FIG. 17 for analyzing the robustness of, and reinforcement policies for, a model learning dictionary of a model that detects the positions of objects in images and identifies their classes.
 As an example, the problems of the conventional first performance indexing device 30 are explained using FIGS. 7A and 7B for the case where the YOLO model 360 is applied, a representative of the one-stage DNN models said to excel in processing speed because they detect object positions and identify classes simultaneously. As shown in FIG. 7A, consider detecting a single person in a model input image 201 processed by the model pre-processing means 200 into Xi pixels horizontally and Yi pixels vertically. Due to time-series camera shake or vibration when images are acquired, an image 202 shifted horizontally by 2 pixels and an image 203 shifted horizontally by 4 pixels, relative to the image 201 acquired at the horizontal reference position at a certain reference time, are each input to the YOLO model 360 (object detection model 300). When the position information 207, 208, and 209 including the second detection frames and the likelihoods 214, 215, and 216 in the second likelihood information are calculated as results corrected by the model post-processing means 400, the respective likelihoods may fluctuate greatly, for example 0.92, 0.39, and 0.89, even though the same person is being detected and the person's position in the image has merely wavered slightly and shifted horizontally.
 On the other hand, as shown in FIG. 7B, suppose the camera-to-person distance is 1 m in image 204, 2 m in image 205, and 3 m in image 206, so that the person's size and position in the image change, and the position information 211, 212, and 213 including the second detection frames and the likelihoods 217, 218, and 219 in the second likelihood information are calculated. Given the inherent characteristics of the YOLO model, it is a known issue that detection accuracy and performance degrade as the person becomes smaller or more distant. In this example, however, an irregular result may be obtained: the likelihood 217 in the second likelihood information of image 204 (object distance 1 m) is 0.92 and the likelihood 219 of image 206 (object distance 3 m) is 0.71, whereas the likelihood 218 of image 205 (object distance 2 m) drops sharply to 0.45.
 As a means of grasping these irregular phenomena and analyzing their causes, according to an embodiment of the present invention, as shown in FIG. 8, the model pre-processing means 200 may be provided with a position shift function 220 that, when processing the plurality of model input images 210 to be input to the object detection model 300, uses as various processing parameters 510 position shifts of N (arbitrary integer) steps horizontally and M (arbitrary integer) steps vertically in increments of S (arbitrary decimal) pixels, generating a total of N×M position-shifted model input images 221 through 224. It may also be provided with a function for cropping an arbitrary region. The position shift function 220 may be a function realized by executing an affine transformation function 291 or a projective transformation function 292 on the image processing processor 290.
 Further, according to an embodiment, the model pre-processing means 200 may be provided with a resize function 230 that, when processing the plurality of model input images 210 to be input to the object detection model, additionally uses as various processing parameters 510 L (arbitrary integer) kinds of arbitrary magnification to generate enlarged or reduced images, and with a position shift function 220 that, after resizing, shifts each image by N (arbitrary integer) steps horizontally and M (arbitrary integer) steps vertically in increments of S (arbitrary decimal) pixels, generating a total of N×M×L resized and position-shifted model input images 210. It may also be provided with a function for cropping an arbitrary region. The position shift function 220 and the resize function 230 may be functions realized by executing an affine transformation function 291 or a projective transformation function 292 on the image processing processor 290.
 As an example, FIG. 9 shows the case where three kinds (L=3) of resized images are generated: a reference-size image 232, a 30%-reduced image 231, and a 30%-enlarged image 233. For each of the images 231, 232, and 233, N×M position-shifted images are generated in S-pixel steps as shown in FIG. 8, so that a total of 3×N×M model input images 210 may be produced.
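 The following is a minimal sketch of this pre-processing stage, assuming 8-bit image frames held as NumPy arrays and using OpenCV's resize and warpAffine as stand-ins for the resize function 230 and the affine transformation function 291; the parameter names (scales, S, N, M) mirror the text, while everything else is illustrative.

```python
import numpy as np
import cv2

def generate_model_inputs(src, xi=128, yi=128, scales=(0.7, 1.0, 1.3), S=1, N=32, M=32):
    """Yield (scale, dx, dy, image) for every resize/position-shift combination."""
    for scale in scales:                      # L kinds of magnification (resize function 230)
        resized = cv2.resize(src, None, fx=scale, fy=scale,
                             interpolation=cv2.INTER_LINEAR)
        for n in range(N):                    # N horizontal steps of S pixels
            for m in range(M):                # M vertical steps of S pixels
                dx, dy = n * S, m * S
                mat = np.float32([[1, 0, -dx], [0, 1, -dy]])      # shift left/up by (dx, dy)
                shifted = cv2.warpAffine(resized, mat, (xi, yi))  # also crops to Xi x Yi
                yield scale, dx, dy, shifted

src = np.zeros((256, 256), dtype=np.uint8)    # placeholder for a captured frame
inputs = list(generate_model_inputs(src))
print(len(inputs))                            # 3 * 32 * 32 = 3072 images
```

 Each yielded tuple carries the scale and shift that produced the image, which is the linkage to the various processing parameters 510 that the later verification stage relies on.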
 The plurality of model input images 210 processed by the position shift function 220 and the resize function 230 of the model pre-processing means 200 as shown in FIGS. 8 and 9 are passed through the object detection model 300 and the model post-processing means 400 shown in FIG. 1, which calculate the position information 401 including the second detection frame and the second likelihood information 402 for each of the model input images 210; these results are then input to the robustness verification means 500, which verifies the versatility and robustness of the object detection model 300 on the basis of the various processing parameters 510.
 In the case of a model that detects objects in images acquired by a camera or the like, the items and variation conditions verified by the robustness verification means 500 include, for example: the background (scenery); the camera lens specifications; the detection target region and field of view including the image size, determined by conditions such as the camera's mounting height and elevation/depression angle; the dewarping method when a fisheye lens is used; illuminance changes depending on sunlight or lighting, together with special conditions such as crushed blacks, blown highlights, and backlighting; and weather conditions such as clear sky, cloud, rain, snow, and fog. They also include the position of the target object in the image (left/right, up/down, and depth), its size, luminance level, shape and features including color information, aspect ratio, rotation angle, the number of target objects, the state of their mutual overlap, the kind, size, and attachment position of any accessories, the presence or absence of an IR cut in the lens, the moving speed of the target object, and the moving speed of the camera itself. Depending on the application, items and conditions other than those above may also be added. The various processing parameters 510 are set, or selected and determined, in light of these conditions and items. The various processing parameters 510 are input to the model pre-processing means 200 and the model post-processing means 400. The parameters 510 input to the model pre-processing means 200 include parameters related to the position shift function 220 for verifying the influence of fluctuations in object position, and parameters related to the resize function 230 for verifying versatility and robustness with respect to object size within the detection target region and field of view including the image size, determined by conditions such as the camera lens specifications and the camera's mounting height and elevation/depression angle; multiple other parameters described later may also be combined.
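 As a hedged sketch of how such parameters might be bundled in software, the following dataclass groups the shift/resize settings named in the text with a free-form slot for the capture conditions listed above; the field names are illustrative and are not defined by the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

@dataclass
class ProcessingParameters:
    pixel_step: float = 1.0                       # S: shift increment in pixels (may differ per axis)
    h_steps: int = 32                             # N: number of horizontal shifts
    v_steps: int = 32                             # M: number of vertical shifts
    scales: Tuple[float, ...] = (0.7, 1.0, 1.3)   # L resize factors
    crop_region: Optional[Tuple[int, int, int, int]] = None  # optional (x, y, w, h) crop
    conditions: Dict[str, str] = field(default_factory=dict) # e.g. lens spec, mount height, weather

params = ProcessingParameters(conditions={"lens": "fisheye", "weather": "rain"})
print(params)
```

 Keeping the parameters in one record makes it straightforward to attach them to each per-object detection result 403, as described next.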
 According to an embodiment, the model post-processing means 400 may output to the robustness verification means 500 detection results 403 (including the position information 401 including the second detection frame and the second likelihood information 402) in which the various processing parameters 510 used by the model pre-processing means 200 to process the plurality of images and the output results of the individual identification means 410 are individually linked to each detected object.
 According to an embodiment, the robustness verification means 500 may be characterized by being provided with probability statistical calculation means 520 that, on the basis of the likelihoods in the position information 401 including the second detection frame and the second likelihood information 402 output by the model post-processing means 400, calculates for each of the various processing parameters 510 any or all of: a likelihood distribution 540 showing the variation accompanying position shifts for each detected object; the average likelihood 501, i.e., the mean over the valid region of the likelihood; a likelihood histogram 550; the likelihood standard deviation 502, i.e., the standard deviation over the valid region; the maximum likelihood 503, i.e., the maximum over the valid region; the minimum likelihood 504, i.e., the minimum over the valid region; and the IOU value 505 associated with the likelihood.
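 A minimal sketch of these statistics, assuming the per-image likelihoods for one detected object and one parameter set have already been collected into an N×M array (NaN where nothing was detected); the dictionary keys are illustrative labels for the quantities named above.

```python
import numpy as np

def likelihood_statistics(grid):
    """grid: 2-D array of likelihoods per (horizontal, vertical) shift; NaN = invalid."""
    valid = grid[~np.isnan(grid)]                    # valid region of the likelihood
    hist, _ = np.histogram(valid, bins=20, range=(0.0, 1.0))
    return {
        "distribution": grid,                        # likelihood distribution 540
        "mean": float(valid.mean()),                 # average likelihood 501
        "histogram": hist / hist.sum(),              # histogram 550, normalized to sum 1.0
        "std": float(valid.std()),                   # likelihood standard deviation 502
        "max": float(valid.max()),                   # maximum likelihood 503
        "min": float(valid.min()),                   # minimum likelihood 504
    }

rng = np.random.default_rng(0)
demo = rng.uniform(0.3, 1.0, size=(32, 32))          # stand-in for real detection results
print({k: v for k, v in likelihood_statistics(demo).items() if np.isscalar(v)})
```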
 According to an embodiment, when position information 621 including a correct detection frame and correct class identification information 622 exist for each detected object, the robustness verification means 500 may be characterized by being provided with probability statistical calculation means 520 that, on the basis of the IOU value between the position information 401 including the second detection frame output by the model post-processing means 400 and the position information 621 including the correct detection frame, and of the class identification accuracy rate calculated from the class identification information in the second likelihood information 402 and the correct class identification information 622, calculates for each of the various processing parameters 510 any or all of: IOU distributions and class identification accuracy distributions showing the variation accompanying position shifts for each detected object; the average IOU value and average class identification accuracy rate, i.e., the means over the valid regions; a histogram of the IOU value and a histogram of the class identification accuracy rate; the standard deviations of the IOU value and of the class identification accuracy rate over the valid regions; the maximum IOU value and maximum class identification accuracy rate, i.e., the maxima over the valid regions; and the minimum IOU value and minimum class identification accuracy rate, i.e., the minima over the valid regions.
 According to an embodiment, the robustness verification means 500 may further be characterized by having learning reinforcement necessary item extraction means 530 that, for each of the various processing parameters 510, provides any or all of: extraction of positions or regions in the likelihood distribution 540 of each detected object that fall at or below an arbitrary threshold; extraction of detected objects whose average likelihood 501 is at or below an arbitrary threshold; extraction of detected objects whose likelihood standard deviation 502 is at or above an arbitrary threshold; extraction of detected objects whose maximum likelihood 503 is at or below an arbitrary threshold; extraction of detected objects whose minimum likelihood 504 is at or below an arbitrary threshold; and extraction of detected objects whose IOU value 505 is at or below an arbitrary threshold.
 According to an embodiment, the robustness verification means 500 may further be characterized by having learning reinforcement necessary item extraction means 530 that, for each of the various processing parameters 510, provides any or all of: extraction of positions or regions in the IOU distribution of each detected object that fall at or below an arbitrary threshold; extraction of positions or regions in the class identification accuracy distribution that fall at or below an arbitrary threshold; extraction of detected objects whose average IOU value is at or below an arbitrary threshold; extraction of detected objects whose average class identification accuracy rate is at or below an arbitrary threshold; extraction of detected objects whose IOU standard deviation is at or above an arbitrary threshold; extraction of detected objects whose class identification accuracy standard deviation is at or above an arbitrary threshold; extraction of detected objects whose maximum IOU value is at or below an arbitrary threshold; extraction of detected objects whose maximum class identification accuracy rate is at or below an arbitrary threshold; extraction of detected objects whose minimum IOU value is at or below an arbitrary threshold; and extraction of detected objects whose minimum class identification accuracy rate is at or below an arbitrary threshold.
 According to an embodiment, the probability statistical calculation means 520 and the learning reinforcement necessary item extraction means 530 of the robustness verification means 500 may be characterized by having a function that, when performing probability statistical calculations based on the likelihood, the IOU value, and the class identification accuracy rate, excludes from calculation those images in which an arbitrary proportion of the pixels belonging to the target detected object are missing.
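 The following hedged sketch shows one way such extraction could flag, per parameter set, the detected objects whose statistics cross the (arbitrary) thresholds named above; the threshold names and the stats-dictionary layout follow the earlier statistics sketch, not the patent itself.

```python
def extract_reinforcement_items(stats_by_object,
                                mean_max=0.70,   # flag if average likelihood <= 70%
                                std_min=0.10,    # flag if std deviation >= 10%
                                min_max=0.30):   # flag if minimum likelihood <= 30%
    flagged = {}
    for obj_id, s in stats_by_object.items():
        reasons = []
        if s["mean"] <= mean_max:
            reasons.append("low average likelihood")
        if s["std"] >= std_min:
            reasons.append("unstable against position shifts (high std deviation)")
        if s["min"] <= min_max:
            reasons.append("risk of detection failure (low minimum likelihood)")
        if reasons:
            flagged[obj_id] = reasons            # candidates for additional training
    return flagged

# Illustrative values only, loosely shaped like the three resize cases discussed later.
stats = {"person_-30%": {"mean": 0.61, "std": 0.12, "min": 0.25},
         "person_ref":  {"mean": 0.72, "std": 0.14, "min": 0.28},
         "person_+30%": {"mean": 0.90, "std": 0.06, "min": 0.55}}
print(extract_reinforcement_items(stats))
```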
 The reinforcement targets of the model learning dictionary 320 can be specified from the detected objects and judgment conditions extracted using the learning reinforcement necessary item extraction means 530. Furthermore, problems of the object detection model 300 can also be extracted. In addition, by inputting this extracted information 531 into the dictionary learning means 600 described later in Embodiment 2 and reflecting it in the selection of learning material, in augmentation techniques, and in learning parameters, the versatility and robustness of the model learning dictionary 320 can be strengthened.
 This series of means, namely the image processing means 100; the model pre-processing means 200 provided with the position shift function 220, the resize function 230, and the like; the plurality of model input images 210 processed by the model pre-processing means 200 according to the various processing parameters 510; the object detection model 300 and the model post-processing means 400, which calculate the position information 401 including the second detection frame and the second likelihood information 402 for the plurality of model input images 210; and the robustness verification means 500, which receives the detection results 403 in which the various processing parameters 510, the position information 401 including the second detection frame, and the second likelihood information 402 output from the model post-processing means 400 are individually linked to each detected object, and verifies the versatility and robustness of the object detection model 300, constitutes one embodiment of the performance indexing device 10 in object detection of the present invention for analyzing the performance of the object detection model 300 and the robustness of, and learning reinforcement policies for, the model learning dictionary 320 used for detecting object positions and identifying classes in images. The performance indexing device in object detection of the present invention may further include the dictionary learning means 600 of Embodiment 2, described later, for generating the model learning dictionary 320, and a second mAP calculation means 650.
 As an example, FIGS. 10 and 11 show the results of using the performance indexing device 10 in object detection according to Embodiment 1 of the present invention shown in FIG. 1 to analyze the phenomenon, explained with FIGS. 7A and 7B, in which the likelihood and other detection results of the object detection model 300 vary irregularly with fluctuations in the person's position in the image and with the size of the detected object.
 The analysis results shown in FIGS. 10 and 11 were obtained with Xi, the horizontal pixel count of the plurality of model input images shown in FIGS. 7A, 7B, 8, and 9, set to 128, and Yi, the vertical pixel count, set to 128. The detection target is a single person. As shown in FIG. 9, the resize function 230 of the model pre-processing means 200 is used to produce three kinds of resized images (L=3): a reference-size image 232, a 30%-reduced image 231, and a 30%-enlarged image 233; for each of the three resized images, the position shift function 220 of the model pre-processing means 200 is used to shift position in 1-pixel steps (S=1), 32 steps horizontally (N=32) and 32 steps vertically (M=32), generating a total of 3×32×32 model input images 210. The analysis results of FIGS. 10 and 11 were then obtained by using the YOLO model 360 (object detection model 300) shown in FIGS. 3A and 3B, configured with 128 horizontal and 128 vertical input pixels, together with the model post-processing means 400, to calculate the position information 401 including the second detection frame and the second likelihood information 402 for the single person over the generated 3×32×32 model input images 210; these were input to the robustness verification means 500, where, for the three resize parameters among the various processing parameters 510, the probability statistical calculation means 520 calculated the likelihood distribution 540 showing the variation accompanying position shifts for the single person, the average likelihood 501 (the mean over the valid region of the likelihood), the likelihood histogram 550, the likelihood standard deviation 502 (the standard deviation over the valid region), the maximum likelihood 503 (the maximum over the valid region), and the minimum likelihood 504 (the minimum over the valid region). Note that the likelihood distribution 540, the average likelihood 501, the likelihood histogram 550, the likelihood standard deviation 502, the maximum likelihood 503, and the minimum likelihood 504 may be expressed as percentages (%), with the maximum likelihood value of 1 corresponding to 100%. FIGS. 10 and 11 express the likelihood as percentages (%); the values may also be processed directly as decimals without conversion to percentages. Although not shown in this example, when the position information 621 including correct detection frames can be referenced from the annotation means 620 of FIG. 1, not only the distribution of the likelihoods in the second likelihood information 402 but also the IOU distribution and statistical results for the IOU value 505 against the position information 401 including the second detection frame may be calculated. Likewise, although not shown in this example, when multiple classes other than "person" are to be identified, the class identification distribution and statistical results for the class identification information in the second likelihood information 402 may be calculated in addition to the distribution of the likelihoods in the second likelihood information 402.
 The likelihood distribution 541 shown in FIG. 10 was calculated as follows: as shown in FIG. 9, the model input image 232 was used as the reference image and, under the direction of the various processing parameters 511, reduced by 30% (L=1) by the resize function 230 into the model input image 231 of FIG. 9; then, as shown in FIG. 8, the plurality of model input images position-shifted by the position shift function 220 in 1-pixel steps (S=1), 32 steps horizontally (N=32) and 32 steps vertically (M=32), were input to the YOLO model 360 (object detection model 300), the model post-processing means 400, and the robustness verification means 500 provided with the probability statistical calculation means 520. Here, the various processing parameters 511 through 513 shown in FIG. 10 are linked to the position information 401 including the second detection frame and the second likelihood information 402 as the per-object detection results 403 output from the model post-processing means 400, and may be utilized when the probability statistical calculation means 520 calculates analysis results for each processing parameter.
 Similarly, the likelihood distribution 542 was calculated using the various processing parameters 512 for the reference-size (unmagnified, L=2) model input image 232 shown in FIG. 9, and the likelihood distribution 543 was calculated using the various processing parameters 513 for the 30%-enlarged (L=3) model input image 233 shown in FIG. 9. The likelihood distributions 541, 542, and 543 shown in FIG. 10 are rendered, following the white-to-black gradation bar 521, in shades from white (corresponding to 0% likelihood) to black (100% likelihood) according to the level of likelihood (%) for fluctuations in the position (in pixels) at which the person appears on the screen. Here, for the reference-size (L=2) model input image 232 of FIG. 9, the likelihood (A) 522, likelihood (B) 523, likelihood (C) 524, and likelihood (D) 525 of the likelihood distribution 542 in FIG. 10 correspond, respectively, to mappings of the likelihoods calculated with the model input image (A) 221, the model input image (B) 222, the model input image (C) 223, and the model input image (D) 224 processed by the position shift function 220 (with S=1, N=32, M=32) shown in FIG. 8 as references. In the likelihood distributions 541, 542, and 543, a stronger black level indicates a higher likelihood, and conversely a stronger white level indicates a lower likelihood. Notably, within each likelihood distribution, between the high-likelihood black-level areas there exist regions, arranged in a specific grid-like pattern, where the gray or white level is strong and the likelihood is low. This result can be taken to show the phenomenon, explained with FIGS. 7A and 7B, in which the likelihood varies irregularly with fluctuations in the screen position of the detected object (one person in this example). Moreover, when a specific grid-like pattern such as in this example is present, it is highly likely that a problem exists in the object detection model 300 itself; when the likelihood is low in arbitrary regions, it is highly likely that the training of the model learning dictionary 320 is insufficient. Detailed factor estimation regarding this likelihood fluctuation with detected-object position is discussed together with the explanation of FIG. 11.
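 A minimal sketch of rendering such a distribution as the white (0%) to black (100%) map described above, assuming matplotlib is available; the data here is a random stand-in for real per-shift likelihoods.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
likelihood_map = rng.uniform(0.3, 1.0, size=(32, 32))   # (dy, dx) -> likelihood

# "gray_r" maps 0% to white and 100% to black, matching the gradation bar 521.
plt.imshow(likelihood_map * 100.0, cmap="gray_r", vmin=0, vmax=100)
plt.colorbar(label="likelihood (%)")
plt.xlabel("horizontal shift (px)")
plt.ylabel("vertical shift (px)")
plt.title("likelihood distribution per position shift")
plt.show()
```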
 The aforementioned parameters S, N, and M, which are among the various processing parameters 510 of the position shift function 220, may be changed according to use and purpose. The pixel-step setting S may be given different values in the horizontal and vertical directions. Setting S small has the merit of enabling detailed verification, but the demerit of increasing calculation time. The processing parameters for the N horizontal and M vertical position shifts are desirably set to appropriate values that allow positional fluctuation to be verified, according to the structure of the object detection model 300.
 Next, the likelihood histogram 551 shown in FIG. 11 is obtained by normalizing the frequencies of the likelihood (%) calculated by the probability statistical calculation means 520 for the likelihood distribution 541 shown in FIG. 10 (so that the frequencies sum to 1.0). The statistical results 561 display the average likelihood (%), likelihood standard deviation (%), maximum likelihood (%), and minimum likelihood (%) for the likelihood distribution 541. The conventional-method likelihood 571 displays the likelihood calculated pinpoint for the model input image 231, the reference image for position shifting shown in FIG. 9, corresponding to the likelihood calculated by the conventional first performance indexing device 30 described above. Similarly, the likelihood histograms 552 and 553, the statistical results 562 and 563, and the conventional-method likelihoods 572 and 573 shown in FIG. 11 correspond, respectively, to the likelihood distributions 542 and 543 shown in FIG. 10.
 The average likelihood (%) in the statistical results 561, 562, and 563 is an index for verifying average detection accuracy and performance against fluctuations in screen position; the higher it is, the higher the performance of the object detection model 300 including the model learning dictionary 320 may be considered. The likelihood standard deviation (%) is an index of the dispersion of the likelihood against fluctuations in screen position; the smaller it is, the more stable the object detection model 300 including the model learning dictionary 320 may be considered. Conversely, when the likelihood standard deviation (%) is large, either a problem exists in the object detection model 300 itself, or the training of the model learning dictionary 320 with respect to detected-object position on the screen is insufficient. Furthermore, by examining the likelihood distributions 541, 542, and 543 explained in FIG. 10, it is possible to verify which factor dominates. By also verifying the maximum likelihood (%) and minimum likelihood (%), it is possible to judge whether the dispersion of the likelihood is close to a normal distribution. The higher the maximum likelihood (%) and minimum likelihood (%), the higher the performance of the object detection model 300 including the model learning dictionary 320 may be considered. On the other hand, when they become extremely low, either a problem exists in the object detection model 300 itself, or the training of the model learning dictionary 320 with respect to detected-object position on the screen is insufficient.
 Although this example shows the case where the detection target is a single person, when there are multiple detection targets, or when multiple objects of classes other than "person" are present, the likelihood distribution and its statistical results, the IOU distribution and its statistical results, and the class identification distribution and its statistical results may be calculated for each detection target.
 Using FIGS. 10 and 11, which show the verification results calculated by the performance indexing device 10 in object detection according to Embodiment 1 of the present invention, an example is now given of a verification method for detailed detection accuracy and performance, problem analysis, and factor analysis of the YOLO model 360 (object detection model 300) of FIGS. 3A and 3B, configured with 128 horizontal and 128 vertical input pixels, applied to the model input images 210 resized to three sizes and containing one person as the detected object.
 Note that the verification method described in this example assumes a case where, in order to miniaturize a camera or the like for object detection and reduce its power consumption and cost, the image size input to the YOLO model 360 must be made smaller than the originally recommended input image size, because of restrictions on the mounting area and power consumption of the electronic circuitry used to run the YOLO model 360, on memory capacity, or on the performance of the arithmetic processor such as an on-board DSP (digital signal processor); the results do not necessarily arise with the recommended variations of the YOLO model 360.
 As mentioned above, in the likelihood distributions 541, 542, and 543 shown in FIG. 10, it can be confirmed that between the high-likelihood black-level areas there exist regions, arranged in a specific grid-like pattern, where the gray or white level is strong and the likelihood is low. For this reason, as explained with FIG. 7A, the phenomenon arises in which, even when the same object is being detected, the likelihood, one of the detection results, varies greatly if the position of the detected object in the image fluctuates. Here, the specific grid pattern seen in the likelihood distributions 541 and 542 is characterized by cells of roughly 8 pixels square, while that seen in the likelihood distribution 543 is characterized by cells of roughly 16 pixels square. One reason the pattern characteristics differ is thought to be whether, depending on the size of the detected object, the detection came from the second detection layer 362 or the third detection layer 363 of the YOLO model 360 shown in FIG. 3A. For the likelihood distributions 541 and 542, in which the person to be detected is smaller, the detection results of the third detection layer 363 are considered to have been mainly output; for the likelihood distribution 543, in which the person is larger, the detection results of the second detection layer 362 are considered to have been mainly output. On the other hand, the cause of the phenomenon in which the likelihood drops in a specific grid pattern is considered to lie in the grid cells (a 7×7 example in FIGS. 3A and 3B) of step 314, which calculates the conditional class probabilities shown in FIG. 3B. When a one-stage DNN model such as YOLO, said to excel in processing speed, is used, the region is divided into grid cells of arbitrary size so that object position detection and class identification (classification) are performed simultaneously, and the conditional class probability Pr(Class_i|Object) 315 is calculated per cell; in the final detection step 316, it is multiplied by the Confidence 313 calculated in parallel to yield the confidence score 317. Because the two are multiplied, the confidence score 317 (corresponding to the likelihood) at cell boundaries depends on the grid-cell structure of the conditional class probabilities, and this is considered to lead to the phenomenon in which the likelihood of the specific grid pattern drops as the person's position in the image fluctuates. When a specific grid-like pattern such as in the verification results of this example is present, it is highly likely that a problem exists in the object detection model 300 itself or in its algorithm, so it becomes more feasible to extract latent problems of the object detection model 300 itself and to formulate solutions. Furthermore, it also becomes possible to determine accurately whether the versatility and robustness of the model learning dictionary 320 against the various variation conditions are incomplete.
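 As a hedged sketch of the multiplication described for the final detection step 316, the following follows the original YOLO formulation, where the class confidence score is Pr(Class_i|Object) × Pr(Object) × IOU(pred, truth) = Pr(Class_i) × IOU; the grid shapes (7×7 cells, 2 boxes per cell, 1 class) are illustrative, not the exact model of the patent.

```python
import numpy as np

S_grid, B, C = 7, 2, 1                                    # 7x7 cells, 2 boxes/cell, 1 class
rng = np.random.default_rng(2)
cond_class_prob = rng.uniform(size=(S_grid, S_grid, C))   # step 314: Pr(Class_i | Object) per cell
confidence = rng.uniform(size=(S_grid, S_grid, B))        # step 313: Pr(Object) * IOU per box

# Step 316: class confidence score per box = the cell's conditional class probability
# multiplied by the box confidence. Because the cell term is constant inside each grid
# cell and changes abruptly at cell boundaries, a detection whose center drifts across
# a boundary can see its score jump: the grid-pattern likelihood drop described above.
score = cond_class_prob[:, :, :, None] * confidence[:, :, None, :]   # shape (S, S, C, B)
print(score.shape, float(score.max()))
```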
 A more detailed verification is now performed using FIG. 11. Given the inherent characteristics of YOLO, it is known that detection accuracy and performance degrade as the person becomes smaller or more distant, although improvements have been reported in newer versions of YOLO; the YOLO model 360 used in this example is assumed to be a pre-improvement version. First, comparing the conventional-method likelihoods (%) 571, 572, and 573 calculated by the conventional first performance indexing device 30, the values change as 70.12%, 49.27%, and 94.45% as the person size increases. Here, the conventional-method likelihood (%) 572 for the reference size, 49.27%, is considerably lower than the conventional-method likelihood (%) 571 for the 30%-reduced case. Hence, looking at this result alone could lead to the erroneous conclusion that the training of the model learning dictionary 320 for the reference-size person is insufficient, and unnecessary additional training might be performed. Conversely, the 70.12% value of the conventional-method likelihood (%) 571 at 30% reduction might be treated as a passing score, so that additional training that ought to be carried out is not performed, leaving the strengthening of the versatility and robustness of the model learning dictionary 320 insufficient.
 Meanwhile, the results calculated by the performance indexing device 10 in object detection according to Embodiment 1 of the present invention are examined. The likelihood histograms 551, 552, and 553 show at what levels the likelihoods of the likelihood distributions 541, 542, and 543 of FIG. 10 occur. Performance may be considered better when the frequencies concentrate at the high-likelihood right end, and stability greater the smaller the dispersion. Examining the likelihood histograms 551, 552, and 553, unlike the conventional-method likelihoods (%) 571, 572, and 573, the likelihoods (%) are seen to be distributed in order of person size. Examining the statistical results 561, 562, and 563 obtained by statistically analyzing the likelihood distributions 541, 542, and 543 and the likelihood histograms 551, 552, and 553 shown in FIG. 10, unlike the results confirmed with the conventional-method likelihoods (%) 571, 572, and 573, the average likelihood (%) increases with person size as it properly should: 60.85% < 71.82% < 89.98%. It was thus confirmed that the conventional-method likelihood (%) results 571, 572, and 573 suffer from the problem that the detection results waver with fluctuations in the person's position in the image, depending on the specific grid pattern.
 Moreover, by utilizing the learning reinforcement necessary item extraction means 530 provided in the robustness verification means 500, if, for example, the development goal for the model learning dictionary 320 was an average likelihood (%) of 70% or higher, then setting the average-likelihood threshold to 70% reveals that for the 30%-reduced size, although the conventional-method likelihood (%) result happened to appear to meet the goal, the actual value falls below the threshold by more than 9 points, so it can be brought to light that reinforcement by additional training is necessary for the 30%-reduced person. Also, for example, setting the threshold for the likelihood standard deviation (%) to 10% and checking for standard deviations of 10% or more extracts the reference-size person and the 30%-reduced person as targets. From this it can be confirmed that the model learning dictionary 320 needs strengthening against screen-position fluctuation for objects corresponding to the reference-size person and the 30%-reduced person. Furthermore, by also consulting other verification results such as the likelihood distributions 541 and 542 and the histograms 551 and 552, it becomes possible to notice that a latent likelihood drop dependent on the DNN structure and algorithm described above may be occurring, requiring improvement or reinforcement. Similarly, the maximum likelihood (%) and minimum likelihood (%) can be utilized in various judgments; for example, if the minimum-likelihood threshold is set to 30%, then for the reference-size person and the 30%-reduced person, which fall at or below 30%, detection could fail if the object stops at such a position, so these latent issues and problems can be extracted in advance. This can further be tied to formulating methods for improving the object detection model 300, the model pre-processing means 200, and the model post-processing means 400.
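 As a self-contained illustration of these threshold checks (70% mean, 10% standard deviation, 30% minimum), the following applies them to the three resize cases; the mean values are the reported 60.85%, 71.82%, and 89.98%, while the standard deviation and minimum values are assumptions consistent with the text's statements, not reported figures.

```python
cases = {  # size tag -> (mean, std, min), as fractions; std/min values are assumed
    "person_-30%": (0.6085, 0.12, 0.25),
    "person_ref":  (0.7182, 0.13, 0.28),
    "person_+30%": (0.8998, 0.05, 0.60),
}
for tag, (mean, std, lo) in cases.items():
    flags = []
    if mean < 0.70: flags.append("mean below 70% target")
    if std >= 0.10: flags.append("unstable position response")
    if lo <= 0.30:  flags.append("possible detection loss")
    print(tag, flags or ["ok"])
```

 Consistent with the discussion above, this flags the 30%-reduced person on all three criteria, the reference size on stability and minimum likelihood, and passes the 30%-enlarged person.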
 As an example, FIG. 12 shows the result of calculating a likelihood distribution 544 by the robustness verification means 500, provided with the probability statistical calculation means 520 and the learning reinforcement necessary item extraction means 530, for a total of 64×64 model input images 210 position-shifted by the position shift function 220 of the model pre-processing means 200 in 1-pixel steps (S=1), 64 steps horizontally (N=64) and 64 steps vertically (M=64), using as the reference image a model input image 526 of 128 horizontal and 128 vertical pixels in which the person is located far away (toward the top of the screen). The likelihood distribution 544 is rendered, following the white-to-black gradation bar 521, in shades from white (corresponding to 0% likelihood) to black (100% likelihood) according to the level of likelihood (%) for fluctuations in the position (in pixels) at which the person appears on the screen. By widening the verification range of screen-position fluctuation for the single person, it can be seen that the upper side of the likelihood distribution 544, that is, the region 527 enclosed by the dotted line, is a region where the white level is stronger and the likelihood lower than in other regions. In this example, the region 527 enclosed by the dotted line may be taken to indicate the case where the person is present in the region 528 enclosed by the dotted line, which extends to the lower right of the person's center in the model input image 526. The specific grid pattern described above can also be observed, but since the region 527 enclosed by the dotted line contains regions of particularly low likelihood, the learning reinforcement necessary item extraction means 530 can extract it as a region where the likelihood drops in a particularly concentrated manner. Therefore, when the person in the model input image is located in the region 528 enclosed by the dotted line, it can be confirmed that the object detection capability is low, making it possible to notice that the model learning dictionary 320 needs strengthening. The dictionary learning means 600 of Embodiment 2, described later, can then strengthen the model learning dictionary 320 efficiently, leading to enhanced versatility and robustness of the model learning dictionary 320 against positional fluctuation of detected objects, including the background.
Note that while this verification example uses a single person as the detection target, when there are multiple targets, or multiple objects of classes other than person, the reinforcement targets of the model learning dictionary 320 may be specified per detection target from the detected objects and judgment conditions extracted by the learning-reinforcement item extraction means 530 applied to the likelihood distribution and its statistics, the IOU distribution and its statistics, and the class identification distribution and its statistics. Problems of the object detection model 300 may also be extracted. Further, with reference to this extracted information 531, the generality and robustness of the model learning dictionary 320 may be strengthened by the dictionary learning means 600 described later.
Although this example describes YOLO with specific constraints applied, the object detection model 300 may also be a one-stage DNN model of the same type, such as SSD. It may likewise be a two-stage DNN model, typified in this description by EfficientDet, that processes object position detection and class identification in two separate stages. It may equally be an object detection model or machine learning model that does not use a neural network.
The performance indexing device 10 for object detection according to the present invention, comprising the image processing means 100, model preprocessing means 200, object detection model 300, model postprocessing means 400, and robustness verification means 500 described so far in Embodiment 1, can be expected to provide the following usefulness and effects.
According to one embodiment, checking the likelihood distribution 540 over the position of each detected object, computed from the multiple model input images 210 produced by the position shift function 220 of the model preprocessing means 200, makes it possible to extract the characteristic that likelihood fluctuates with the on-screen position of the detected object owing to latent problems of the object detection model. Inference-time accuracy and performance issues latent in the neural network itself, including the DNN model inside the object detection model, can thus be extracted accurately. Furthermore, methods and schemes for resolving those issues can be formulated effectively, enabling improvement of the detection accuracy and detection performance of the object detection model.
According to one embodiment, the robustness verification means 500 further comprises probability/statistics computation means 520 that computes any or all of: the likelihood distribution 540 showing the position-shift-induced variation per detected object; the average likelihood 501, the mean over the valid likelihood region; the likelihood histogram 550; the likelihood standard deviation 502, the standard deviation over the valid region; the maximum likelihood 503 and minimum likelihood 504, the maximum and minimum over the valid region; and the IOU value 505 associated with the likelihood. This makes it possible to extract the characteristic that likelihood fluctuates with the on-screen position of the detected object owing to latent problems of the object detection model 300, so that inference-time accuracy and performance issues latent in the neural network itself, including the DNN model 310, can be extracted more accurately. Methods and schemes for resolving those issues can then be formulated more effectively, further improving the detection accuracy and detection performance of the object detection model 300. Moreover, when combined with the various processing parameters 510 other than position shift, weaknesses in generality and in robustness against varying conditions that originate in the model learning dictionary 320 created by deep learning, together with the corresponding reinforcement policy, can be isolated from the latent problems of the neural network itself and grasped accurately. Effective training image data and supervised data can therefore be applied in deep learning and the like, strengthening the generality and robustness of the model learning dictionary 320.
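A sketch of the statistics themselves, computed over the valid region of a likelihood map such as the one produced above; the mask argument, which would exclude offsets rejected as abnormal data or clipped objects, and the bin count are illustrative assumptions.

```python
import numpy as np

def likelihood_statistics(dist, valid_mask=None, bins=20):
    """Average likelihood, standard deviation, maximum, minimum, and a
    histogram over the valid region of a 2-D likelihood distribution (%)."""
    vals = dist[valid_mask] if valid_mask is not None else dist.ravel()
    hist, edges = np.histogram(vals, bins=bins, range=(0.0, 100.0))
    return {"mean": float(vals.mean()), "std": float(vals.std()),
            "max": float(vals.max()), "min": float(vals.min()),
            "histogram": (hist, edges)}
```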
According to one embodiment, when correct-answer position information 621 including ground-truth detection frames and correct-answer class identification information 622 are available, the robustness verification means 500 further comprises probability/statistics computation means 520 that computes any or all of: the IOU distribution and the class-identification accuracy distribution showing the position-shift-induced variation per detected object; the average IOU value and the average class-identification accuracy; histograms of the IOU value and of the class-identification accuracy; their standard deviations; the maximum IOU value and maximum class-identification accuracy; and the minimum IOU value and minimum class-identification accuracy. This makes it possible to extract the characteristic that position information including the detection frame and class identification information fluctuate with the on-screen position of the detected object owing to latent problems of the object detection model. Inference-time accuracy and performance issues latent in the neural network itself, including the DNN model inside the object detection model, can therefore be extracted more accurately, and methods and schemes for resolving them can be formulated more effectively, further improving the detection accuracy and detection performance of the object detection model. Moreover, when combined with the various processing parameters 510 other than position shift, weaknesses in generality and in robustness against varying conditions that originate in the model learning dictionary 320 created by deep learning, together with the corresponding reinforcement policy, can be isolated from the latent problems of the neural network itself and grasped accurately. Effective training image data and supervised data can therefore be applied in deep learning and the like, strengthening the generality and robustness of the model learning dictionary.
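The IOU value used throughout is the standard intersection-over-union of a predicted detection frame and the ground-truth BBox. A straightforward implementation, assuming (x1, y1, x2, y2) corner coordinates, looks like this:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```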
According to one embodiment, the model preprocessing means 200 further generates images enlarged or reduced at L (an arbitrary integer) kinds of arbitrary magnifications as part of the various processing parameters 510 and then produces the position-shifted images described above, so that the robustness verification means 500 with its probability/statistics computation means 520 can examine, for each of the L sizes, the likelihood distribution 540 over each detected object's position, the average likelihood 501 over the valid likelihood region, the likelihood histogram 550, the likelihood standard deviation 502, the maximum likelihood 503, the minimum likelihood 504, and the IOU value 505. It can further examine, per size, the distribution, histogram, standard deviation, maximum, and minimum of the IOU values over each detected object's position, and likewise for the class-identification accuracy. This enables improvement of the DNN model with respect to detected-object size and strengthening of the generality and robustness of the model learning dictionary 320.
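A sketch of the resize step that precedes the position sweep: each of the L magnifications is applied and the result pasted onto a fixed-size canvas filled with the mean luminance, so the model input resolution stays constant. OpenCV (`cv2`) is assumed, and the scale values merely echo the 30%-reduction example; neither is specified by the patent.

```python
import cv2
import numpy as np

def resize_on_canvas(img, scale):
    """Scale the image and paste it onto a same-sized, mean-filled canvas."""
    h, w = img.shape[:2]
    scaled = cv2.resize(img, (max(1, int(w * scale)), max(1, int(h * scale))))
    canvas = np.full_like(img, int(img.mean()))
    sh, sw = scaled.shape[:2]
    canvas[:min(h, sh), :min(w, sw)] = scaled[:min(h, sh), :min(w, sw)]
    return canvas

reference_img = np.zeros((128, 128), np.uint8)        # placeholder input
for scale in (1.0, 0.85, 0.7):                        # L = 3 magnifications
    variant = resize_on_canvas(reference_img, scale)  # then run the N x M sweep
```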
According to one embodiment, the model postprocessing means 400 further comprises the individual identification means 410, which eliminates abnormal data and corrects, for each detected object, the position information including the detection frame and the likelihood information into suitable values. The likelihood distribution 540 over each detected object's position, the average likelihood 501 over the valid likelihood region, the likelihood histogram 550, the likelihood standard deviation 502, the maximum likelihood 503, the minimum likelihood 504, and the IOU value 505 can therefore be computed more accurately, as can the distribution, histogram, standard deviation, maximum, and minimum of the IOU values and of the class-identification accuracy over each detected object's position. This allows the DNN model to be improved, and the generality and robustness of the model learning dictionary 320 to be strengthened, more accurately.
According to one embodiment, when correct-answer position information 621 including the detection frame and correct-answer class identification information 622 exist for each detected object, the model postprocessing means further uses the individual identification means 410 to eliminate abnormal data and correct the position information including the detection frame and the likelihood information into optimal values per detected object. The likelihood distribution 540 over each detected object's position, the average likelihood 501, the likelihood histogram 550, the likelihood standard deviation 502, the maximum likelihood 503, the minimum likelihood 504, and the IOU value 505 can then be computed accurately by comparison with the correct-answer data, and the distribution, histogram, standard deviation, maximum, and minimum of the IOU values and of the class-identification accuracy can be confirmed accurately in the same way. This allows the DNN model to be improved, and the generality and robustness of the model learning dictionary 320 to be strengthened, more accurately.
According to one embodiment, the model postprocessing means 400 further links each output result with the various processing parameters 510 individually for each detected object and outputs them to the robustness verification means. This series of means makes it possible to extract, per processing parameter 510, the characteristic that likelihood fluctuates with the on-screen position of the detected object owing to latent problems of the object detection model, so that inference-time accuracy and performance issues latent in the neural network itself, including the DNN model inside the object detection model, can be extracted more accurately.
According to one embodiment, the robustness verification means 500 further has learning-reinforcement item extraction means 530 comprising any or all of, per processing parameter 510: extraction of positions or regions in each detected object's likelihood distribution 540 at or below an arbitrary threshold; extraction of detected objects whose average likelihood 501 is at or below an arbitrary threshold; extraction of detected objects whose likelihood standard deviation 502 is at or above an arbitrary threshold; extraction of detected objects whose maximum likelihood 503 is at or below an arbitrary threshold; extraction of detected objects whose minimum likelihood 504 is at or below an arbitrary threshold; and extraction of detected objects whose IOU value 505 is at or below an arbitrary threshold. Weaknesses in generality and in robustness against varying conditions that originate in the model learning dictionary 320 created by deep learning, together with the reinforcement policy, can thereby be isolated from the latent problems of the neural network itself, including the DNN model, and grasped more accurately. Effective training image data and supervised data can therefore be applied in deep learning and the like, strengthening the generality and robustness of the model learning dictionary 320.
According to one embodiment, the robustness verification means 500 further has learning-reinforcement item extraction means 530 comprising any or all of, per processing parameter 510: extraction of positions or regions in each detected object's IOU distribution at or below an arbitrary threshold; extraction of positions or regions in each detected object's class-identification accuracy distribution at or below an arbitrary threshold; extraction of detected objects whose average IOU value is at or below an arbitrary threshold; extraction of detected objects whose average class-identification accuracy is at or below an arbitrary threshold; extraction of detected objects whose IOU standard deviation is at or above an arbitrary threshold; extraction of detected objects whose class-identification accuracy standard deviation is at or above an arbitrary threshold; extraction of detected objects whose maximum IOU value is at or below an arbitrary threshold; extraction of detected objects whose minimum IOU value is at or below an arbitrary threshold; and extraction of detected objects whose minimum class-identification accuracy is at or below an arbitrary threshold. On the basis of position information including the detection frame and of class identification information, weaknesses in generality and in robustness against varying conditions that originate in the model learning dictionary 320 created by deep learning, together with the reinforcement policy, can thereby be isolated from the latent problems of the neural network itself, including the DNN model, and grasped more accurately. Effective training image data and supervised data can therefore be applied in deep learning and the like, strengthening the generality and robustness of the model learning dictionary 320.
According to one embodiment, the probability/statistics computation means 520 and the learning-reinforcement item extraction means 530 of the robustness verification means 500 further provide a function that, in the probability/statistics computation based on likelihood, IOU value, and class-identification accuracy, excludes from the computation any image in which an arbitrary proportion of the pixels belonging to the target detected object is missing. Even when the valid extent of the detection target is lost depending on the object's position in the reference verification image or on its position after processing with the various processing parameters 510 of the model preprocessing means 200, the performance and characteristics of the object detection model 300 and the generality and robustness of the model learning dictionary 320 can then be verified accurately. This enables improvement of the DNN model with respect to detected-object size and strengthening of the generality and robustness of the model learning dictionary 320.
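One way to realize such an exclusion rule is to measure what fraction of the (shifted) ground-truth box remains inside the frame and drop offsets below a chosen ratio. The helper below is an illustrative assumption, not the patent's implementation.

```python
def visible_fraction(box, frame_w, frame_h):
    """Fraction of a ground-truth box (x1, y1, x2, y2) that stays inside a
    frame_w x frame_h frame after shifting."""
    x1, y1, x2, y2 = box
    vis_w = max(0, min(x2, frame_w) - max(x1, 0))
    vis_h = max(0, min(y2, frame_h) - max(y1, 0))
    area = (x2 - x1) * (y2 - y1)
    return (vis_w * vis_h) / area if area > 0 else 0.0

# e.g. exclude an offset when more than 20% of the person is cut off:
# if visible_fraction(shifted_box, 128, 128) < 0.8: continue
```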
Next, several examples are given of variations in the processing applied by the model preprocessing means 200 under the various processing parameters 510 and in the verification methods of the robustness verification means 500.
According to one embodiment, when processing the multiple model input images 210 to be input to the object detection model 300, the model preprocessing means 200 may further be characterized by using, as part of the various processing parameters 510, P (an arbitrary integer) kinds of contrast correction curves or tone conversion curves to generate images whose brightness levels are changed to arbitrary values. After the tone change, it may comprise the position shift function 220 that shifts each image in S (an arbitrary decimal) pixel steps, N (an arbitrary integer) times horizontally and M (an arbitrary integer) times vertically, generating a total of N × M × P tone-converted and position-shifted model input images 210. It may also provide a function for cropping an arbitrary region. The brightness-level change using a contrast correction curve or tone conversion curve may be a function realized by execution on the image processing processor 290.
As an example, FIG. 13 shows the generation of three kinds of tone-converted images from a reference brightness-level image taken in ordinary sunny daytime conditions: a reference brightness-level image 262 obtained by applying a tone conversion curve 265 (P = 2) that preserves that state; a low-brightness image 261 processed by applying a tone conversion curve 264 (P = 1) that simulates low-illuminance rainy or cloudy weather, dawn, dusk, or nighttime hours, or crushed blacks; and a high-brightness image 263 processed by applying a tone conversion curve 266 (P = 3) that simulates high-illuminance clear weather, backlighting, blown-out highlights, or a studio lit with strong lights. For each of the images 261, 262, and 263, N × M position-shifted images may then be generated in S-pixel steps as shown in FIG. 8, processing a total of 3 × N × M model input images 210.
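Since the actual curves 264 to 266 are given only graphically, a gamma look-up table is one plausible stand-in for generating the three brightness variants (gamma above 1 darkens toward the low-illuminance image 261, gamma below 1 brightens toward the high-illuminance image 263); the specific gamma values here are assumptions.

```python
import cv2
import numpy as np

def apply_tone_curve(img, gamma):
    """Apply a tone conversion curve realized as a 256-entry gamma LUT."""
    lut = ((np.arange(256) / 255.0) ** gamma * 255.0).astype(np.uint8)
    return cv2.LUT(img, lut)

img = np.random.randint(0, 256, (128, 128), np.uint8)   # placeholder image
low, ref, high = (apply_tone_curve(img, g) for g in (2.2, 1.0, 0.45))  # P = 3
```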
For the multiple model input images 210 processed by the position shift function 220 and tone conversion function 260 of the model preprocessing means 200 as shown in FIGS. 8 and 13, the object detection model 300 and model postprocessing means 400 shown in FIG. 1 compute the position information 401 including the second detection frame and the second likelihood information 402 per image, and the results are then input to the robustness verification means 500, which verifies the generality and robustness of the object detection model 300 on the basis of the various processing parameters 510. For each of the P (an arbitrary integer) contrast correction curves or tone conversion curves among the processing parameters 510, the probability/statistics computation means 520 described with FIGS. 10 and 11 may compute the likelihood distribution 540 showing the variation accompanying the position shift of the single person, the average likelihood 501 (the mean over the valid likelihood region), the likelihood histogram 550, the likelihood standard deviation 502 (the standard deviation over the valid region), the maximum likelihood 503 (the maximum over the valid region), and the minimum likelihood 504 (the minimum over the valid region).
Furthermore, when correct-answer position information 621 including the detection frame and correct-answer class identification information 622 are available, the IOU value 505 may also be computed.
Furthermore, for each of the P (an arbitrary integer) contrast correction curves or tone conversion curves, the distribution, histogram, standard deviation, maximum, and minimum of the IOU values over each detected object's position, and of the class-identification accuracy, may be computed.
Furthermore, the learning-reinforcement item extraction means 530 of the robustness verification means 500 described above may be provided.
By equipping the model preprocessing means 200 with the tone conversion function 260, the DNN model can be improved, and the generality and robustness of the model learning dictionary strengthened, with respect to the brightness levels of detected objects and backgrounds that change with weather conditions, time of day, and the illuminance of the shooting environment.
According to one embodiment, when processing the multiple model input images 210 to be input to the object detection model 300, the model preprocessing means 200 may further be characterized by using, as part of the various processing parameters 510, Q (an arbitrary integer) kinds of aspect ratios to generate images with changed aspect ratios. After the aspect-ratio change, it may comprise the position shift function 220 that shifts each image in S (an arbitrary decimal) pixel steps, N (an arbitrary integer) times horizontally and M (an arbitrary integer) times vertically, generating a total of N × M × Q aspect-ratio-changed and position-shifted model input images 210. It may also provide a function for cropping an arbitrary region. The aspect-ratio change with the Q kinds of ratios may be a function realized by executing the affine transformation function 291 or the projective transformation function 292 on the image processing processor 290.
As an example, FIG. 14 shows the generation of three aspect-ratio variants from a single person in the reference model input image 252 (Q = 2): a model input image 251 (Q = 1) reduced by 30% in the vertical direction to an aspect ratio simulating a child of a certain age or a plump person, and a model input image 253 (Q = 3) reduced by 30% in the horizontal direction to an aspect ratio simulating a slender person. For each of the images 251, 252, and 253, N × M position-shifted images may then be generated in S-pixel steps as shown in FIG. 8, processing a total of 3 × N × M model input images 210.
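An aspect-ratio variant can be produced with the affine transformation function mentioned above. The minimal sketch below uses OpenCV's warpAffine; the 0.7 factors mirror the 30% reductions of FIG. 14, and the mean-luminance border fill is an assumption.

```python
import cv2
import numpy as np

def change_aspect(img, sx, sy):
    """Scale the image by sx horizontally and sy vertically (an affine
    transform), keeping the canvas size fixed."""
    h, w = img.shape[:2]
    m = np.float32([[sx, 0, 0], [0, sy, 0]])
    return cv2.warpAffine(img, m, (w, h), borderValue=int(img.mean()))

img = np.random.randint(0, 256, (128, 128), np.uint8)  # placeholder image
narrow = change_aspect(img, 0.7, 1.0)  # 30% horizontal reduction (slender build)
short  = change_aspect(img, 1.0, 0.7)  # 30% vertical reduction (child / plump build)
```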
For the multiple model input images 210 processed by the position shift function 220 and aspect ratio change function 250 of the model preprocessing means 200 as shown in FIGS. 8 and 14, the object detection model 300 and model postprocessing means 400 shown in FIG. 1 compute the position information 401 including the second detection frame and the second likelihood information 402 per image, and the results are then input to the robustness verification means 500, which verifies the generality and robustness of the object detection model 300 on the basis of the various processing parameters 510. For each of the Q (an arbitrary integer) aspect ratios, the probability/statistics computation means 520 described with FIGS. 10 and 11 may compute the likelihood distribution 540 showing the variation accompanying the position shift of the single person, the average likelihood 501 (the mean over the valid likelihood region), the likelihood histogram 550, the likelihood standard deviation 502 (the standard deviation over the valid region), the maximum likelihood 503 (the maximum over the valid region), and the minimum likelihood 504 (the minimum over the valid region).
Furthermore, when correct-answer position information 621 including the detection frame and correct-answer class identification information 622 are available, the IOU value 505 may also be computed.
Furthermore, for each of the Q (an arbitrary integer) aspect ratios, the distribution, histogram, standard deviation, maximum, and minimum of the IOU values over each detected object's position, and of the class-identification accuracy, may be computed.
Furthermore, the learning-reinforcement item extraction means 530 of the robustness verification means 500 described above may be provided.
By equipping the model preprocessing means 200 with the aspect ratio change function 250, the DNN model can be improved, and the generality and robustness of the model learning dictionary strengthened, with respect to the various aspect ratios of detected objects.
According to one embodiment, when processing the multiple model input images 210 to be input to the object detection model 300, the model preprocessing means 200 may further be characterized by using, as part of the various processing parameters 510, R (an arbitrary integer) kinds of angles to generate images with changed rotation angles. After the rotation change, it may comprise the position shift function 220 that shifts each image in S (an arbitrary decimal) pixel steps, N (an arbitrary integer) times horizontally and M (an arbitrary integer) times vertically, generating a total of N × M × R rotation-changed and position-shifted model input images 210. It may also provide a function for cropping an arbitrary region. The rotation change with the R kinds of angles may be a function realized by executing the affine transformation function 291 or the projective transformation function 292 on the image processing processor 290.
As an example, FIG. 15 shows the generation of three rotation variants from a single person in the model input image 242 (R = 2) at the reference angle: a model input image 241 (R = 1) rotated 45° to the left to simulate a difference in camera mounting position or in the person's pose, and a model input image 243 (R = 3) rotated 45° to the right to simulate the same. For each of the images 241, 242, and 243, N × M position-shifted images may then be generated in S-pixel steps as shown in FIG. 8, processing a total of 3 × N × M model input images 210.
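Rotation about the image center can be sketched the same way. In OpenCV a positive angle is counterclockwise, so +45° corresponds to the 45°-left image 241 and -45° to the 45°-right image 243; the mean-luminance border fill is again an assumption.

```python
import cv2
import numpy as np

def rotate(img, angle_deg):
    """Rotate the image about its center, keeping the canvas size fixed."""
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    return cv2.warpAffine(img, m, (w, h), borderValue=int(img.mean()))

img = np.random.randint(0, 256, (128, 128), np.uint8)  # placeholder image
left45, base, right45 = rotate(img, 45), img, rotate(img, -45)  # R = 3 angles
```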
For the multiple model input images 210 processed by the position shift function 220 and rotation function 240 of the model preprocessing means 200 as shown in FIGS. 8 and 15, the object detection model 300 and model postprocessing means 400 shown in FIG. 1 compute the position information 401 including the second detection frame and the second likelihood information 402 per image, and the results are then input to the robustness verification means 500, which verifies the generality and robustness of the object detection model 300 on the basis of the various processing parameters 510. For each of the R (an arbitrary integer) kinds of angles, the probability/statistics computation means 520 described with FIGS. 10 and 11 may compute the likelihood distribution 540 showing the variation accompanying the position shift of the single person, the average likelihood 501 (the mean over the valid likelihood region), the likelihood histogram 550, the likelihood standard deviation 502 (the standard deviation over the valid region), the maximum likelihood 503 (the maximum over the valid region), and the minimum likelihood 504 (the minimum over the valid region).
Furthermore, when correct-answer position information 621 including the detection frame and correct-answer class identification information 622 are available, the IOU value 505 may also be computed.
Furthermore, for each of the R (an arbitrary integer) kinds of angles, the distribution, histogram, standard deviation, maximum, and minimum of the IOU values over each detected object's position, and of the class-identification accuracy, may be computed.
Furthermore, the learning-reinforcement item extraction means 530 of the robustness verification means 500 described above may be provided.
By equipping the model preprocessing means 200 with the rotation function 240, the DNN model can be improved, and the generality and robustness of the model learning dictionary strengthened, with respect to the various rotation angles of detected objects.
According to one embodiment, when processing the multiple model input images 210 to be input to the object detection model 300, the model preprocessing means 200 may comprise a margin padding function 280 that computes the average brightness level of the valid image and pastes that level uniformly into the margin areas, indicated by 281 through 288 in FIGS. 8, 9, 14, and 15, where no valid image exists after position shifting, resizing, aspect-ratio change, or rotation. Alternatively, the margins may be interpolated from the valid image area present in the output image of the image processing means 100, or filled with imagery that does not affect learning or inference.
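A sketch of the margin padding itself, assuming a boolean mask marking where valid image content remains after shifting, resizing, aspect-ratio change, or rotation; the interface is illustrative.

```python
import numpy as np

def pad_margins(img, valid_mask):
    """Fill pixels outside the valid image area with the mean luminance of
    the valid pixels, so the blank margins add no spurious features."""
    out = img.copy()
    out[~valid_mask] = int(img[valid_mask].mean())
    return out
```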
By equipping the model preprocessing means 200 with the margin padding function 280, the influence of features originating in the margins on the inference accuracy of the object detection model 300 can be reduced, so the likelihood distribution 540 over each detected object's position, the average likelihood 501 over the valid likelihood region, the likelihood histogram 550, the likelihood standard deviation 502, the maximum likelihood 503, the minimum likelihood 504, and the IOU value 505 can be computed more accurately. The distribution, histogram, standard deviation, maximum, and minimum of the IOU values and of the class-identification accuracy over each detected object's position can likewise be confirmed more accurately. This allows the DNN model to be improved, and the generality and robustness of the model learning dictionary to be strengthened, more accurately.
According to one embodiment, the various processing parameters 510 used by the model preprocessing means 200 may further drive multiple combined manipulations that intertwine the resize function 230, rotation function 240, aspect ratio change function 250, and tone conversion function 260 with the position shift function 220 described above, and the probability/statistics computation means 520 and learning-reinforcement item extraction means 530 of the robustness verification means 500 may be used as a method of analyzing the interdependence of those processing parameters 510. Furthermore, although omitted from the description of Embodiment 1, processing with the dewarp function 270, which performs distortion correction using the distortion correction table 293 when a fisheye lens is used, cylindrical transformation, and the like, makes it possible to improve the DNN model with respect to various distortions of detected objects and backgrounds and to strengthen the generality and robustness of the model learning dictionary.
By using the performance indexing device 10 for object detection of these embodiments, and by repeatedly performing deep learning with the dictionary learning means 600 described later in Embodiment 2 in the direction of improving the object detection model 300 and of resolving and reinforcing the issues verified in the detection performance, detection accuracy, and generality and robustness (variation, imperfection, and so on) of the object detection model 300 and model learning dictionary 320, object detection can be realized that has higher detection capability and high generality and robustness under various varying conditions.
(Embodiment 2)
FIG. 16 is a block diagram showing a performance indexing device 20 for object detection in images according to Embodiment 2 of the present invention. The image processing means 100, image output control means 110, display and data storage means 120, model preprocessing means 200, object detection model 300, model postprocessing means 400, and robustness verification means 500, together with their constituent means, functions, processes, and steps, and the devices, methods, and programs that realize them, are the same as in Embodiment 1, and their description is therefore omitted in the text of Embodiment 2. The means, functions, processes, steps, devices, methods, and programs of the other embodiments described in Embodiment 1 may also be used in the implementation.
Note that each means, function, and process described below in Embodiment 2 of the present invention may be read as a step, and each device as a method. Each means and device described in Embodiment 2 may also be realized as a program run on a computer.
First, an example of the dictionary learning means 600, the deep learning used to create the model learning dictionary 320 that is one of the components of the object detection model 300, is described.
To begin, learning material data considered appropriate for the intended use is extracted from the learning material database storage means 610, where material data (image data) for deep learning is stored. The stored material may draw on large open-source datasets such as COCO (Common Objects in Context) or the Pascal VOC Dataset. Images required for a particular application may also be taken, for example, from image data output by the image processing means 100 and stored in the display and data storage means 120 via the image output control means 110.
Next, the annotation means 620 adds class identification information and the ground-truth BBox, the correct-answer frame, to the learning material data extracted from the learning material database storage means 610, creating supervised data. For open-source datasets such as COCO and the Pascal VOC Dataset that have already been annotated, the data may be used directly as supervised data without going through the annotation means 620.
Next, the supervised data is padded out by the Augment means 630 into training images 631 in order to strengthen generality and robustness. The Augment means 630 comprises, for example, means for shifting an image to arbitrary horizontal and vertical positions, means for enlarging or reducing it at arbitrary magnifications, means for rotating it to arbitrary angles, means for changing its aspect ratio, and dewarp means for distortion correction, cylindrical transformation, and the like; these are combined according to the intended use to multiply the images, as sketched below.
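A minimal sketch of such combined augmentation, assuming OpenCV and illustrative parameter ranges; a real Augment means would also transform the ground-truth BBox coordinates along with each image.

```python
import random
import cv2
import numpy as np

def augment(img):
    """Randomly combine rotation, shift, and per-axis scaling (which also
    changes the aspect ratio) to pad out one training image."""
    h, w = img.shape[:2]
    fill = int(img.mean())
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), random.uniform(-45, 45), 1.0)
    out = cv2.warpAffine(img, rot, (w, h), borderValue=fill)
    scale_shift = np.float32(
        [[random.uniform(0.7, 1.3), 0, random.uniform(-16, 16)],
         [0, random.uniform(0.7, 1.3), random.uniform(-16, 16)]])
    return cv2.warpAffine(out, scale_shift, (w, h), borderValue=fill)
```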
Next, the training images 631 padded out by the Augment means 630 are input to the deep learning means 640, the weight coefficients of the DNN model 310 are computed, and the computed weights are converted, for example, into the ONNX format to create the model learning dictionary 320; a format other than ONNX may also be used. When YOLO is applied as the DNN model 310, for example, the deep learning means 640 is realized with the open-source training environment called darknet and a computation processor (including personal computers and supercomputers). darknet has training parameters called hyperparameters, and setting them appropriately for the application, together with the Augment means 630, can also strengthen generality and robustness. Reflecting the model learning dictionary 320 created by the deep learning means 640 in the object detection model 300 makes it possible to detect object positions and identify classes within images. The deep learning means 640 may also be configured as an electronic circuit, and a training environment written in a programming language may be used according to the DNN model 310 being applied.
Next, an example of the performance indexing device 20 for object detection is described, for analyzing the robustness of, and the reinforcement policy for, the model learning dictionary 320 of a model that detects object positions and identifies classes in images.
From the learning material database storage means 610 described above, validation material data is extracted for verifying the detection accuracy, detection performance, generality, and robustness required for the intended use. The validation image data stored there may draw on large open-source validation image datasets such as COCO or the Pascal VOC Dataset. Images for this verification may also be taken, for example, from image data output by the image processing means 100 and stored in the display and data storage means 120 via the image output control means 110.
Next, the annotation means 620 adds class identification information and the ground-truth BBox, the correct-answer frame, to the validation material data extracted from the learning material database storage means 610, creating the validation data 623. As before, already-annotated open-source datasets such as COCO and the Pascal VOC Dataset may be used directly as validation data 623 without going through the annotation means 620.
Next, the validation data 623 is input to a second mAP computation means 650, which has inference (prediction) capability equivalent to the object detection model 300 and comprises the model postprocessing means 400 with the individual identification means 410 described in Embodiment 1. It may compute: the IOU value 653, obtained by comparing the ground-truth BBox (the correct-answer frame) with the Predicted BBox produced as the inference (prediction) result; Precision 654, the proportion of all predictions over all validation data 623 that were correctly predicted with an IOU value 653 at or above an arbitrary threshold; Recall 655, the proportion of actual correct answers for which a BBox close to the correct result was predicted with an IOU value 653 at or above the threshold; the per-class AP (Average Precision) value 651 as an index for comparing the object detection accuracy and performance described above; and the mAP (mean Average Precision) value 652 averaged over all classes (see, for example, Non-Patent Document 2). When YOLO is applied as the DNN model 310, for example, the second mAP computation means 650 is realized with the open-source darknet inference environment and a computation processor (including personal computers and supercomputers), and desirably has inference (prediction) performance equivalent to that of the object detection model 300. It may further comprise means for computing the IOU value 653, Precision 654, Recall 655, AP value 651, and mAP value 652 from the position information 401 including the second detection frame and the second likelihood information 402 output by the model postprocessing means 400 with the individual identification means 410 described in Embodiment 1.
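A simplified sketch of how Precision 654, Recall 655, and a per-class AP value 651 can be derived once each prediction has been matched against the ground-truth BBoxes by IOU threshold; the matching step itself and the exact AP interpolation used are not specified by the patent, so this trapezoidal form is an assumption.

```python
import numpy as np

def precision_recall_ap(matches, num_gt):
    """matches: list of (confidence, is_true_positive) over all predictions,
    where a prediction is a true positive when its IOU with a ground-truth
    BBox clears the chosen threshold. num_gt: number of ground-truth boxes.
    Returns the precision and recall curves and the AP for one class."""
    matches = sorted(matches, key=lambda m: -m[0])      # high confidence first
    flags = [1 if ok else 0 for _, ok in matches]
    tp = np.cumsum(flags)
    fp = np.cumsum([1 - f for f in flags])
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / max(num_gt, 1)
    ap = float(np.trapz(precision, recall))             # area under the PR curve
    return precision, recall, ap

# mAP value 652 then follows as the mean of the per-class AP values.
```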
This series of means, in which the image processing means 100, model preprocessing means 200, object detection model 300, model postprocessing means 400, and robustness verification means 500 described in Embodiment 1, together with the learning material database storage means 610, annotation means 620, and second mAP computation means 650, generate the IOU value 653, Precision 654, Recall 655, AP value 651, and mAP value 652, constitutes the performance indexing device 20 for object detection of the present invention, for analyzing the robustness of, and reinforcement policy for, the model learning dictionary of a model that detects object positions and identifies classes in images.
According to one embodiment, providing the second mAP computation means of Embodiment 2 with the individual identification means 410 of the model postprocessing means 400 described with FIGS. 6A and 6B of Embodiment 1 allows abnormal data to be excluded and the position information including the detection frame and the likelihood information to be corrected to optimal values per detected object, so that the likelihood distribution 540 over each detected object's position, the average likelihood 501, the likelihood histogram 550, the likelihood standard deviation 502, the maximum likelihood 503, the minimum likelihood 504, and the IOU value 505 can be computed accurately by comparison with the correct-answer data. The DNN model can thus be improved, and the generality and robustness of the model learning dictionary strengthened, more accurately. Consequently, the IOU value 653, Precision 654, Recall 655, AP value 651, and mAP value 652, which index the overall and average inference accuracy and performance on the validation data, can be computed more accurately, improving the precision of indexing for the object detection model 300 and model learning dictionary 320 as a whole.
According to one embodiment, the robustness verification means 500 has learning-reinforcement item extraction means 530 comprising any or all of, per processing parameter 510: extraction of positions or regions in each detected object's likelihood distribution at or below an arbitrary threshold; extraction of detected objects whose average likelihood 501 is at or below an arbitrary threshold; extraction of detected objects whose likelihood standard deviation 502 is at or above an arbitrary threshold; extraction of detected objects whose maximum likelihood 503 is at or below an arbitrary threshold; and extraction of detected objects whose minimum likelihood 504 is at or below an arbitrary threshold. Weaknesses in generality and in robustness against varying conditions that originate in the model learning dictionary 320 created by deep learning, together with the reinforcement policy, can thereby be isolated from the latent problems of the neural network itself, including the DNN model, and grasped more accurately. Effective training image data and supervised data can therefore be applied in deep learning and the like, strengthening the generality and robustness of the model learning dictionary 320.
 Furthermore, by providing the learning reinforcement item extraction means 530 with extraction of detected objects by arbitrary thresholds applied to the distribution, histogram, standard deviation, maximum value, and minimum value of the IOU value for the position of each detected object, and likewise of the class identification accuracy rate, the versatility and robustness of the model learning dictionary 320 can be strengthened further.
 According to an embodiment, when the model learning dictionary 320 is judged to have insufficient performance as a result of analysis based on the likelihood distribution 540, the average likelihood 501, the likelihood histogram 550, the standard deviation of the likelihood 502, the maximum likelihood 503, the minimum likelihood 504, the IOU value 505, and the like calculated by the probability statistical calculation means 520, learning images may be prepared based on the results of the learning reinforcement item extraction means 530 and the dictionary may be retrained by an internal or external dictionary learning means 600. By retraining the model learning dictionary 320, even when position shifts within an arbitrary range near the detected object are combined with the other various processing parameters 510 (the position of the object in the screen such as left, right, top, bottom, and depth; object size; contrast; gradation; aspect ratio; rotation; and so on), it becomes possible to accurately grasp the weaknesses in versatility and robustness against various fluctuating conditions that originate in the model learning dictionary 320 created by deep learning or the like, together with the corresponding reinforcement policies, separating them from the problems latent in the neural network itself, including the DNN model. Accordingly, more effective learning image data and supervised data can be applied in deep learning and the like, making it possible to strengthen the versatility and robustness of the model learning dictionary 320.
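 The retraining flow described above can be pictured as the following minimal sketch; run_indexing, prepare_training_images, and deep_learn are illustrative placeholders rather than means defined in this disclosure, and the mAP stopping criterion is an assumed example of a sufficiency judgment.

    def reinforce_dictionary(model, dictionary, validation_images, target_map=0.8):
        while True:
            stats, m_ap = run_indexing(model, dictionary, validation_images)
            if m_ap >= target_map:
                return dictionary                        # performance judged sufficient
            weak = extract_items_needing_reinforcement(stats)
            images = prepare_training_images(weak)       # targeted learning images
            dictionary = deep_learn(model, dictionary, images)  # dictionary learning means 600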
 Furthermore, by providing the probability statistical calculation means 520 that calculates the distribution, histogram, standard deviation, maximum value, and minimum value of the IOU value for the position of each detected object, and likewise of the class identification accuracy rate, the versatility and robustness of the model learning dictionary 320 can be strengthened further.
 By using the performance indexing device 20 for object detection according to these embodiments, and based on the results of verifying the detection performance and detection accuracy of the object detection model 300 and the model learning dictionary 320, as well as their versatility and robustness issues such as variation and imperfection, the object detection model 300 can be improved and the dictionary learning means 600 can repeatedly perform deep learning in the direction of solving and reinforcing those issues. This makes it possible to realize object detection with higher detection capability and with high versatility and robustness against various fluctuating conditions.
 (Summary)
 FIG. 18 is a diagram summarizing the performance indexing device for an object detection model of the present invention. As shown in FIG. 18, the performance indexing device, method, and program for an object detection model of the present invention operate as follows. A model preprocessing means processes the image data generated by an image processing means, which acquires an image including a detection target and processes it appropriately, into a plurality of images using various processing parameters such as resizing and position shifting, and inputs the plurality of processed images to an object detection model equipped with a trained model learning dictionary. From the inference information thus obtained, a model post-processing means calculates the position information of the detection frame and the likelihood information for each detected object, after which a robustness verification means produces performance indices, such as the average likelihood and the standard deviation of the likelihood with respect to object position fluctuation, for each of the various processing parameters. Furthermore, based on the results of this performance indexing, a dictionary learning means reinforces the robustness of the model learning dictionary.
 Although the performance indexing device and the like according to one or more aspects have been described above based on the embodiment, the present invention is not limited to this embodiment. Forms obtained by applying various modifications conceivable to those skilled in the art to this embodiment, and forms constructed by combining components of different embodiments, may also be included within the scope of one or more aspects, as long as they do not depart from the spirit of the present disclosure.
 For example, in the above embodiment, each component may be configured with dedicated hardware, or may be realized by executing a software program suitable for the component. Each component may be realized by a program execution unit, such as a CPU or a processor, reading and executing a software program recorded on a recording medium such as a hard disk or a semiconductor memory. Here, the software that realizes the performance indexing device and the like of the above embodiment is the following program.
 That is, this program is a program that causes a computer to execute the performance indexing method.
 The present invention is useful in technical fields where an object detection model is used to identify the position and class of an object in an image. Among these, it is particularly useful in technical fields aiming to reduce the size, power consumption, and cost of cameras and other devices for object detection.
10, 20 Performance indexing device
30 First performance indexing device
40 Second performance indexing device
100 Image processing means
101 Lens
102 Image sensor
103, 290 Image processing processor
110 Image output control means
120 Display and data storage means
200 Model preprocessing means
201, 202, 203, 204, 205, 206, 210, 221, 222, 223, 224, 231, 232, 233, 241, 242, 243, 251, 252, 253, 261, 262, 263, 311, 440, 470, 526 Model input image
207, 208, 209, 211, 212, 213, 401, 451, 452, 490, 491 Position information including the second detection frame
214, 215, 216, 217, 218, 219, 453, 454, 492, 493 Likelihood in the second likelihood information
220 Position shift function
230 Resize function
240 Rotation function
250 Aspect ratio change function
260 Gradation conversion function
264, 265, 266 Gradation conversion curve
270 Dewarp function
280 Margin padding function
281, 282, 283, 284, 285, 286, 287, 288 Margin portion
291 Affine transformation function
292 Projective transformation function
293 Distortion correction table
300 Object detection model
301, 441, 442, 443, 444, 471, 472, 473, 474 Position information including the first detection frame
302 First likelihood information
310 DNN model
312 Step of estimating multiple bounding boxes (BBoxes) and confidences
313 Confidence
314 Step of calculating conditional class probabilities
315 Conditional class probability
316 Final detection step
317 Confidence score
318 Detection frame of the position information including the first detection frame
320 Model learning dictionary
330 Artificial neuron model
340 Neural network
350 Activation function
351 Sigmoid function
352 ReLU
353 Leaky ReLU
360 YOLO model
361 First detection layer
362 Second detection layer
363 Third detection layer
364, 365 Upsampling layer
366, 367 Skip connection
370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387 Convolution layer
390, 391, 392, 393, 394, 395 Pooling layer
400 Model post-processing means
402 Second likelihood information
403 Detection result
410 Individual identification means
420, 427, 494, 495, 505, 653 IOU value
422 Area of Union
423 Area of Intersection
424 Person
425 Ground truth BBox
426 Predicted BBox
445, 446, 447, 448, 475, 476, 477, 478 Likelihood in the first likelihood information
480, 481, 621 Position information including the correct detection frame
482, 483, 622 Correct class identification information
500 Robustness verification means
501 Average likelihood
502 Standard deviation of likelihood
503 Maximum likelihood
504 Minimum likelihood
510, 511, 512, 513 Various processing parameters
520 Probability statistical calculation means
521 White-to-black gradation bar
522, 523, 524, 525 Likelihood
527, 528 Region
530 Learning reinforcement item extraction means
531 Extraction information
540, 541, 542, 543, 544 Likelihood distribution
550, 551, 552, 553 Likelihood histogram
561, 562, 563 Statistical result
571, 572, 573 Likelihood of conventional method
600 Dictionary learning means
610 Learning material database storage means
620 Annotation means
623 Validation data
630 Augmentation means
631 Learning image
640 Deep learning means
650 Second mAP calculation means
651 AP (Average Precision) value
652 mAP (mean Average Precision) value
654 Precision
655 Recall
660 First mAP calculation means
S430, S460 Input step
S431 Setting step
S432, S435, S462 Comparison step
S433 Deletion step
S434 Mutual IOU value calculation step
S436 Maximum likelihood determination step
S437, S464 Output step
S461 IOU value calculation step with the correct frame
S463 Class identification determination step

Claims (19)

  1.  A performance indexing device for an object detection model, comprising:
      an image processing means for acquiring an image and processing it appropriately;
      a model preprocessing means for processing the image acquired by the image processing means into a plurality of images according to various processing parameters;
      an object detection model including a model learning dictionary that infers object positions and likelihoods from the input of the plurality of images processed by the model preprocessing means;
      a model post-processing means for correcting, based on the inference results of the object detection model, position information including a first detection frame and first likelihood information for each detected object in the plurality of images into position information including a second detection frame and second likelihood information having appropriate values; and
      a robustness verification means for verifying the robustness of the object detection model based on the position information including the second detection frame and the second likelihood information, which are the output results of the model post-processing means, and the various processing parameters.
  2.  The performance indexing device according to claim 1, wherein,
      when processing the plurality of images to be input to the object detection model, the model preprocessing means uses, as the various processing parameters, position shifts in S-pixel steps (S is an arbitrary decimal number), N times in the horizontal direction (N is an arbitrary integer) and M times in the vertical direction (M is an arbitrary integer), to generate a total of N×M position-shifted images.
  3.  The performance indexing device according to claim 1, wherein,
      when processing the plurality of images to be input to the object detection model, the model preprocessing means first generates images enlarged or reduced using L arbitrary magnifications (L is an arbitrary integer) as the various processing parameters, and then applies position shifts in S-pixel steps (S is an arbitrary decimal number), N times in the horizontal direction (N is an arbitrary integer) and M times in the vertical direction (M is an arbitrary integer), to generate a total of N×M×L position-shifted images.
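 For illustration only, a minimal NumPy sketch of the N×M×L sweep of claims 2 and 3, assuming integer pixel shifts (a fractional step S would require interpolation) and nearest-neighbour resizing; the function and parameter names are not taken from the claims.

    import numpy as np

    def generate_shifted_images(image, scales, n_shifts, m_shifts, step):
        # image: HxWxC array; scales: L magnifications; returns N*M*L variants
        variants = []
        h, w = image.shape[:2]
        for s in scales:
            ys = (np.arange(int(h * s)) / s).astype(int).clip(0, h - 1)
            xs = (np.arange(int(w * s)) / s).astype(int).clip(0, w - 1)
            scaled = image[ys][:, xs]                # nearest-neighbour resize
            for i in range(n_shifts):
                for j in range(m_shifts):
                    dx, dy = int(i * step), int(j * step)
                    shifted = np.zeros_like(scaled)  # margin left at zero here; claim 7 fills it with the mean luminance
                    shifted[dy:, dx:] = scaled[:scaled.shape[0] - dy,
                                               :scaled.shape[1] - dx]
                    variants.append((s, dx, dy, shifted))
        return variants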
  4.  The performance indexing device according to claim 1, wherein,
      when processing the plurality of images to be input to the object detection model, the model preprocessing means generates images whose luminance levels are changed to arbitrary values using P contrast correction curves or gradation conversion curves (P is an arbitrary integer) as the various processing parameters.
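 For illustration only, a minimal sketch of claim 4's luminance-level conversion, assuming 8-bit images and a family of gamma curves as the P conversion curves (an illustrative choice of curve family; `img` stands for any 8-bit input image).

    import numpy as np

    def apply_gamma(image, gamma):
        # 256-entry look-up table implementing one gradation conversion curve
        lut = ((np.arange(256) / 255.0) ** gamma * 255.0).astype(np.uint8)
        return lut[image]

    # P = 5 luminance variants of an 8-bit image `img`
    # variants = [apply_gamma(img, g) for g in (0.5, 0.8, 1.0, 1.25, 2.0)]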
  5.  The performance indexing device according to claim 1, wherein,
      when processing the plurality of images to be input to the object detection model, the model preprocessing means generates images whose aspect ratios are changed using Q aspect ratios (Q is an arbitrary integer) as the various processing parameters.
  6.  The performance indexing device according to claim 1, wherein,
      when processing the plurality of images to be input to the object detection model, the model preprocessing means generates images whose rotation angles are changed using R angles (R is an arbitrary integer) as the various processing parameters.
  7.  The performance indexing device according to claim 1, wherein,
      when processing the plurality of images to be input to the object detection model, the model preprocessing means generates images by filling the margin areas where no valid image exists as a result of the processing with the average luminance level of the valid image.
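 For illustration only, a minimal sketch of claim 7's margin padding, assuming a grayscale image and a boolean mask marking the valid pixels left by the geometric processing.

    import numpy as np

    def pad_with_mean(processed, valid_mask):
        # processed: HxW image; valid_mask: HxW bool, True where pixels are valid
        out = processed.copy()
        out[~valid_mask] = processed[valid_mask].mean()  # average-luminance fill
        return out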
  8.  The performance indexing device according to claim 1, wherein
      the model post-processing means has an individual identification means that corrects, for each of the one or more detected objects in the output results of the object detection model for one of the plurality of images, the zero or more sets of position information including the first detection frame and the first likelihood information, including undetected and falsely detected cases, into the maximum-likelihood position information including the second detection frame and the second likelihood information for each detected object, using an arbitrary threshold T (T is an arbitrary decimal number) for the first likelihood information and an arbitrary threshold U (U is an arbitrary decimal number) for the IOU (Intersection over Union) value, which is an index of how much the regions of the position information including the first detection frames overlap one another.
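 For illustration only, a minimal NMS-style reading of the individual identification of claim 8, reusing the iou helper sketched earlier; the values of the thresholds T and U are illustrative.

    def identify_individuals(detections, t=0.25, u=0.5):
        # detections: [(likelihood, box)]; returns the corrected per-object list
        kept = []
        for like, box in sorted(detections, key=lambda d: d[0], reverse=True):
            if like <= t:
                continue                  # threshold T on the first likelihood information
            if all(iou(box, kept_box) < u for _, kept_box in kept):
                kept.append((like, box))  # the maximum-likelihood representative survives
        return kept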
  9.  The performance indexing device according to claim 1, wherein
      the model post-processing means has a function of correcting, when position information including a correct detection frame and class identification information exist for each detected object, the position information including the correct detection frame according to the contents of the various processing parameters, and
      has an individual identification means that corrects, for each of the one or more detected objects in the output results of the object detection model for one of the plurality of images, the zero or more sets of position information including the first detection frame and the first likelihood information, including undetected and falsely detected cases, into the maximum-likelihood position information including the second detection frame and the second likelihood information for each detected object, using an arbitrary threshold T (T is an arbitrary decimal number) for the first likelihood information and an arbitrary threshold U (U is an arbitrary decimal number) for the IOU (Intersection over Union) value, which is an index of how much the region of the position information including the correct detection frame and the region of the position information including the first detection frame overlap.
  10.  The performance indexing device according to claim 8, wherein
      the model post-processing means individually associates the various processing parameters used by the model preprocessing means to process the plurality of images with the output results of the individual identification means for each detected object, and outputs them to the robustness verification means.
  11.  The performance indexing device according to any one of claims 2 to 10, wherein
      the model preprocessing means performs at least one of (i) and (ii):
      in (i), when processing the plurality of images to be input to the object detection model, the model preprocessing means uses, as the various processing parameters, position shifts in S-pixel steps (S is an arbitrary decimal number), N times in the horizontal direction (N is an arbitrary integer) and M times in the vertical direction (M is an arbitrary integer), to generate a total of N×M position-shifted images;
      in (ii), when processing the plurality of images to be input to the object detection model, the model preprocessing means first generates images enlarged or reduced using L arbitrary magnifications (L is an arbitrary integer) as the various processing parameters, and then applies position shifts in S-pixel steps (S is an arbitrary decimal number), N times in the horizontal direction (N is an arbitrary integer) and M times in the vertical direction (M is an arbitrary integer), to generate a total of N×M×L position-shifted images; and
      the robustness verification means comprises a probability statistical calculation means that calculates, for each of the various processing parameters, based on the position information including the second detection frame and the likelihoods in the second likelihood information, which are the output results of the model post-processing means, at least one of: a likelihood distribution indicating the variation accompanying the position shifts for each detected object; an average likelihood that is the mean of the effective region of the likelihoods; a histogram of the likelihoods; a standard deviation of the likelihoods over their effective region; a maximum likelihood that is the maximum of the effective region of the likelihoods; a minimum likelihood that is the minimum of the effective region of the likelihoods; and an IOU value for the likelihoods.
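 For illustration only, a minimal sketch of the statistics of claim 11 over the likelihoods collected for one detected object across the position-shifted inputs; the bin count and the convention that undetected positions carry zero likelihood are assumptions.

    import numpy as np

    def likelihood_statistics(likelihoods, bins=20):
        arr = np.asarray(likelihoods, dtype=float)
        valid = arr[arr > 0.0]            # effective region: positions with a detection
        if valid.size == 0:
            return None                   # the object was never detected
        hist, edges = np.histogram(valid, bins=bins, range=(0.0, 1.0))
        return {'mean': float(valid.mean()),   # average likelihood 501
                'std': float(valid.std()),     # standard deviation of likelihood 502
                'max': float(valid.max()),     # maximum likelihood 503
                'min': float(valid.min()),     # minimum likelihood 504
                'histogram': hist,             # likelihood histogram 550
                'bin_edges': edges}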
  12.  The performance indexing device according to any one of claims 2 to 10, wherein
      the model preprocessing means performs at least one of (i) and (ii):
      in (i), when processing the plurality of images to be input to the object detection model, the model preprocessing means uses, as the various processing parameters, position shifts in S-pixel steps (S is an arbitrary decimal number), N times in the horizontal direction (N is an arbitrary integer) and M times in the vertical direction (M is an arbitrary integer), to generate a total of N×M position-shifted images;
      in (ii), when processing the plurality of images to be input to the object detection model, the model preprocessing means first generates images enlarged or reduced using L arbitrary magnifications (L is an arbitrary integer) as the various processing parameters, and then applies position shifts in S-pixel steps (S is an arbitrary decimal number), N times in the horizontal direction (N is an arbitrary integer) and M times in the vertical direction (M is an arbitrary integer), to generate a total of N×M×L position-shifted images; and
      the robustness verification means comprises a probability statistical calculation means that, when position information including a correct detection frame and correct class identification information exist for each detected object, calculates, for each of the various processing parameters, based on the IOU value between the position information including the second detection frame output by the model post-processing means and the position information including the correct detection frame, and on the class identification accuracy rate calculated from the class identification information in the second likelihood information and the correct class identification information, at least one of: an IOU distribution and a class identification accuracy rate distribution indicating the variation accompanying the position shifts for each detected object with respect to the IOU value and the class identification accuracy rate; an average IOU value and an average class identification accuracy rate that are the means of their effective regions; a histogram of the IOU value and a histogram of the class identification accuracy rate; a standard deviation of the IOU value and a standard deviation of the class identification accuracy rate over their effective regions; a maximum IOU value and a maximum class identification accuracy rate that are the maxima of their effective regions; and a minimum IOU value and a minimum class identification accuracy rate that are the minima of their effective regions.
  13.  The performance indexing device according to claim 11, wherein
      the robustness verification means has a learning reinforcement item extraction means that performs, for each of the various processing parameters, at least one of: extraction of positions or regions where the likelihood distribution for each detected object falls at or below an arbitrary threshold; extraction of detected objects whose average likelihood falls at or below an arbitrary threshold; extraction of detected objects whose standard deviation of the likelihood is at or above an arbitrary threshold; extraction of detected objects whose maximum likelihood falls at or below an arbitrary threshold; extraction of detected objects whose minimum likelihood falls at or below an arbitrary threshold; and extraction of detected objects whose IOU value falls at or below an arbitrary threshold.
  14.  The performance indexing device according to claim 12, wherein
      the robustness verification means has a learning reinforcement item extraction means that performs, for each of the various processing parameters, at least one of: extraction of positions or regions where the IOU distribution for each detected object falls at or below an arbitrary threshold; extraction of positions or regions where the class identification accuracy rate distribution falls at or below an arbitrary threshold; extraction of detected objects whose average IOU value falls at or below an arbitrary threshold; extraction of detected objects whose average class identification accuracy rate falls at or below an arbitrary threshold; extraction of detected objects whose standard deviation of the IOU value is at or above an arbitrary threshold; extraction of detected objects whose standard deviation of the class identification accuracy rate is at or above an arbitrary threshold; extraction of detected objects whose maximum IOU value falls at or below an arbitrary threshold; extraction of detected objects whose maximum class identification accuracy rate falls at or below an arbitrary threshold; extraction of detected objects whose minimum IOU value falls at or below an arbitrary threshold; and extraction of detected objects whose minimum class identification accuracy rate falls at or below an arbitrary threshold.
  15.  The performance indexing device according to claim 14, wherein
      the probability statistical calculation means and the learning reinforcement item extraction means of the robustness verification means have a function of excluding from the calculation, when performing probability statistical calculations based on the likelihoods, the IOU values, and the class identification accuracy rates, images in which pixels related to the target detected object are missing at an arbitrary rate.
  16.  The performance indexing device according to claim 13, wherein,
      when the model learning dictionary is judged to have insufficient performance as a result of analysis based on the output of the probability statistical calculation means, learning images are prepared based on the results of the learning reinforcement item extraction means, and the model learning dictionary is retrained by an internal or external dictionary learning means.
  17.  The performance indexing device according to claim 1, wherein
      the object detection model is a neural network including a model learning dictionary created by deep learning.
  18.  A performance indexing method comprising:
      an image processing step of acquiring an image and processing it appropriately;
      a model preprocessing step of processing the image acquired in the image processing step into a plurality of images according to various processing parameters;
      an object detection model including a model learning dictionary that infers object positions and likelihoods from the input of the plurality of images processed in the model preprocessing step;
      a model post-processing step of correcting, based on the inference results of the object detection model, position information including a first detection frame and first likelihood information for each detected object in the plurality of images into position information including a second detection frame and second likelihood information having appropriate values; and
      a robustness verification step of verifying the robustness of the object detection model based on the position information including the second detection frame and the second likelihood information, which are the output results of the model post-processing step, and the various processing parameters.
  19.  A program for causing a computer to execute the performance indexing method according to claim 18.
PCT/JP2023/012736 2022-03-31 2023-03-29 Performance indexing device, performance indexing method, and program WO2023190644A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022-059640 2022-03-31
JP2022059640 2022-03-31

Publications (1)

Publication Number Publication Date
WO2023190644A1 true WO2023190644A1 (en) 2023-10-05

Family

ID=88202015

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/012736 WO2023190644A1 (en) 2022-03-31 2023-03-29 Performance indexing device, performance indexing method, and program

Country Status (1)

Country Link
WO (1) WO2023190644A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002208011A (en) * 2001-01-12 2002-07-26 Fujitsu Ltd Image collation processing system and its method
CN113255526A (en) * 2021-05-28 2021-08-13 华中科技大学 Momentum-based confrontation sample generation method and system for crowd counting model
JP2021162892A (en) * 2020-03-30 2021-10-11 株式会社日立製作所 Evaluation device, evaluation method and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23780653

Country of ref document: EP

Kind code of ref document: A1