WO2023190644A1

WO2023190644A1 - Performance indexing device, performance indexing method, and program

Info

Publication number: WO2023190644A1
Application number: PCT/JP2023/012736
Authority: WO
Inventors: 洋一小倉; 晋也松山; 健志緑川; 直大岩橋; 肇片山
Original assignee: ヌヴォトンテクノロジージャパン株式会社
Priority date: 2022-03-31
Filing date: 2023-03-29
Publication date: 2023-10-05

Abstract

This performance indexing device comprises: an image processing means that acquires and processes an image; a model preprocessing means that processes an acquired image into a plurality of images in accordance with various processing parameters; an object detection model that includes a model learning dictionary for inferring an object position and a likelihood with respect to the input of the processed plurality of images; a model post-processing means that, for each detection object of the plurality of images, corrects position information that includes a first detection frame, and first likelihood information, so as to obtain position information that includes a second detection frame, and second likelihood information, on the basis of an inference result from the object detection model; and a robustness verification means for verifying the robustness of the object detection model, on the basis of the various processing parameters, and the position information that includes the second detection frame and the second likelihood information that are the output from the model post-processing means.

Description

Performance indexing device, performance indexing method, and program

The present invention relates to a performance indexing device, a performance indexing method, and a program for accurately analyzing the performance of a model that detects an object in an image, and weaknesses in the versatility and robustness of a model learning dictionary, as well as reinforcement policies.

In recent years, edge AI and cloud AI equipped with AI functions have become rapidly popular. AI (artificial intelligence) is a model of neurons in the human brain, and a wide variety of models have been developed to detect objects from images.

As an example, when initially training or relearning a model learning dictionary of an object detection model using image data and learning parameters that have been inflated (extended data) by Augmentation, the quality of the augmented data is required for the training data. If the quality is not met, the augmented data becomes noise and may reduce the quality and efficiency of learning. Therefore, the editing parameters of multiple learning data obtained by editing the original data representing the judgment target are A means for determining each data, a means for generating a plurality of learning data each representing a determination target from the original data based on the parameters, and a means for learning a model using each of the plurality of learning data. , a method for improving the quality of extended data for learning has been proposed (see Patent Document 1).

JP 2021-111228 Publication

However, with conventional performance indexing devices, performance indexing methods, and programs for object detection models, when a model learning dictionary is learned by deep learning etc., the versatility and robustness against various fluctuation conditions are insufficiently improved. There were cases where it became. The present invention has been made in view of the above-mentioned problems, and is a performance index for accurately analyzing the performance of a model that detects objects in images and weaknesses in the versatility and robustness of a model learning dictionary, as well as reinforcement policies. The purpose is to provide devices, performance indexing methods, and programs.

A performance indexing device according to one aspect of the present invention is a performance indexing device for an object detection model, and includes an image processing unit that acquires an image and processes it appropriately, and a system that processes various types of images acquired by the image processing unit. an object detection model including a model preprocessing means for processing a plurality of images according to processing parameters; and a model learning dictionary for inferring an object position and likelihood with respect to input of the plurality of images processed by the model preprocessing means; , based on the inference result of the object detection model, position information and first likelihood information including the first detection frame for each detected object in the plurality of images are set to a second detection frame having an appropriate value. a model post-processing means for correcting the position information including the second likelihood information and the second likelihood information, and the position information including the second detection frame which is the output result of the model post-processing means, the second likelihood information and the various types of and robustness verification means for verifying the robustness of the object detection model based on the processing parameters.

Further, the performance indexing method according to one aspect of the present invention includes an image processing step of acquiring and appropriately processing an image, and a model of processing the image acquired in the image processing step into a plurality of images according to various processing parameters. a preprocessing step; an object detection model including a model learning dictionary that infers object positions and likelihoods based on the input of the plurality of images processed in the model preprocessing step; and an inference result of the object detection model. Based on the position information including the first detection frame and the first likelihood information for each detection object in the plurality of images, the position information including the second detection frame and the second likelihood, which are appropriate values. a model post-processing step for correcting information, and detecting the object based on position information including a second detection frame that is an output result of the model post-processing step, second likelihood information, and the various processing parameters. and a robustness verification step of verifying the robustness of the model.

Further, a program according to one aspect of the present invention is a program for causing a computer to execute the performance indexing method described above.

Note that these comprehensive or specific aspects may be realized by a system, a device, an integrated circuit, a computer program, or a computer-readable recording medium such as a CD-ROM, and the system, device, integrated circuit, computer program and a recording medium may be used in any combination.

According to the present invention, there is provided a performance indexing device, a performance indexing method, and a program for accurately analyzing the performance of a model that detects an object in an image, and weaknesses in the versatility and robustness of a model learning dictionary, as well as reinforcement policies. is provided.

1 is a diagram showing a performance indexing device for an object detection model according to an embodiment of the present invention; FIG. FIG. 2 is a diagram showing the configuration of an artificial neuron model. FIG. 2 is a diagram illustrating the configuration of a YOLO model according to an embodiment. FIG. 3 is a diagram illustrating the operating principle of the YOLO model according to an embodiment. It is a figure showing the calculation concept of the IOU value in object detection. FIG. 6 is a diagram showing a flowchart of the individual identification means of the model post-processing means according to the embodiment of the present invention. FIG. 6 is a diagram showing the operation of the individual identification means of the model post-processing means according to the embodiment of the present invention. FIG. 6 is a diagram showing a flowchart of the individual identification means of the model post-processing means according to the embodiment of the present invention. FIG. 6 is a diagram showing the operation of the individual identification means of the model post-processing means according to the embodiment of the present invention. FIG. 1 is a diagram illustrating problems of a conventional object detection model performance indexing device. FIG. 2 is a second diagram illustrating problems with a conventional object detection model performance indexing device. FIG. 6 is a diagram illustrating the operation of the position shifting function of the model preprocessing means according to an embodiment of the invention. FIG. 6 is a diagram illustrating the operation of the resizing function of the model preprocessing means according to the embodiment of the present invention. FIG. 6 is a diagram showing the operation of the probability statistical calculation means of the robustness verification means according to the embodiment of the present invention. FIG. 6 is a diagram showing the operation of the probability statistical calculation means of the robustness verification means according to the embodiment of the present invention. FIG. 6 is a diagram showing the operation of the probability statistical calculation means of the robustness verification means according to the embodiment of the present invention. FIG. 6 is a diagram showing the operation of the tone conversion function of the model preprocessing means according to the embodiment of the present invention. FIG. 6 is a diagram illustrating the operation of the aspect ratio changing function of the model preprocessing means according to the embodiment of the present invention. FIG. 6 is a diagram illustrating the operation of the rotation function of the model preprocessing means according to an embodiment of the invention. 1 is a diagram showing a performance indexing device for an object detection model according to an embodiment of the present invention; FIG. FIG. 2 is a diagram showing a conventional object detection model performance indexing device. FIG. 2 is a diagram illustrating a summary of the object detection model performance indexing device of the present invention.

(Knowledge that formed the basis of disclosure)
In recent years, edge AI and cloud AI equipped with AI functions have become rapidly popular. AI (artificial intelligence) is a model of neurons in the human brain, and a wide variety of models have been developed to detect objects from images. For example, in the case of humans, it is common to detect the position of a target object from eye information (image) and perform class identification to determine which class, such as a person or a vehicle, the object belongs to. CNN (Convolutional Neural Network), which is a convolutional neural network, is often used as an object detection model. ) After preparing a large amount of training data, for example, in a problem of classifying objects or backgrounds using gradient descent, binary cross entropy is used as an error function, and for regression problems of deviations from the ground truth BBox. The mainstream is an end-to-end learning method using deep learning, which uses the L1 norm (absolute value error) as an error function and minimizes all error functions to learn CNN weighting coefficient information (model learning dictionary). FasterR-CNN, EffectiveDet, SSD, and YOLO (You Only Look Once) (for example, see Non-Patent Document 1) are increasingly being used as models for object position detection and class identification. are doing.

In addition, as a means of checking the performance of an object detection model, in the case of YOLO, which is one of the object detection models mentioned above, one of the indicators indicating the detection reliability of the target object is the reliability shown in (Equation 1) below. There is a degree score (for example, see Non-Patent Document 1). The confidence score is sometimes commonly referred to as likelihood.

Reliability score (likelihood) = Pr(Classi|Object)×Pr(Object)×IOUTtruthpred (Formula 1)

Here, Pr(Class|Object) indicates the class probability to which class the Object (target object) belongs, and the sum of all class probabilities is "1". Pr(Object) indicates the probability that an Object is included in a BoundingBox (hereinafter referred to as BBox). IOUTtruth pred is an index indicating how much the two frame areas of ground truth BBox, which is the correct frame information, and BBox predicted (inferred) by a model such as YOLO overlap, and IOU( Intersection Over Union) value.

IOU = Area of Union ÷ Area of Intersection (Formula 2)

Here, Area of Union is the area of the union of the two frame areas to be compared. Area of Intersection is the area of the common portion of the two frame regions to be compared.

For example, when inferring an image taken with a camera using YOLO, including a deep learning model learning dictionary, there is no training data such as groundtruth BBox, which is the correct answer frame, so IOUTtruthpred is set to "1". The calculated result is sometimes referred to as likelihood (reliability score). Using this likelihood, for example, it is possible to index the detection accuracy and detection performance of a detection target in an image taken with a camera. In addition, by creating training data that adds groundtruth BBox, which is a correct answer frame, to the captured image, it becomes possible to calculate the original reliability score (likelihood) and IOU value, so the detection in the image It is possible to index the detection accuracy and detection performance of an object detection model including a model learning dictionary for the target.

Additionally, mAP (mean average precision) and AP (average precision) are often used as indicators for comparing the accuracy and performance of object detection. (For example, see Non-Patent Document 2)

mAP and AP in object detection are calculated by the following method.

Prepare validation data with groundtruth BBox, which is a correct answer frame, added to the detected object that is the target of multiple image data, and calculate PredictedBBox (predicted BBox) as a result of inference (prediction) using the object detection model. Compare and calculate IOU value. At that time, among all the prediction results of validation data, Precision, which indicates the percentage of correctly predicted IOUs with an arbitrary threshold value or more, and among the actual correct results, IOUs with IOUs greater than an arbitrary threshold value and positions close to the correct results. Calculate Recall indicating the rate at which BBox was predicted. At that time, for the validation data, the Precision and Recall values are calculated from "0", which is the minimum probability, to "1", which is the maximum probability that the above-mentioned Object is included in the BBox for each class to be identified. The sum of the areas of the two-dimensional graph is calculated as AP, and the average of the APs calculated for all identification classes is calculated as mAP. In addition to indexing the average detection accuracy and detection performance of the object detection model including the model learning dictionary for the detection target in the image, it can also be used as a performance index for various types of robustness, although it depends on the method of selecting validation data. is also often used.

FIG. 17 is a block diagram showing a conventional performance indexing device for analyzing the robustness and reinforcement policy of a model learning dictionary for a model that detects the position of an object in an image and identifies its class.

The image processing means 100 that acquires and appropriately processes images receives light emitted from a lens (for example, standard zoom, wide-angle zoom, fisheye) and an object passing through the lens, and converts the brightness of the light into electrical information. The image sensor, which is a device that converts into Equipped with an image processing processor equipped with a correction function and a local tone mapping function, it makes it easy to see or find the object to be detected while absorbing time-series fluctuation conditions such as illuminance in the shooting environment. Perform image processing.

The image generated by the image processing means 100 is input to the image output control means 110 and sent to a display and data storage means 120 such as a monitor, an external memory such as a PC (personal computer), a cloud server, etc.

On the other hand, in order to perform object detection using the object detection model 300, image data generated by the image processing means 100 is input to the model preprocessing means 200, and processed so that it becomes an image suitable for input to the object detection model 300. do. Note that the model preprocessing means 200 may be configured with an electronic circuit, or may be realized by an image processing processor 290 configured with an affine transformation function 291, a projective transformation function 292 (library), and a CPU or an arithmetic processor. be.

The image processed by the model preprocessing means 200 is input to the object detection model 300, and by inference (prediction), it is detected where the target object is, and whether the object is a person or a vehicle. It is identified whether it corresponds to a class (class identification). As a result, from the object detection model 300, for each detected object existing in one image, position information 301 including zero or multiple first detection frames including undetectable and false detection, and first likelihood information 302 is output. Here, the position information 301 including the first detection frame is, for example, information including the center coordinates, horizontal width, and vertical height of the detection frame, and the first likelihood information 302 is, for example, These are the likelihood and class identification information that indicate detection accuracy.

The object detection model 300 includes, for example, a model learning dictionary 320 and a deep neural network (DNN) model 310 using a convolutional neural network (CNN). The DNN model 310 may use, for example, YOLO (for example, see Non-Patent Document 1), SSD, etc., which are models with high superiority in detection processing speed. Furthermore, when priority is given to detection accuracy, FasterR-CNN, EfficientDet, or the like may be used, for example. Furthermore, when performing mainly class identification without detecting the position of the object, for example, MobileNet may be used. The model learning dictionary 320 is a collection of weighting coefficient data of the DNN model 310, and in the case of the DNN model 310, it is initially learned or re-learned by the deep learning means 640.

For each detected object existing in one image output from the object detection model 300, position information 301 including zero or multiple first detection frames including undetectable and false detection and first likelihood information 302 are , after inputting it to the model post-processing means 400, the most appropriate one for each detection object is selected by sorting the position information 301 including the first detection frame based on mutual IOU values, determining the maximum of the first likelihood information 302, etc. The position information 401 including the possible second detection frame and the second likelihood information 402 are corrected and sent to a display and data storage means 120 such as a monitor, an external memory such as a PC (personal computer), a cloud server, etc. be done. Here, the position information 401 including the second detection frame is, for example, information including the center coordinates, horizontal width, and vertical height of the detection frame, and the second likelihood information 402 is, for example, These are the likelihood and class identification information that indicate detection accuracy.

A series of means for generating position information 401 including the second detection frame and second likelihood information 402 by these image processing means 100, model pre-processing means 200, object detection model 300, and model post-processing means 400 is as follows: This is a first performance indexing device 30 for analyzing the robustness and reinforcement policy of a model learning dictionary for a model that detects the position of an object in an image and identifies its class.

Next, an example of deep learning for creating the model learning dictionary 320 will be described.

First, learning material data considered appropriate for the purpose of use is extracted from the learning material database storage means 610 in which material data for deep learning such as large-scale open source datasets are stored. Note that, as material data for learning, necessary images depending on the purpose of use are utilized, for example, image data that is displayed from the image processing means 100 using the image output control means 110 and stored in the data storage means 120. In some cases.

Next, the annotation unit 620 adds class identification information and groundtruth BBox, which is a correct answer frame, to the learning material data extracted from the learning material database storage unit 610 to create supervised data.

Next, the supervised data generated by the annotation means 620 is augmented by the augmentation means 630 as a learning image 631 in order to enhance versatility and robustness.

Next, the learning image 631 is input to the deep learning means 640, the weighting coefficient of the DNN model 310 is calculated, and the calculated weighting coefficient is converted into, for example, ONNX format to create the model learning dictionary 320. By reflecting the model learning dictionary 320 in the object detection model 300, it becomes possible to detect the position of the object in the image and identify the class.

Next, an example of the second performance indexing device 40 for analyzing the robustness and reinforcement policy of a model learning dictionary for a model that detects the position of an object in an image and identifies its class will be described.

Validation material data for verifying detection accuracy, detection performance, versatility, and robustness required for the purpose of use is extracted from the aforementioned learning material database storage means 610. Validation material data is an image output from a large-scale open source dataset or image processing means 100, for example, to verify the detection accuracy, detection performance, versatility, and robustness required for the purpose of use. In some cases, the control means 110 is used to display the image data and the image data stored in the data storage means 120 is utilized.

Next, the annotation unit 620 adds class identification information and groundtruth BBox, which is a correct answer frame, to the validation material data extracted from the learning material database storage unit 610 to create validation data 623.

Next, the validation data 623 is input to the first mAP calculation means 660 that is capable of inference (prediction) equivalent to that of the object detection model 300, and the mAP is calculated as a result of inference (prediction) with the groundtruth BBox that is the correct answer frame. Calculation of IOU value 653 by comparing Predicted BBox (predicted BBox), and calculation of Precision 654 which indicates the percentage of all prediction results for all validation data 623 where IOU value 653 was correctly predicted at or above an arbitrary threshold value. Then, among the actual correct results, calculation of Recall 655, which indicates the proportion of BBoxes near the correct result whose IOU value 653 is greater than or equal to an arbitrary threshold, can be predicted, and the above-mentioned object detection accuracy and performance are compared. AP (Average Precision) value 651 for each class as an index and mAP (mean Average Precision) value 652 averaged over all classes are calculated (for example, see Non-Patent Document 2). Here, for example, when YOLO is applied to the DNN model 310, the first mAP calculation means 660 is equipped with an open source inference environment called darknet and an arithmetic processor (including a personal computer and a supercomputer). It is desirable that the object detection model 300 has the same inference (prediction) performance as the object detection model 300. Furthermore, it is provided with means for calculating the IOU value 653, Precision 654, Recall 655, AP value 651, and mAP value 652 described above.

A series of means for generating the IOU value 653, Precision 654, Recall 655, AP value 651, and mAP value 652 by the learning material database storage means 610, the annotation means 620, and the first mAP calculation means 660 is the This is a second performance indexing device 40 for analyzing the robustness and reinforcement policy of a model learning dictionary for a model that performs position detection and class identification.

On the other hand, when initially training or relearning a model learning dictionary of an object detection model using image data and learning parameters that have been inflated (extended data) by Augmentation, the quality of the augmented data is required for the training data. If the quality is not met, the augmented data becomes noise and may reduce the quality and efficiency of learning. Therefore, the editing parameters of multiple learning data obtained by editing the original data representing the judgment target are A means for determining each data, a means for generating a plurality of learning data each representing a determination target from the original data based on the parameters, and a means for learning a model using each of the plurality of learning data. , a method for improving the quality of extended data for learning has been proposed (see, for example, Patent Document 1).

However, conventional performance indexing devices, methods, and programs for object detection models have problems with accuracy and performance during inference that the neural network itself, including the DNN model in the object detection model, potentially has, and deep learning It has been difficult to accurately identify weaknesses in versatility and robustness against various fluctuating conditions and reinforcement policies caused by model learning dictionaries created by such methods. Therefore, this becomes a cause of poor performance of the neural network and failure to enhance the versatility and robustness of the model learning dictionary.

Versatility and robustness items and various fluctuation conditions for a model that detects objects in images acquired by a camera etc. include the background (scenery), camera lens specifications, image size, such as the height and elevation/depression angle at which the camera is mounted. Detection target area and field of view, dewarp processing method when using a fisheye lens, special conditions such as changes in illuminance depending on sunlight and lighting, crushed shadows, blown highlights, backlight, etc., sunny, cloudy, rain, snow , weather conditions such as fog, position (left, right, top, bottom, and depth) of the target detection object in the image, size, brightness level, shape and characteristics including color information, aspect ratio, rotation angle, number of target detection objects, and mutual overlap. These include the state, the type, size, and position of the attached object, whether or not the lens has IR cut, the moving speed of the object to be detected, and the moving speed of the camera itself.

As a first problem, the image processing means 100, model preprocessing means 200, object detection model 300, and model postprocessing means 400 shown in FIG. When indexing the performance of an object detection model using a series of first performance indexing devices 30, a performance indexing method (hereinafter sometimes simply referred to as a method), and a program that generates an image, When the position and size of the detection target fluctuates over time, even though the same object is being detected, the detection frame is inferred (predicted) due to the configuration conditions of the DNN model and issues caused by the algorithm. Variations may occur in specific patterns in location information and likelihood information, including This phenomenon occurs when cameras for detecting objects are made smaller, consume less power, and reduce costs due to limitations in the performance of arithmetic processors such as DSPs (digital signal processors) installed in DNN models. This problem is thought to be particularly noticeable when the input image size is reduced. For example, when using a one-stage DNN model such as YOLO, which is said to have a high processing speed because it simultaneously detects the position of an object and identifies its class, as shown in Figure 10, If you check the likelihood distribution for the position of the detected object by inference for multiple images created by shifting the position of the object in the image in units of several pixels in the horizontal and vertical directions, the detection Depending on the location of the object, there may be locations where the likelihood decreases in a unique grid pattern. For example, in the case of YOLO, this occurs because the area is divided into grid cells of arbitrary size and class probabilities are calculated in order to detect the object position and identify the class at the same time, as shown in Figure 3B. This is considered a potential issue. On the other hand, when using a two-stage DNN model such as EfficientDet, which processes object position detection and class identification in two stages, it does not have as many issues as the one-stage DNN model mentioned above. Although this is unlikely to occur in many cases, it may be difficult to apply depending on the intended use because the detection speed decreases.

For this reason, performance indexing devices, methods, and programs that infer (predict) position information and likelihood information including the detection frame using only the original image (pinpoint) for the image are difficult to grasp with accurate detection accuracy and performance. As a result, it was sometimes impossible to identify potential problems with the neural network itself or to formulate solutions. Furthermore, it becomes insufficient to understand the weaknesses of the model learning dictionary, which is one of the components of the object detection model, and the conditions that require reinforcement. Therefore, when a model learning dictionary is learned by deep learning or the like, improvements in versatility and robustness against various fluctuation conditions may not be sufficient.

As a second problem, a series of steps to generate an IOU value 653, a Precision 654, a Recall 655, an AP value 651, and an mAP value 652 using the learning material database storage means 610, annotation means 620, and first mAP calculation means 660 shown in FIG. The second performance indexing device 40, method, and program are used to analyze the robustness and reinforcement policy of a model learning dictionary for a model that detects the position and class of an object in an image. When indexing performance, it is possible to understand the overall and average detection accuracy and detection performance for the validation data selected for verification, but it is important to understand in detail the versatility and robustness against various fluctuation conditions. I can't. In addition, for the first mAP calculation means 660 in FIG. 17, the first When the performance indexing device 30, method, and program of Conditions that require reinforcement will not be fully understood. Therefore, when a model learning dictionary is learned by deep learning or the like, improvements in versatility and robustness against various fluctuation conditions may not be sufficient.

The present invention has been made in view of the above-mentioned problems, and is a performance index for accurately analyzing the performance of a model that detects objects in images and weaknesses in the versatility and robustness of a model learning dictionary, as well as reinforcement policies. The purpose is to provide devices, methods, and programs. Furthermore, in order to reduce the size, power consumption, and cost of cameras for detecting objects, even if performance limitations are placed on the arithmetic processors such as DSPs (digital signal processors) installed, The purpose is to provide a performance indexing device, method, and program to ensure detection accuracy and performance.

(Summary of disclosure)
A performance indexing device according to a first aspect of the present invention is a device that performs performance indexing in an object detection model, and includes an image processing means for acquiring and appropriately processing an image, and an image processing means for acquiring an image and appropriately processing the image. A model preprocessing means for processing an image into a plurality of images according to various processing parameters, and a model learning for inferring an object position and likelihood (degree of certainty) for the plurality of images processed by the model preprocessing means. An object detection model including a dictionary, and position information including a first detection frame and first likelihood information for each detected object in the plurality of images are set to appropriate values based on the inference results of the object detection model. model post-processing means for correcting position information including a second detection frame and second likelihood information; and position information including the second detection frame and second likelihood that are output results of the model post-processing unit. The present invention is characterized by comprising a robustness verification means for verifying the robustness of the object detection model based on the information and the various processing parameters.

A performance indexing device according to a second aspect corresponding to an embodiment is the performance indexing device according to the first aspect, wherein the model preprocessing means processes the plurality of images input to the object detection model. When processing, as the various processing parameters, a position shift of N (any integer) times in the horizontal direction and M (any integer) times in the vertical direction is used in S (any decimal) pixel steps. A total of N×M position shifted images are generated.

A performance indexing device according to a third aspect corresponding to a certain embodiment is the performance indexing device according to the first or second aspect, wherein the model preprocessing means When processing an image of In the (pixel) step, a total of N×M×L position-shifted images are generated using N (any integer) times in the horizontal direction and M (any integer) times in the vertical direction.

A performance indexing device according to a fourth aspect corresponding to an embodiment is the performance indexing device according to any one of the first to third aspects, wherein the model preprocessing means When processing the plurality of images input to the , the brightness level is set to an arbitrary value using P (arbitrary integer) types of contrast correction curves or gradation conversion curves as the various processing parameters. Generate a modified image.

A performance indexing device according to a fifth aspect corresponding to a certain embodiment is the performance indexing device according to any one of the first to fourth aspects, wherein the model preprocessing means When processing the plurality of images input to the image processing apparatus, Q (arbitrary integer) types of aspect ratios are further used as the various processing parameters to generate images with changed aspect ratios.

A performance indexing device according to a sixth aspect corresponding to a certain embodiment is the performance indexing device according to any one of the first to fifth aspects, wherein the model preprocessing means When processing the plurality of images that are input to the image processing apparatus, R (arbitrary integer) types of angles are further used as the various processing parameters to generate images with changed rotation angles.

A performance indexing device according to a seventh aspect corresponding to a certain embodiment is the performance indexing device according to any one of the first to sixth aspects, wherein the model preprocessing means When processing the plurality of images input to the computer, an image is generated by pasting the average luminance level of the effective images in a blank space where no effective images are generated due to the processing.

A performance indexing device according to an eighth aspect corresponding to a certain embodiment is the performance indexing device according to any one of the first to seventh aspects, wherein the model post-processing means For each of one or more detected objects of the output results of the object detection model existing in one image, position information including zero or more first detection frames including undetectable and false detection; and Indicates how much an arbitrary threshold T (arbitrary decimal number) for the first likelihood information overlaps with the region of position information including the first detection frame with respect to the first likelihood information. Correct the position information including the second detection frame with the maximum likelihood and the second likelihood information for each detected object using an arbitrary threshold value U (arbitrary decimal number) for the IOU (Intersection over Union) value, which is an index. It is characterized by having individual identification means.

A performance indexing device according to a ninth aspect corresponding to a certain embodiment is the performance indexing device according to any one of the first to eighth aspects, wherein the model post-processing means If there is positional information and class identification information that include a correct detection frame, the function corrects the positional information that includes the correct detection frame according to the contents of the various processing parameters, and For each of one or more detected objects of the output results of the object detection model existing in one image, position information including zero or more first detection frames including undetectable and false detection; and For the first likelihood information, an arbitrary threshold T (arbitrary decimal number) for the first likelihood information, position information including the correct detection frame, and position information including the first detection frame. The maximum likelihood position information including the second detection frame for each detected object is calculated using an arbitrary threshold value U (an arbitrary decimal number) for the IOU (Intersection over Union) value, which is an index showing how much the regions overlap. It is characterized by comprising an individual identification means for correcting the second likelihood information.

A performance indexing device according to a tenth aspect corresponding to an embodiment is the performance indexing device according to the eighth or ninth aspect, wherein the model post-processing means is configured to The various processing parameters used in image processing and the output results of the individual identification means are individually linked for each detected object and output to the robustness verification means.

A performance indexing device according to an eleventh aspect corresponding to a certain embodiment includes any one of the second to tenth aspects that cite the second aspect, or the third to tenth aspects that cite the third aspect. The performance indexing device according to any one aspect, wherein the robustness verification means includes position information including the second detection frame, which is an output result of the model post-processing means, and the second likelihood information. a likelihood distribution indicating the variation accompanying the position shift for each of the detected objects for each of the various processing parameters based on the likelihood in the above, and an average likelihood that is the average value of the effective area of the likelihood; A histogram of the likelihood, the standard deviation of the likelihood which is the standard deviation of the valid area of the likelihood, the maximum likelihood which is the maximum value of the valid area of the likelihood, and the minimum value of the valid area of the likelihood The present invention is characterized by comprising probability statistical calculation means for calculating either or all of the minimum likelihood and the IOU value corresponding to the likelihood.

A performance indexing device according to a twelfth aspect corresponding to a certain embodiment may be any one of the second to eleventh aspects that cite the second aspect, or the third to 11th aspects that cite the third aspect. In the performance indexing device according to any one aspect, if position information including a detection frame that is a correct answer and class identification information that is a correct answer exist for each of the detected objects, the robustness verification means detects the The IOU value of the position information including the second detection frame which is the output result of the model post-processing means, the position information including the correct detection frame, the class identification information in the second likelihood information, and the IOU distribution showing the variation due to the position shift for each detected object with respect to the IOU value and the class identification correct answer rate for each of the various processing parameters, based on the class identification correct answer rate calculated from the class identification information that is the correct answer. and a class identification correct answer rate distribution, an average IOU value and an average class identification correct answer rate that are the average values of the effective area of the IOU value and the class identification correct answer rate, a histogram of the IOU value and a histogram of the class identification correct answer rate, The standard deviation of the IOU value, which is the standard deviation of the effective area of the IOU value and the class identification correct answer rate, the standard deviation of the class identification correct answer rate, and the maximum value, which is the maximum value of the effective area of the IOU value and the class identification correct answer rate. The probability statistical calculation means calculates either or all of the IOU value and the maximum class identification accuracy rate, the minimum IOU value and the minimum class identification accuracy rate that are the minimum values of the effective area of the IOU value and the class identification accuracy rate. It is characterized by having the following.

A performance indexing device according to a thirteenth aspect corresponding to a certain embodiment is a performance indexing device according to the eleventh or twelfth aspect that cites the eleventh aspect, and the robustness verification means further comprises: For each processing parameter, extraction of a position or region where the likelihood distribution for each detected object is below an arbitrary threshold value, extraction of the detected object where the average likelihood is below an arbitrary threshold value, and extraction of the detected object where the average likelihood is below an arbitrary threshold value, and Extraction of the detected object whose standard deviation is greater than or equal to an arbitrary threshold, extraction of the detected object whose maximum likelihood is equal to or less than an arbitrary threshold, and extraction of the detected object whose minimum likelihood is equal to or less than an arbitrary threshold. The present invention is characterized by having a learning reinforcement necessary item extracting unit that extracts any or all of the detected objects whose IOU value is equal to or less than an arbitrary threshold value.

A performance indexing device according to a fourteenth aspect corresponding to a certain embodiment is a performance indexing device according to the twelfth or thirteenth aspect that cites the twelfth aspect, and the robustness verification means further comprises: For each of the various processing parameters, extracting a position or area where the IOU distribution for each detected object is below an arbitrary threshold value, extracting a position or area where the class identification accuracy rate distribution is below an arbitrary threshold value, and extracting the average IOU extraction of the detected object whose value is equal to or less than an arbitrary threshold; extraction of the detected object whose average class identification accuracy rate is equal to or less than the arbitrary threshold; and detection of the detected object whose standard deviation of the IOU value is equal to or greater than the arbitrary threshold. extraction of an object, extraction of the detected object for which the standard deviation of the class identification accuracy rate is greater than or equal to an arbitrary threshold, extraction of the detected object for which the maximum IOU value is less than or equal to the arbitrary threshold, and the maximum class identification correct answer. extraction of the detected object for which the rate is below an arbitrary threshold; extraction of the detected object for which the minimum IOU value is below the arbitrary threshold; and extraction of the detected object for which the minimum IOU value is below the arbitrary threshold. The present invention is characterized by having a means for extracting necessary items for learning reinforcement that includes any or all of the extraction methods.

A performance indexing device according to a fifteenth aspect corresponding to a certain embodiment is the performance indexing device according to the fourteenth aspect, in which the probability statistical calculation means of the robustness verification means and the learning reinforcement necessary item The extraction means is configured to perform a probability statistical calculation based on the likelihood, the IOU value, and the class classification correct answer rate for an image in which pixels related to the target detection object are missing at an arbitrary rate. It is characterized by having a function to exclude it from calculation targets.

A performance indexing device according to a sixteenth aspect corresponding to a certain embodiment is the performance indexing device according to any one of the thirteenth to fifteenth aspects, which performs analysis based on the output of the probability statistical calculation means. As a result, if it is determined that the performance of the model learning dictionary is insufficient, a learning image is prepared based on the result of the learning reinforcement necessary item extraction means, and the model learning dictionary is used by the built-in or external dictionary learning means. It is characterized by relearning.

A performance indexing device according to a seventeenth aspect corresponding to a certain embodiment is the performance indexing device according to any one of the first to sixteenth aspects, wherein the object detection model is created by deep learning. It is characterized by being a neural network that includes a model learning dictionary.

A performance indexing method according to an eighteenth aspect of the present invention is a method for creating a performance index in an object detection model, which includes an image processing step of acquiring and appropriately processing an image, and an image processing step of acquiring and appropriately processing an image. A model preprocessing step for processing an image into multiple images according to various processing parameters, and model learning for inferring object positions and likelihoods (degrees of certainty) for the multiple images processed in the model preprocessing step. An object detection model including a dictionary, and position information including a first detection frame and first likelihood information for each detected object in the plurality of images are set to appropriate values based on the inference results of the object detection model. a model post-processing step that corrects the position information including the second detection frame and second likelihood information; and the position information including the second detection frame and the second likelihood that are the output results of the model post-processing step. The method is characterized in that it includes a robustness verification step of verifying the robustness of the object detection model based on information and the various processing parameters, and the method executes each of the means as steps.

A performance indexing program according to a nineteenth aspect of the present invention is a program for causing a computer to perform performance indexing in an object detection model, and includes an image processing step of acquiring and appropriately processing an image, and an image processing step of acquiring and appropriately processing an image. A model preprocessing step for processing the image acquired in the step into a plurality of images according to various processing parameters, and an object position and likelihood (degree of certainty) for the plurality of images processed in the model preprocessing step. an object detection model including a model learning dictionary that infers the object detection model; and position information including a first detection frame and first likelihood information for each detected object in the plurality of images based on the inference result of the object detection model. a model post-processing step for correcting position information including a second detection frame that is an appropriate value and second likelihood information; and position information including the second detection frame that is an output result of the model post-processing step. a robustness verification step of verifying the robustness of the object detection model based on the second likelihood information and the various processing parameters; It is characterized by being a program.

According to the present invention, when performing performance indexing in an object detection model, the image from the image processing means is used as a basic image, and the model preprocessing means uses it as various processing parameters in S (arbitrary decimal) pixel (pixel) steps. , a total of N×M position shifted images are generated using N (any integer) times in the horizontal direction and M (any integer) times in the vertical direction, and for each of the multiple images, After inferring the position information and likelihood information including the detection frame using the object detection model, the model post-processing means corrects it so that individual identification is possible, and the robustness verification means calculates the likelihood distribution for the position of each detected object. By confirming the It becomes possible to accurately extract potential issues related to accuracy and performance during inference that the neural network itself, including the model, has latently. Furthermore, since it is possible to effectively formulate methods and methods for solving problems, it is possible to improve the detection accuracy and detection performance of the object detection model.

According to the present invention, the robustness verification means further includes a likelihood distribution indicating variations due to a position shift for each detected object, an average likelihood that is an average value of a valid region of likelihood, and a histogram of likelihoods. The standard deviation of likelihood, which is the standard deviation of the valid region of likelihood, the maximum likelihood, which is the maximum value of the valid region of likelihood, the minimum likelihood, which is the minimum value of the valid region of likelihood, and the By being equipped with a probability statistical calculation means that calculates any or all of the IOU values, we can extract features whose likelihood fluctuates due to fluctuations in the position of the detected object on the screen due to potential problems with object detection models. It becomes possible to do so. Therefore, it is possible to more accurately extract problems related to accuracy and performance during inference that the neural network itself including the DNN model in the object detection model has latently. Furthermore, since it is possible to more effectively formulate methods and methods for solving problems, it is possible to further improve the detection accuracy and detection performance of the object detection model. Furthermore, when combined with various machining parameters other than position shift, we will examine weaknesses in versatility and robustness against various fluctuation conditions caused by model learning dictionaries created by deep learning, etc., and reinforcement policies for neural networks including DNN models. This makes it possible to separate and accurately understand potential issues with the network itself. Therefore, it is possible to apply learning image data and supervised data that are more effective in deep learning, etc., thereby making it possible to enhance the versatility and robustness of the model learning dictionary.

According to the present invention, when position information including a detection frame that is a correct answer and class identification information that is a correct answer exist, the robustness verification means further calculates the IOU distribution and class Identification accuracy rate distribution, average IOU value, average class identification accuracy rate, histogram of IOU values, histogram of class identification accuracy rate, standard deviation of IOU value, standard deviation of class identification accuracy rate, maximum IOU value , the maximum class identification accuracy rate, the minimum IOU value, and the minimum class identification accuracy rate, or all of them. It becomes possible to extract a feature in which position information including the detection frame and class identification information fluctuate due to fluctuations in the position of the detected object on the screen. Therefore, it is possible to more accurately extract problems related to accuracy and performance during inference that the neural network itself including the DNN model in the object detection model has latently. Furthermore, since it is possible to more effectively formulate methods and methods for solving problems, it is possible to further improve the detection accuracy and detection performance of the object detection model. Furthermore, when combined with various machining parameters other than position shift, we will examine weaknesses in versatility and robustness against various fluctuation conditions caused by model learning dictionaries created by deep learning, etc., and reinforcement policies for neural networks including DNN models. This makes it possible to separate and accurately understand potential issues with the network itself. Therefore, it is possible to apply learning image data and supervised data that are more effective in deep learning, etc., thereby making it possible to enhance the versatility and robustness of the model learning dictionary.

According to the present invention, the probability statistical calculation means of the robustness verification means and the learning reinforcement necessary item extraction means further perform the probability statistical calculation based on the likelihood, IOU value, and class classification correct answer rate. By providing a function that excludes images that are missing a certain percentage of pixels related to the target detection object from calculation targets, the position and model of the object in the image can be used as the basis for verification. Even in cases where the effective range of the object to be detected is missing depending on the position of the object after processing various processing parameters of the preprocessing means, the performance and features of the accurate object detection model and the versatility and robustness of the model learning dictionary It becomes possible to verify the gender. Therefore, it is possible to improve the DNN model with respect to the detected object size and to enhance the versatility and robustness of the model learning dictionary.

According to the present invention, the model preprocessing means further generates an enlarged or reduced image using L (arbitrary integer) types of arbitrary magnifications as various processing parameters, and then generates the above-mentioned position shifted image. By doing so, the robustness verification means equipped with the probability statistical calculation means calculates the likelihood distribution, the average likelihood of the effective area of the likelihood, the histogram of the likelihood, and the standard deviation of the likelihood for the position of each detected object for each of the L types of sizes. It becomes possible to check the maximum likelihood, minimum likelihood, and IOU value. Furthermore, check the distribution, histogram, standard deviation, maximum value, and minimum value of IOU values for the position of each detected object for each L type of size, and the distribution, histogram, standard deviation, maximum value, and minimum value of the class identification accuracy rate. becomes possible. Therefore, it is possible to improve the DNN model with respect to the detected object size and to enhance the versatility and robustness of the model learning dictionary.

Furthermore, by using the model preprocessing means as various processing parameters, P (arbitrary integer) types of contrast correction curves or gradation conversion curves are used to generate an image in which the brightness level is changed to an arbitrary value. , the likelihood distribution and the average likelihood of the effective area of the likelihood for the position of each detected object for each of P (arbitrary integer) types of contrast correction curves or gradation conversion curves are determined by a robustness verification means equipped with a probability statistical calculation means. It becomes possible to check the histogram of the likelihood, the standard deviation of the likelihood, the maximum likelihood, the minimum likelihood, and the IOU value. Furthermore, the distribution, histogram, standard deviation, maximum value, minimum value of the IOU value for each detected object position for each P (arbitrary integer) type of contrast correction curve or gradation conversion curve, and the distribution of the class identification accuracy rate. You can check the histogram, standard deviation, maximum value, and minimum value. Therefore, it is possible to improve the DNN model and strengthen the versatility and robustness of the model learning dictionary with respect to the brightness levels of the detected object and background that change depending on the weather conditions, shooting time, and illuminance conditions of the shooting environment.

Furthermore, by using Q (arbitrary integer) types of aspect ratios as various processing parameters by the model preprocessing means and generating images with changed aspect ratios, the robustness verification means equipped with the probability statistical calculation means The likelihood distribution, the average likelihood of the effective area of the likelihood, the histogram of the likelihood, the standard deviation of the likelihood, the maximum likelihood, the minimum likelihood, and the likelihood distribution for each detected object position for each aspect ratio of Q (arbitrary integer) types It becomes possible to check the IOU value. Furthermore, the distribution, histogram, standard deviation, maximum value, and minimum value of IOU values for the position of each detected object for each aspect ratio of Q (arbitrary integer) types, and the distribution, histogram, standard deviation, and maximum value of the class identification accuracy rate. , it becomes possible to check the minimum value. Therefore, it is possible to improve the DNN model for various aspect ratios of the detected object and to enhance the versatility and robustness of the model learning dictionary.

Furthermore, the model preprocessing means uses R (arbitrary integer) types of angles as various processing parameters to generate images with changed rotation angles, and the robustness verification means equipped with probability statistical calculation means (Arbitrary integer) Likelihood distribution for the position of each detected object by type of angle, average likelihood of the valid area of likelihood, histogram of likelihood, standard deviation of likelihood, maximum likelihood, minimum likelihood, and IOU value It becomes possible to confirm. Furthermore, the distribution, histogram, standard deviation, maximum value, and minimum value of IOU values for the position of each detected object for each R (arbitrary integer) type of angle, and the distribution, histogram, standard deviation, maximum value, and class identification accuracy rate, It becomes possible to check the minimum value. Therefore, it is possible to improve the DNN model for various inclinations of the detected object and to enhance the versatility and robustness of the model learning dictionary.

Furthermore, the model post-processing means uses a series of means for individually linking each output result and various processing parameters for each detected object and outputting them to the robustness verification means. For this purpose, it becomes possible to extract features whose likelihood changes due to fluctuations in the position of the detected object in the screen. Therefore, it is possible to more accurately extract problems related to accuracy and performance during inference that the neural network itself including the DNN model in the object detection model has latently.

Furthermore, based on the results of the robustness verification means equipped with a means for extracting necessary items for learning reinforcement, training images are prepared and the model learning dictionary is retrained by the built-in or external dictionary learning means, thereby determining the area near the detected object. When combined with various processing parameters other than position shift in any range (position such as left, right, top, bottom, and depth of an object in the screen, object size, contrast, gradation, aspect ratio, rotation, etc.), deep learning etc. It is possible to accurately understand weaknesses and reinforcement policies in generality and robustness against various fluctuation conditions caused by the model learning dictionary that is created, by separating them from potential issues with neural networks themselves, including DNN models. Become. Therefore, effective learning image data and supervised data can be applied to deep learning and the like. Therefore, it is possible to enhance the versatility and robustness of the model learning dictionary.

According to the present invention, when a plurality of images to be input to an object detection model are processed by the model preprocessing means, the average brightness level of the valid images is pasted to the blank area where no valid images exist due to the processing. By generating the likelihood distribution for each detected object position, the average likelihood of the effective area of the likelihood, and the likelihood histogram, the influence of the features in the margin area on the inference accuracy of the object detection model can be reduced. It becomes possible to more accurately calculate the standard deviation of the likelihood, the maximum likelihood, the minimum likelihood, and the IOU value. Furthermore, it is possible to more accurately check the distribution, histogram, standard deviation, maximum value, and minimum value of IOU values for the position of each detected object, and the distribution, histogram, standard deviation, maximum value, and minimum value of the class identification accuracy rate. It becomes possible. Therefore, it is possible to more accurately improve the DNN model and strengthen the versatility and robustness of the model learning dictionary.

According to the present invention, the model post-processing means further includes position information including zero or a plurality of first detection frames including undetectable and false detection frames and a first likelihood for each detection object present in the image. IOU (Intersection over By having an individual identification means that corrects the position information and second likelihood information including the second detection frame with the maximum likelihood for each detected object using an arbitrary threshold value U (arbitrary decimal number) for the union) value, abnormal data can be detected. It is possible to correct the position information and likelihood information including the detection frame for each detected object to appropriate information, so the likelihood distribution and the average likelihood of the effective area of the likelihood for the position of each detected object can be corrected. It becomes possible to more accurately calculate the histogram of the degree, the standard deviation of the likelihood, the maximum likelihood, the minimum likelihood, and the IOU value. Furthermore, it is possible to more accurately check the distribution, histogram, standard deviation, maximum value, and minimum value of IOU values for the position of each detected object, and the distribution, histogram, standard deviation, maximum value, and minimum value of the class identification accuracy rate. It becomes possible. Therefore, it is possible to more accurately improve the DNN model and strengthen the versatility and robustness of the model learning dictionary.

According to the present invention, if there is position information and class identification information including a correct detection frame for each detected object, the model post-processing means selects the correct detection frame according to the contents of various processing parameters. It has a function to correct the position information contained in the image, and for each detection object present in the image, it corrects the position information and first likelihood information including zero or multiple first detection frames including undetectable and false detection. is an index that indicates how much overlap between an arbitrary threshold value T (an arbitrary decimal number) for the first likelihood information and the area of the positional information including the correct detection frame and the area of the positional information including the first detection frame. It has an individual identification means that corrects each detected object to position information and second likelihood information including a second detection frame with the maximum likelihood based on an arbitrary threshold value U (arbitrary decimal number) for a certain IOU (Intersection over Union) value. By doing this, it is possible to eliminate abnormal data and correct the position information including the detection frame and likelihood information for each detected object to the optimal information, so the likelihood distribution and the effective area of the likelihood for the position of each detected object can be corrected. It becomes possible to accurately calculate the average likelihood, the histogram of the likelihood, the standard deviation of the likelihood, the maximum likelihood, the minimum likelihood, and the IOU value by comparing them with the correct data. Furthermore, the distribution, histogram, standard deviation, maximum value, and minimum value of the IOU value for each detected object position and the distribution, histogram, standard deviation, maximum value, and minimum value of the class identification accuracy rate are compared with the correct data for accuracy. It is possible to check. Therefore, it is possible to more accurately improve the DNN model and strengthen the versatility and robustness of the model learning dictionary.

Furthermore, it becomes possible to more accurately calculate the IOU value, Precision, Recall, AP value, and mAP value, which are indicators of overall and average inference accuracy and performance using validation data. The accuracy of indexing the object detection model 300 and the model learning dictionary 320 is improved.

According to the present invention, the robustness verification means further extracts, for each processing parameter, a position or region where the likelihood distribution for each detected object is below an arbitrary threshold, and where the average likelihood is below an arbitrary threshold. Extraction of detected objects, extraction of detected objects whose standard deviation of likelihood is greater than or equal to an arbitrary threshold, extraction of detected objects whose maximum likelihood is less than or equal to an arbitrary threshold, and extraction of detected objects whose minimum likelihood is less than or equal to an arbitrary threshold. By having a method for extracting items necessary for learning reinforcement that includes any or all of the extraction of detected objects, it is possible to eliminate weaknesses in versatility and robustness against various fluctuation conditions caused by model learning dictionaries created by deep learning, etc., and strengthen them. It becomes possible to understand the policy more accurately by separating it from the potential problems that the neural network itself, including the DNN model, has. Therefore, it is possible to apply learning image data and supervised data that are more effective in deep learning, etc., thereby making it possible to enhance the versatility and robustness of the model learning dictionary.

According to the present invention, the robustness verification means further extracts a position or region that is equal to or less than an arbitrary threshold value in the IOU distribution for each detected object, and selects an arbitrary value in the class identification accuracy rate distribution for each detected object, for each various processing parameters. Extraction of a position or area where the average IOU value is below a threshold value, Extraction of detected objects whose average IOU value is below an arbitrary threshold value, Extraction of detected objects whose average class identification accuracy rate is below an arbitrary threshold value, and IOU value standard Extracting a detected object whose deviation is greater than or equal to an arbitrary threshold; Extracting a detected object whose standard deviation of class identification accuracy is greater than or equal to an arbitrary threshold; Extracting a detected object whose maximum IOU value is less than or equal to an arbitrary threshold; Extraction of detected objects whose maximum IOU value is below an arbitrary threshold, extraction of detected objects whose minimum IOU value is below an arbitrary threshold, or extraction of detected objects whose minimum class identification accuracy rate is below an arbitrary threshold By having a learning reinforcement necessary item extraction means that includes any or all of the above, the versatility resulting from the model learning dictionary created by deep learning etc. based on position information including the detection frame and class identification information. It becomes possible to more accurately understand weak points in robustness and reinforcement policies against various fluctuation conditions by separating them from potential problems that the neural network itself, including the DNN model, has. Therefore, it is possible to apply learning image data and supervised data that are more effective in deep learning, etc., thereby making it possible to enhance the versatility and robustness of the model learning dictionary.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. Similar components are given the same reference numerals, and similar explanations will not be repeated.

(Embodiment 1)
FIG. 1 is a block diagram showing a performance indexing device 10 for detecting objects in images according to Embodiment 1 of the present invention.

Note that each means, each function, and each process described in Embodiment 1 of the present invention described later may be replaced with a step, and each device may be replaced with a method. Moreover, each means and each device described in Embodiment 1 of the present invention may be realized as a program executed by a computer.

The image processing means 100 that acquires and appropriately processes images includes a lens 101, an image sensor 102 that is a device that receives light emitted from an object through the lens, and converts the brightness of the light into electrical information. Equipped with black level adjustment function, HDR (high dynamic range) composition function, gain adjustment function, exposure adjustment function, defective pixel correction function, shading correction function, white balance function, color correction function, gamma correction function, local tone mapping function, etc. The main component is an image processing processor 103. Further, functions other than those described above may also be provided. The lens 101 may be, for example, a standard zoom lens, a wide-angle zoom lens, a fisheye lens, or the like, depending on the purpose of object detection. In the environment in which the object to be detected is photographed, time-series fluctuation conditions such as illuminance are detected and controlled by various functions installed in the image processing processor 103, and the object to be detected is detected while suppressing fluctuations. Apply image processing to make it easier to see or find.

The image generated by the image processing means 100 is input to the image output control means 110 and sent to a display and data storage means 120 such as a monitor device, an external memory such as a PC (personal computer), a cloud server, etc. Ru. The image output control means 110 may have a function of transmitting image data according to horizontal and vertical synchronization signals of the display and data storage means 120, for example. The image output control means 110 also refers to the position information 401 including the second detection frame, which is the output result of the model post-processing means 400, and the second likelihood information 402, so as to mark the detected object. It may also have a function of superimposing frame depiction and likelihood information on the output image. Further, the position information 401 including the second detection frame and the second likelihood information 402 are directly transmitted to the display and data storage means 120 using a serial communication function, a parallel communication function, or a UART that converts both. There may be.

On the other hand, in order to perform object detection using the object detection model 300, the image data generated by the image processing means 100 is input to the model preprocessing means 200, and the model is processed so that the image is suitable for input to the object detection model 300. The input image is processed into an input image 210. Here, if the object detection model 300 is a model that performs object detection using image data with only brightness levels, the image for object detection generated by the image processing means 100 has only brightness levels. If the object detection model 300 is a model that performs object detection using color image data including color information, the object detection model generated by the image processing means 100 may be converted into brightness data. The image may be color image data having pixels such as RGB. In the first embodiment, as an example, the object detection model 300 is a model that performs object detection using image data of only the brightness level, and the image for object detection generated by the image processing means 100 is A case where the luminance data is converted into luminance data having only levels will be explained.

The model preprocessing means 200 may be configured with electronic circuits such as adders, subtracters, multipliers, dividers, and comparators, or may be configured with functions (library ) or a fisheye lens to an image equivalent to a human visual field, and an image processing processor 290 comprising a CPU or an arithmetic processor. Note that the image processing processor 290 may be replaced by the image processing processor 103 included in the image processing means 100. The model preprocessing means 200 uses the above-mentioned affine transformation function 291, projective transformation function 292, image processing processor 290, or electronic circuit to have a function of cutting out a specific area, and a function of cutting out an image when cutting out a specific area. A position shift function 220 for shifting the image to an arbitrary position in the horizontal and vertical directions, a resizing function 230 for enlarging or reducing the image to an arbitrary magnification, and a rotation function 240 for rotating the image to an arbitrary angle. An aspect ratio change function 250 for arbitrarily changing the ratio between the horizontal and vertical directions, a gradation conversion function 260 for changing the brightness level with an arbitrary curve, and a dewarp function for performing distortion correction, cylindrical conversion, etc. It may include some or all of the function 270 and a margin padding function 280 for padding an area where no valid pixels exist with an arbitrary brightness level. Note that, in order to create a performance index for the object detection model 300, the model preprocessing means 200 uses the image data generated by the image processing means 100 as a reference image, and processes it using various processing parameters 510 according to the purpose of creating a performance index. , is processed into a plurality of model input images 210 and output to the object detection model 300, and its usage and operation will be explained in the explanation of the robustness verification means 500 described later. Note that in the first embodiment, as an example, the object detection model 300 is a model that performs object detection using image data of only the brightness level, and the object detection model 300 is a model for object detection generated by the model preprocessing means 200. A case will be described in which the input image 210 is converted into luminance data having only luminance levels.

The object detection model 300 is, for example, a deep neural network (DNN) model 310 that uses a model learning dictionary 320 and a convolutional neural network (CNN), which is a model of human brain neurons. Consists of. The DNN model 310 uses, for example, YOLO (for example, see Non-Patent Document 1), SSD, etc., which are models with a high advantage in detection processing speed. Furthermore, when priority is given to detection accuracy, FasterR-CNN, EfficientDet, or the like may be used, for example. Furthermore, when performing mainly class identification without detecting the position of the object, for example, MobileNet may be used.

FIG. 2 shows a schematic configuration of an artificial neuron model 330 and a neural network 340, which are the basic configuration of the CNN described above. As shown in FIG. 2 and (Equation 3), the artificial neuron model 330 receives the output signals of one or more neurons such as X0, An output to the next neuron is generated through an activation function 350 for the sum of the multiplication results. b is a bias (offset).

Furthermore, a collection of these many artificial neuron models is a neural network 340. The neural network 340 is composed of an input layer, an intermediate layer, and an output layer, and the output of each artificial neuron model 330 is input to each artificial neuron model 330 at the next stage. The artificial neuron model 330 may be realized by hardware such as an electronic circuit, an arithmetic processor, and a program. For example, the weighting coefficients of each artificial neuron model 330 are calculated as dictionary data using deep learning. The dictionary data, that is, the model learning dictionary 320 shown in FIG. It is something that is initially learned or re-learned.

Next, the activation function 350 will be explained. It is known that the activation function 350 needs to be a non-linear transformation, since repeating a linear transformation only transforms it into a linear transformation. The activation function 350 is a step function that simply identifies "0" or "1", a sigmoid function 351, a ramp function, etc.; In recent years, a ramp function such as ReLU (Rectified Linear Unit) 352 is often used because the calculation speed decreases. ReLU352 is a function whose output value is always 0 when the input value to the function is less than or equal to 0, and whose output value is the same as the input value when the input value is greater than 0. Even when the layers of the neural network 340 become deep, the gradient is less likely to disappear, and the calculation formula is simple, so it has an advantage in processing speed. Furthermore, Leaky ReLU (Leaky Rectified Linear Unit) 353, which is a derivative of ReLU 352, is also increasingly used because it has better accuracy than ReLU 352. LeakyReLU353 multiplies the input value by α if the input value is lower than 0 (α multiplication is, for example, 0.01 times (basic)), and if the input value is higher than 0, the output value is the same value as the input value. This is the function. Other activation functions 350 include a softmax function that is used when identifying the class of a detected object, and a suitable function is used depending on the purpose of use. The softmax function converts and outputs a plurality of output values so that the sum total becomes 1.0 (100%).

3A and 3B are examples of the configuration of a YOLO model 360, which is one of the DNN models 310. The YOLO model 360 shown in FIG. 3A may have, for example, a horizontal pixel Xi and a vertical pixel Yi as the input image size. Convolution layers 370 to 387 that can compress and extract region-based feature amounts by convolving the region of surrounding pixels by filtering, and Pooling that functions to absorb positional deviation of the filter shape in the input image. The basic configuration may be layers 390 to 395, a fully connected layer, and an output layer. In addition, for example, it includes a first detection layer 361, a second detection layer 362, and a third detection layer 363 for detecting the position of the object and classifying (identifying) the object, and The upsampling layer 364 and 365 for upsampling using deconvolution may also be used. The model input image size, the pixel size of the convolution layer, pooling layer, detection layer, upsampling layer, etc., the number and combination of various layers, the number and arrangement of detection layers, etc., depend on the intended use. It may be increased, decreased, or changed.

The Convolution layers 370 to 387 correspond to models of simple cells that respond to a specific shape or various shapes, and are used to recognize objects with complex shapes.

On the other hand, the Pooling layers 390 to 395 correspond to models of complex cells that function to absorb spatial deviations in shape, and when the position of an object of one shape shifts, it changes to another shape. It works so that all parts can be regarded as having the same shape. By combining these Convolution layers 370 to 387 and Pooling layers 390 to 395, it becomes possible to improve the accuracy of object detection by making it robust against movement and changes of detection objects with various complex shapes. Become.

Upsampling layers 364 and 365 perform class classification on the original image and use the results in each layer of the CNN as a feature map through skip connections shown at 366 and 367 in FIG. , and the third detection layer 363 enable detailed region identification. Note that the

skip connections

367 and 366 connect networks having the same configuration as the convolution layers 373 and 374 after the convolution layers 385 and 381, respectively.

Next, a method for calculating a confidence score 317 (corresponding to likelihood), which corresponds to the detection accuracy and confidence of the YOLO model 360 in an embodiment, will be explained with reference to FIG. 3B, in which one person is used as the detection object. . When using a one-stage DNN model such as YOLO, which is said to have a high processing speed, the model input image 311 is Divide the image area into grid cells of arbitrary size (a 7x7 example is shown in FIG. 3B). A step 312 of estimating a plurality of Bounding BBoxes and Confidence (reliability) 313 (Pr(Object)×IOU), and a step 312 of estimating a plurality of Bounding BBoxes and Confidence (reliability) 313 (Pr(Object) ) 315 calculation process 314 are processed in parallel. Thereafter, both are multiplied when calculating a confidence score 317 in a final detection step 316. Therefore, processing speed can be improved by simultaneously detecting the position of the object and identifying the class. Note that a position information detection frame 318 including the first detection frame indicated by a dotted line in the final detection step 316 is a detection frame displayed as a detection result for a person.

For each detected object present in one image output from the object detection model 300 shown in FIG. After inputting the degree information 302 to the model post-processing means 400, the position information 301 including the first detection frame is sorted based on mutual IOU values, the maximum likelihood information 302 is determined, etc., and each detected object is The position information 401 and the second likelihood information 402 are corrected to include the second detection frame considered to be the most appropriate. Position information 401 including the second detection frame and second likelihood information 402 are input to the image output control means 110 and the robustness verification means 500. Here, the position information 401 including the second detection frame is, for example, information including the center coordinates, horizontal width, and vertical height of the detection frame, and the second likelihood information 402 is, for example, These are the likelihood and class identification information that indicate detection accuracy.

The IOU value will be explained with reference to FIG. The denominator of the formula representing the IOU value 420 in FIG. 4(a) is the Area of Union 422 in the above-mentioned (Formula 1), which is the area of the union of the two frame areas to be compared. The numerator of the formula representing the IOU value 420 in (a) of FIG. 4 is the Area of Intersection 423 in the above-mentioned (Formula 1), which is the area of the common portion of the two frame regions to be compared. The maximum value is "1.0", indicating that the two frame data completely overlap. The larger the IOU value 420 of the position information 301 including the first detection frame, which is the output result of the object detection model 300, and the position information 621 including the correct detection frame (described later), the better the object detection is. As an example, when detecting a person, as shown in FIG. 4B, the groundtruth BBox 425, which is the correct answer frame for the person 424, and the PredictedBBox 426, which is calculated as a result of inference (prediction), are 11 in the horizontal direction and 11 in the vertical direction. If there is a difference of about %, the IOU value 427 of both will drop to about 0.65. As can be seen from this point, it is often used as an index to sensitively verify the accuracy and performance of object detection.

The model post-processing means 400 shown in FIG. and the first likelihood information 302, how much does an arbitrary threshold T (an arbitrary decimal number) for the first likelihood information 302 and the area of the position information 301 including the mutual first detection frame overlap? Corrected position information 401 and second likelihood information 402 including the maximum likelihood second detection frame for each detected object using an arbitrary threshold value U (arbitrary decimal number) for the IOU (Intersection over Union) value, which is an index representing the It may also be characterized by having an individual identification means 410.

For example, an example of the processing of the individual identification means 410 of the model post-processing means 400 will be explained using the flowchart of FIG. 5A and a model input image 440 in which a person is located close to each other after two names as detected objects as shown in FIG. 5B. do.

First, as shown in FIG. 5A, in input step S430, position information 301 including zero or a plurality of first detection frames including undetectable and false detection for each detection object and first likelihood information 302 are input. . At that time, as shown in FIG. 5B, among the

position information

441, 442, 443, and 444 including the four first detection frames output from the object detection model 300 and the four first likelihood information, It is assumed that

likelihoods

445, 446, 447, and 448 are input.

Next, in setting step S431, the IOU threshold "U" and the likelihood threshold "T" are set. This example shows a case where "U"=0.7 and "T"=0.5 are set as the threshold values.

Next, in the comparison step S432, the likelihood in the first likelihood information 302 is compared with the threshold value "T", and if the likelihood is determined to be false because the likelihood is less than the threshold value "T", in the deletion step S433, the corresponding The position information 301 including the first detection frame and the first likelihood information 302 are deleted from the calculation target, and if the likelihood is equal to or higher than the threshold "T", it is determined to be true, and the mutual IOU value calculation step S434 is performed. , a process is performed to calculate the IOU value of the mutual combination of the position information 301 including all the first detection frames to be calculated. In FIG. 5B, since the likelihood 446 is 0.33 and is less than the threshold value "T" = 0.5, the likelihood 446 and the position information 442 including the first detection frame that is pseudo-detected to include two people are shown. The first likelihood information included is deleted from the calculation target. There are three calculation candidates remaining, and the IOU values of mutual combinations of

position information

441, 443, and 444 including the respective first detection frames are calculated.

Next, in comparison step S435, all mutual IOU values are compared with the threshold value "U", and if the mutual IOU value is less than the threshold value "U" and determined to be false, it is determined that the detection results are independent. Then, in output step S437, the position information 401 including the second detection frame and the second likelihood information 402 are outputted, and if the mutual IOU value is equal to or greater than the threshold value "U", it is determined to be true, and the same detection It is assumed that the object is detected redundantly, and the process proceeds to the next maximum likelihood determination step S436. In FIG. 5B, since the mutual IOU value between the position information 441 including the first detection frame and the other two is less than the threshold value "U" = 0.7, the position including the first detection frame is treated as independent detection information. Outputting first likelihood information including information 441 and likelihood 445 (0.85) as second likelihood information including position information 451 including second detection frame and likelihood 453 (0.85) Output by S437. On the other hand, since the position information 443 and 444 including the first detection frame have mutual IOU values close to each other, it is determined that "U" is equal to or greater than 0.7, and the process proceeds to the next maximum likelihood determination step S436.

Finally, in the maximum likelihood determination step S436, the information other than the one with the maximum likelihood is determined to be false, and in the deletion step S433, the position information 301 including the corresponding first detection frame and the first detection frame are determined to be false. The likelihood information 302 of is deleted from the calculation target, and the one with the maximum likelihood is determined to be true, and in output step S437, the position information 401 including the second detection frame and the second detection frame are deleted. It may also be output as likelihood information 402. In FIG. 5B, as a result of performing maximum likelihood determination from two likelihoods, 447 (0.75) and 448 (0.92), the first likelihood information including likelihood 447 (0.75) and the position information 443 including the first detection frame are deleted from the calculation target, and the position including the first likelihood information including the likelihood 448 (0.92) determined to be the maximum likelihood and the first detection frame is calculated. The information 444 is output as position information 452 including the second detection frame and second likelihood information including the likelihood 454 (0.92) in output step S437.

Note that the higher the likelihood threshold "T", the higher the reliability of detected information, but on the other hand, there may be cases where detection is not possible, so it should be set appropriately according to the performance of the object detection model 300. is desirable.

Note that if the mutual IOU value threshold "U" is set low, when there are multiple detected objects, the detection results of multiple detected objects will be merged more than expected, especially for objects that are close to each other. Leaks are more likely to occur. On the other hand, if the value is set high, duplicate detection results may remain even though the same object is detected. Therefore, it is desirable to set appropriately according to the performance of the object detection model 300.

Note that the individual identification means 410 may perform individual identification using a combination of steps other than the flowchart shown in FIG. 5A. For example, the class identification information in the first likelihood information 302 may be used to limit the objects for which mutual IOU values are calculated in the mutual IOU value calculation step S434 to the same class, or the maximum When determining the likelihood, processing for determining the maximum likelihood within the same class may be added.

By having the model post-processing means 400 and the individual identification means 410 as shown in FIGS. 5A and 5B, it is possible to eliminate abnormal data and to obtain position information 401 including a second detection frame and second likelihood information for each detected object. 402 can be corrected to appropriate information.

In addition, the model post-processing means 400 shown in FIG. If there is positional information 621 including a frame and class identification information 622 that is the correct answer, the positional information 621 that includes the detected frame that is the correct answer according to the contents of various processing parameters 510 is generated using an affine transformation function, a projective transformation function, an arithmetic processor, etc. For each one or more detected objects output from the object detection model 300, the position information 301 includes zero or more first detection frames including undetectable and false detection. For the likelihood information 302 of Correction is made to position information 401 including the second detection frame with the maximum likelihood and second likelihood information 402 for each detected object using an arbitrary threshold value U (arbitrary decimal number) for the IOU value, which is an index showing how much overlap. It may also be characterized by having an individual identification means 410.

The annotation means 620 may create supervised data by adding class identification information and a groundtruth BBox, which is a correct answer frame, to the image stored in the display and data storage means 120, for example.

For example, using the flowchart of FIG. 6A and the model input image 470 in which a person is close to two names as a detection object as shown in FIG. 6B, the position information 621 including the correct detection frame and the correct class identification An example of the processing of the individual identification means 410 of the model post-processing means 400 when the information 622 exists will be described.

First, as shown in FIG. 6A, in input step S430, position information 301 including zero or a plurality of first detection frames including undetectable and false detection for each detection object and first likelihood information 302 are input. . Further, in input step S460, position information 621 including a detection frame that is the correct answer for each detected object and class identification information 622 that is the correct answer are input. At that time, as shown by the dotted line in FIG. 6B,

position information

471, 472, 473, and 474 including the four first detection frames output from the object detection model 300, and four first likelihood information It is assumed that

likelihoods

475, 476, 477, and 478 are input. Further, as shown by the solid line in FIG. 6B,

position information

480 and 481 including two correct detection frames outputted from the annotation means 620, and

class identification information

482 and 483 indicating "person" as two correct answers. shall be input.

Next, in setting step S431, the IOU threshold "U" and the likelihood threshold "T" are set. This example shows a case where "U"=0.5 and "T"=0.5 are set as the threshold values.

Next, in the comparison step S432, the likelihood in the first likelihood information 302 is compared with the threshold value "T", and if the likelihood is determined to be false because the likelihood is less than the threshold value "T", in the deletion step S433, the corresponding The position information 301 including the first detection frame and the first likelihood information 302 are deleted from the calculation target, and if the likelihood is equal to or higher than the threshold value “T”, it is determined to be true, and the IOU value with the correct frame is calculated. In step S461, processing is performed to calculate the IOU value of the combination of the position information 301 including all the first detection frames to be calculated for each piece of position information 621 including the correct detection frame. In FIG. 6B, since the likelihood 476 is 0.33 and is less than the threshold "T" = 0.5, the likelihood 476 is combined with the position information 472 including the first detection frame that is pseudo-detected to include two people. The first likelihood information included is deleted from the calculation target. There are three calculation candidates remaining, and the IOU values of the

position information

471, 473, and 474 including the first detection frame are calculated for each of the

position information

480 and 481 including the correct detection frame.

Next, in comparison step S462, all IOU values are compared with the threshold value "U", and if the IOU value for the position information 621 including the correct detection frame is less than the threshold value "U" and determined to be false, It is determined that the position information 301 and the first likelihood information 302 including the corresponding first detection frame are deleted from the calculation target in a deletion step S433, and the IOU value is set to the threshold “U”. ``If it is, it is determined to be true, and it is regarded as a detection target candidate with a small difference from the correct answer frame, and the process proceeds to the next class identification determination step S463. In FIG. 6B, candidates that are determined to be false are not applicable, and the three calculation candidates become the determination targets of the class identification determination step S463.

Next, in class identification determination step S463, the class identification information 622 that is the correct answer and the class identification information in the first likelihood information 302 that is the correct answer are compared, and if they are identified as different classes, it is determined to be false. Then, in a deletion step S433, the position information 301 including the corresponding first detection frame and the first likelihood information 302 are deleted from the calculation target, and if they are identified as the same class, it is determined to be true. The process then proceeds to the next maximum likelihood determination step S436. In FIG. 6B, assuming that all the candidates are determined to be "human" as a result of class identification, the three calculation candidates are directly subjected to the determination in the maximum likelihood determination step S436.

Finally, in the maximum likelihood determination step S436, the information other than the one with the maximum likelihood is determined to be false, and in the deletion step S433, the position information 301 including the corresponding first detection frame and the first detection frame are determined to be false. The likelihood information 302 of is deleted from the calculation target, and the one with the maximum likelihood is determined to be true, and in output step S464, the position information 401 including the second detection frame and the second detection frame are deleted. The likelihood information 402 and the calculated IOU value may be output. In FIG. 6B, as the detection results of the position information 481 including the correct detection frame and the correct class identification information 483, the maximum likelihood is calculated from the two likelihoods 477 (0.75) and 478 (0.92). As a result of the degree determination, the first likelihood information including the likelihood 477 (0.75) and the position information 473 including the first detection frame are deleted from the calculation target, and the likelihood determined to be the maximum likelihood is 478 (0.92) and position information 474 including the first detection frame are combined with position information 491 including the second detection frame and second likelihood information 493 (0.92). It is output as likelihood information in output step S464. Further, an IOU value of 495 (0.85) is output in output step S464. Furthermore, as the detection results of the position information 480 including the correct detection frame and the correct class identification information 482, first likelihood information including the likelihood 475 (0.85) determined to be the maximum likelihood and the first likelihood information including the likelihood 475 (0.85) determined to be the maximum likelihood Position information 471 including one detection frame is output as position information 490 including a second detection frame and second likelihood information including likelihood 492 (0.85) in output step S464. Further, the IOU value 494 (0.73) is outputted in output step S464.

Note that even if the threshold value "U" of the IOU value with the correct answer frame is set lower than that of the individual identification means 410 described with reference to FIGS. 5A and 5B, and more calculation candidates are left, the detection results in a correct answer. Since direct comparison can be made with the position information 621 including the frame, there is an advantage that detection omissions are less likely to occur and the accuracy of the detection results is improved. Furthermore, by arbitrarily changing the threshold value “U” and processing, it is also possible to understand and verify the accuracy of the detection frame of the position information 301 including the first detection frame calculated by the object detection model 300. Become. Therefore, learning conditions necessary for improving the accuracy of the detection frame can also be extracted, so that the dictionary learning means 600, which will be described later in Embodiment 2, can be used for general purposes regarding position information including the detection frame of the model learning dictionary 320. This makes it possible to more accurately enhance performance and robustness.

By having the model post-processing means 400 and the individual identification means 410 as shown in FIGS. 6A and 6B, it is possible to eliminate abnormal data and to obtain position information 401 including a second detection frame and second likelihood information for each detected object. 402 can be corrected to appropriate information.

A series of means for generating position information 401 including the second detection frame and second likelihood information 402 using the image processing means 100, model pre-processing means 200, object detection model 300, and model post-processing means 400 is as follows: This was a conventional first performance indexing device 30 shown in FIG. 17 for analyzing the robustness and reinforcement policy of a model learning dictionary for a model that detects the position of an object in an image and identifies its class.

As an example, regarding the problem of the conventional first performance indexing device 30, a one-stage DNN model is typified by a one-stage DNN model that is said to have a high processing speed because it simultaneously detects the position of an object and identifies its class. The case where the YOLO model 360 is applied will be explained using FIGS. 7A and 7B. As shown in FIG. 7A, when detecting one person in the model input image 201 processed by the model preprocessing means 200 into Xi pixels in the horizontal direction and Yi pixels in the vertical direction, the camera An image in which the image is shifted by 2 pixels in the horizontal direction due to time-series changes with respect to image 201, which was acquired at a horizontal reference position at a certain reference time, due to camera shake or vibration in the time-series when acquiring images, etc. 202 and an image 203 in which the image has been shifted by 4 pixels in the horizontal direction are input to the YOLO model 360 (object detection model 300), and as a result of correction by the model post-processing means 400, a second detection frame is created. When the included

position information

207, 208, and 209 and the

likelihoods

214, 215, and 216 in the second likelihood information are calculated, the position of the person in the image is calculated even though the same person is detected. Although there is only a slight fluctuation and shift in the horizontal direction, the respective likelihoods may vary greatly, such as 0.92, 0.39, and 0.89.

On the other hand, as shown in FIG. 7B, the distance between the camera and the person is 1 m in image 204, 2 m in image 205, and 3 m in image 206, as a result of the change in the size of the person and the position in the image. When the

positional information

211, 212, and 213 including the detection frame and the

likelihoods

217, 218, and 219 in the second likelihood information are calculated, when considering the performance of the original YOLO model, the person size is small. It is known that the detection accuracy and performance deteriorate as the distance between the person and the person increases. The degree 217 is 0.92, and the likelihood 219 in the second likelihood information of the image 206 with the detected object distance of 3 m is 0.71, whereas the second likelihood information of the image 205 with the detected object distance of 2 m is Irregular results may be obtained in which the likelihood 218 is significantly reduced to 0.45.

As a means of understanding these irregular phenomena and analyzing their causes, according to an embodiment of the present invention, as shown in FIG. When processing a plurality of model input images 210, as various processing parameters 510, in S (arbitrary decimal) pixel steps, N (arbitrary integer) times in the horizontal direction and M (arbitrary integer) in the vertical direction. It may also include a position shift function 220 that uses the position shifts to generate a total of N×M position shifted model input images 221 to 224. Further, it may be provided with a function of cutting out an arbitrary area. Note that the position shift function 220 may be a function realized by executing the affine transformation function 291 or the projective transformation function 292 in the image processing processor 290.

Further, according to an embodiment, when processing the plurality of model input images 210 to be input to the object detection model, the model preprocessing means 200 further sets L (arbitrary integer) types of arbitrary processing parameters 510 to the object detection model. The resizing function 230 generates an enlarged or reduced image using a magnification of Even if it includes a position shift function 220 that uses M (any integer) position shifts in the vertical direction to generate a total of N×M×L resized and position-shifted model input images 210. good. Further, it may be provided with a function of cutting out an arbitrary area. Note that the position shift function 220 and the resizing function 230 may be realized by executing the affine transformation function 291 and the projective transformation function 292 in the image processing processor 290.

As an example, FIG. 9 shows a case where three types (L=3) of resized images are generated: a reference size image 232, a 30% reduced image 231, and a 30% enlarged image 233. For each of the

images

231, 232, and 233, as shown in FIG. It may also be something that processes.

A plurality of model input images 210 processed by the position shift function 220 and resizing function 230 of the model pre-processing means 200 as shown in FIGS. 8 and 9 are combined with the object detection model 300 shown in FIG. 1 and the model post-processing means 400, the position information 401 including the second detection frame and the second likelihood information 402 are calculated for each of the plurality of model input images 210, and then the versatility and the object detection model 300 are calculated based on various processing parameters 510. It is input to a robustness verification means 500 that verifies robustness.

In the case of a model that detects objects in images acquired by a camera, etc., the items and various variable conditions to be verified by the robustness verification means 500 include, for example, the background (scenery), camera lens specifications, the height at which the camera is mounted, etc. Detection target area and field of view including image size such as elevation/depression angle, dewarping processing method when using a fisheye lens, illuminance changes depending on sunlight and lighting, special conditions such as blackout, overexposure, backlighting, clear weather, etc. Weather conditions include cloudy weather, rain, snow, and fog. In addition, the position (left, right, top, bottom, and depth) of the target detection object in the image, size, brightness level, shape and characteristics including color information, aspect ratio, rotation angle, number of target detection objects, mutual overlap status, and attachments are also included. These include the type, size, attached position, whether or not the lens has IR cut, the moving speed of the target detection object, and the moving speed of the camera itself. Furthermore, depending on the purpose of use, items and conditions other than those described above may be added. Various processing parameters 510 are set based on these various conditions and items. Alternatively, various processing parameters 510 are selected or determined. Various processing parameters 510 are input to model pre-processing means 200 and model post-processing means 400. Various processing parameters 510 input to the model preprocessing means 200 include parameters related to the position shift function 220 for verifying the influence of fluctuations accompanying the object position, camera lens specifications, the height and elevation/depression angle at which the camera is mounted, etc. It is a combination of parameters related to the resizing function 230 to verify the versatility and robustness of the detection target area including the image size and the object size of the field of view, such as the conditions for , and other multiple parameters described below. It's okay to have one.

According to one embodiment, the model post-processing means 400 individually links various processing parameters 510 used in processing the plurality of images by the model pre-processing means 200 and the output results of the individual identification means 410 for each detected object. The detected detection result 403 (including position information 401 including the second detection frame, second likelihood information 402, etc.) may be output to the robustness verification means 500.

According to an embodiment, the robustness verification means 500 is based on the likelihood in the position information 401 including the second detection frame and the second likelihood information 402, which are the output results of the model post-processing means 400. In addition, for each of the various processing parameters 510, a likelihood distribution 540 indicating the variation due to the position shift of each detected object, an average likelihood 501 which is the average value of the effective area of the likelihood, a histogram 550 of the likelihood, and a likelihood The standard deviation of likelihood 502 which is the standard deviation of the valid region of likelihood, the maximum likelihood 503 which is the maximum value of the valid region of likelihood, the minimum likelihood 504 which is the minimum value of the valid region of likelihood, and the likelihood It may be characterized by comprising a probability statistical calculation means 520 that calculates any or all of the IOU values 505 for the IOU values 505.

According to an embodiment, the robustness verification means 500 uses the output result of the model post-processing means 400 when there is position information 621 including a correct detection frame and correct class identification information 622 for each detected object. Calculated from the IOU value of the position information 401 including the second detection frame, the position information 621 including the correct detection frame, the class identification information in the second likelihood information 402, and the correct class identification information 622. Based on the class identification accuracy rate, the IOU distribution and class identification accuracy rate distribution showing the variation due to the position shift of each detected object with respect to the IOU value and class identification accuracy rate, IOU value and class identification are calculated for each of various processing parameters 510. The average IOU value which is the average value of the effective area of the correct answer rate, the average class identification correct answer rate, the histogram of the IOU value, the histogram of the class identification correct answer rate, and the standard deviation of the effective area of the IOU value and the class identification correct answer rate. The standard deviation of the IOU value and the standard deviation of the class identification accuracy rate, the maximum IOU value and the maximum class identification accuracy rate which are the maximum values of the valid area of the IOU value and the class identification accuracy rate, and the effective area of the IOU value and the class identification accuracy rate. The present invention may be characterized in that it includes a probability statistical calculation means 520 that calculates either or both of the minimum IOU value, which is the minimum value of , and the minimum class identification correct answer rate.

According to an embodiment, the robustness verification means 500 further extracts a position or region in which the likelihood distribution 540 for each detected object is equal to or less than an arbitrary threshold value for each of the various processing parameters 510, and extracts a position or region where the average likelihood 501 is an arbitrary value. extraction of detected objects whose standard deviation 502 of likelihood is equal to or greater than an arbitrary threshold; extraction of detected objects whose maximum likelihood 503 is equal to or less than an arbitrary threshold; The present invention is characterized by having learning reinforcement necessary item extraction means 530 that includes one or both of the following: extracting a detected object whose IOU value 504 is equal to or less than an arbitrary threshold value; and extracting a detected object whose IOU value 505 is equal to or less than an arbitrary threshold value. It may be something that you do.

According to an embodiment, the robustness verification means 500 further extracts a position or region that is equal to or less than an arbitrary threshold value in the IOU distribution for each detected object, and extracts an arbitrary value in the class identification accuracy rate distribution for each of the various processing parameters 510. Extraction of a position or area where the average IOU value is below a threshold value, extraction of detected objects whose average IOU value is below an arbitrary threshold value, extraction of detected objects whose average class identification accuracy rate is below an arbitrary threshold value, and standard deviation of the IOU value. extraction of detected objects for which the standard deviation of the class classification accuracy rate is greater than or equal to an arbitrary threshold; extraction of detected objects for which the maximum IOU value is less than or equal to an arbitrary threshold; Extraction of detected objects whose class identification accuracy rate is below an arbitrary threshold, extraction of detected objects whose minimum IOU value is below an arbitrary threshold, and extraction of detected objects whose minimum class identification accuracy rate is below an arbitrary threshold. It may be characterized by having learning reinforcement necessary item extraction means 530 that includes any or all of them.

According to an embodiment, the probability statistical calculation means 520 of the robustness verification means 500 and the learning reinforcement necessary item extraction means 530 perform probability statistical calculations based on the likelihood, IOU value, and class identification correct answer rate. Furthermore, it may be characterized by having a function of excluding from the calculation target images in which pixels related to the target detection object are missing at an arbitrary rate.

It becomes possible to specify the reinforcement targets of the model learning dictionary 320 based on the detected objects and judgment conditions extracted using the learning reinforcement necessary item extraction means 530. Furthermore, it is also possible to extract problems for the object detection model 300. Furthermore, the versatility and robustness of the model learning dictionary 320 can be improved by inputting this extracted information 531 into a dictionary learning means 600, which will be described later in the second embodiment, and reflecting it in the selection of learning materials, the augmentation method, and the learning parameters. It becomes possible to strengthen it.

These image processing means 100, a model preprocessing means 200 having a position shift function 220, a resizing function 230, etc., and a plurality of model input images 210 processed by the model preprocessing means 200 according to various processing parameters 510, An object detection model 300 and a model post-processing means 400 that calculate position information 401 including a second detection frame and second likelihood information 402 for a plurality of model input images 210; The versatility and robustness of the object detection model 300 is input by inputting a detection result 403 in which various processing parameters 510, position information 401 including the second detection frame, and second likelihood information 402 are linked individually for each detected object. A series of means including a robustness verification means 500 for verifying the performance of the object detection model 300 and the robustness and learning reinforcement policy of the model learning dictionary 320 for detecting the position and class identification of objects in images. 1 is an embodiment of a performance indexing device 10 in object detection of the present invention for analysis. The performance indexing device for object detection of the present invention may further include a dictionary learning means 600 of Embodiment 2 described later for generating a model learning dictionary 320, and a second mAP calculation means 650. good.

As an example, using the performance indexing device 10 in object detection according to the first embodiment of the present invention shown in FIG. FIG. 10 and FIG. 11 show the results of analyzing the irregular variation phenomenon in the likelihood, etc., which is the detection result by the object detection model 300, with respect to , and the size of the detected object.

The analysis results shown in FIGS. 10 and 11 are obtained by setting Xi, which is the number of pixels in the horizontal direction, of the plurality of model input images shown in FIGS. 7A, 7B, 8, and 9 to 128, and pixels in the vertical direction. This is a case where the number Yi is set to 128. Furthermore, the detection target is one person. In addition, as shown in FIG. 9, the resizing function 230 of the model preprocessing means 200 is used to resize three types (three types L) of a standard size image 232, a 30% reduced image 231, and a 30% enlarged image 233. For each of the three processed resized images, the position shift function 220 of the model preprocessing means 200 is used to move 32 times ( N=32), the position is shifted 32 times (M=32) in the vertical direction, and a total of 3×32×32 model input images 210 are generated. The analysis results shown in FIGS. 10 and 11 are based on the YOLO model 360 (object detection model) shown in FIGS. 3A and 3B, which has 128 input pixels in the horizontal direction and 128 pixels in the vertical direction. 300) and the model post-processing means 400, the position information 401 including the second detection frame for one person and the second After calculating the likelihood information 402 of a likelihood distribution 540 indicating dispersion, an average likelihood 501 that is the average value of the valid region of likelihood, a histogram 550 of likelihood, and a standard deviation of likelihood 502 that is the standard deviation of the valid region of likelihood; This is the result of calculating the maximum likelihood 503, which is the maximum value of the valid region of likelihood, and the minimum likelihood 504, which is the minimum value of the valid region of likelihood. Note that the likelihood distribution 540, average likelihood 501, likelihood histogram 550, likelihood standard deviation 502, maximum likelihood 503, and minimum likelihood 504 are based on the maximum likelihood value “1”. It may also be expressed as a percentage (%) of "100%". In FIG. 10 and FIG. 11, the likelihood is expressed in percentage (%). Note that it is also possible to directly process the value as a decimal number without converting it to a percentage. Although not shown in this example, it is possible to refer not only to the distribution for the likelihood in the second likelihood information 402 but also to the position information 621 including the correct detection frame from the annotation means 620 in FIG. In this case, the IOU distribution and statistical results for the IOU value 505 with the position information 401 including the second detection frame may be calculated. Although not shown in this example, if multiple classes other than people are to be identified, not only the distribution of the likelihood in the second likelihood information 402 but also the distribution of the likelihood in the second likelihood information 402 The class identification distribution and statistical results for the class identification information in may be calculated.

As shown in FIG. 9, the likelihood distribution 541 shown in FIG. 10 is obtained by using the model input image 232 as a reference image and reducing it by 30% (L=1) using the resizing function 230 under the instructions of various processing parameters 511. After processing the model input image 231 shown in FIG. 9, as shown in FIG. A plurality of position-shifted model input images (M=32) are input to the robustness verification means 500, which includes a YOLO model 360 (object detection model 300), a model post-processing means 400, and a probability statistical calculation means 520. It is what was done. At this time, various processing parameters 511 to 513 shown in FIG. It is linked to the information 402, and may be used when the probability statistical calculation means 520 calculates analysis results for each of various processing parameters.

Similarly, the likelihood distribution 542 is calculated using the various processing parameters 512 for the model input image 232 of the standard size (equal size) (L=2) shown in FIG. 543 is calculated using various processing parameters 513 for the model input image 233 enlarged by 30% (L=3) shown in FIG.

Likelihood distributions

541, 542, and 543 shown in FIG. 10 correspond to the level of likelihood (%) for fluctuations in the position (in pixels) of a person on the screen according to the gray scale bar 521 from white to black. It is displayed in shades of white (corresponding to 0% likelihood) to black (corresponding to 100% likelihood). Here, the likelihood (A) 522, the likelihood (B) 523, and the likelihood (C) of the likelihood distribution 542 in FIG. ) 524 and likelihood (D) 525 are based on the model input image (A) 221 processed by the position shift function 220 (in the case of S=1, N=32, M=32) shown in FIG. 8, respectively. The calculated likelihood, the likelihood calculated based on the model input image (B) 222, the likelihood calculated based on the model input image (C) 223, and the likelihood calculated based on the model input image (D) 224. This corresponds to a mapping of the likelihood calculated in . The

likelihood distributions

541, 542, and 543 indicate that the stronger the black level, the higher the likelihood, and conversely, the stronger the white level, the lower the likelihood. It is worth noting that in each likelihood distribution, between black levels with high likelihoods, there are areas in a specific grid-like pattern where gray or white levels are strong and the likelihood is low. I can confirm that there is. This result can be considered to be due to a phenomenon in which the likelihood varies irregularly due to fluctuations due to the position of the detected object (one person in this example) in the screen as explained in FIGS. 7A and 7B. . In addition, if there is a specific grid-like pattern as shown in this example, there is a high possibility that there is a problem with the object detection model 300 itself, and the likelihood of a given region is low. In this case, it can be considered that there is a high possibility that the learning of the model learning dictionary 320 is insufficient. Detailed factor estimation regarding the fluctuation of the likelihood due to the position of the detected object will be mentioned in conjunction with the explanation of FIG. 11.

Note that the aforementioned parameters S, N, and M, which are the various processing parameters 510 of the position shift function 220, may be changed depending on the use and purpose. Note that S, which is the pixel step setting, may be set to different values in the horizontal direction and the vertical direction. Setting S to a small value has the advantage of allowing detailed verification, but has the disadvantage of increasing calculation processing time. The processing parameters for position shifting N times in the horizontal direction and M times in the vertical direction are preferably set to appropriate values that allow verification of positional fluctuations, depending on the structure of the object detection model 300.

Next, the likelihood histogram 551 shown in FIG. 11 is obtained by normalizing the frequency of the likelihood (%) calculated by the probability statistical calculation means 520 for the likelihood distribution 541 shown in FIG. .0). Further, the statistical result 561 displays the average likelihood (%), standard deviation of the likelihood (%), maximum likelihood (%), and minimum likelihood (%) for the likelihood distribution 541. Furthermore, the likelihood 571 of the conventional method corresponds to the likelihood calculated by the conventional first performance indexing device 30 described above, which is a pinpoint value of the model input image 231 serving as a reference image for position shift shown in FIG. The calculated likelihood is displayed. Similarly, likelihood histograms 552 and 553, statistical results 562 and 563, and likelihoods 572 and 573 of the conventional method shown in FIG. 11 correspond to

likelihood distributions

542 and 543 shown in FIG. 10, respectively. be.

The average likelihood (%) in the

statistical results

561, 562, and 563 is an index for verifying the average detection accuracy and detection performance with respect to fluctuations depending on the position in the screen, and the higher the value, the more the model learning dictionary 320 is included. It can be considered that the object detection model 300 has high performance. Further, the standard deviation (%) of the likelihood is an index indicating the dispersion of the likelihood with respect to fluctuations depending on the position in the screen, and it can be considered that the smaller the standard deviation (%), the higher the stability of the object detection model 300 including the model learning dictionary 320. . On the other hand, if the standard deviation (%) of the likelihood is large, either there is a problem with the object detection model 300 itself, or the learning of the model learning dictionary 320 for the detected object position on the screen is insufficient. is possible. Furthermore, by checking the

likelihood distributions

541, 542, and 543 explained in FIG. 10, it is possible to verify which factor is stronger. Furthermore, by verifying the maximum likelihood (%) and the minimum likelihood (%), it is also possible to determine whether the dispersion of the likelihood is close to a normal distribution. It can be considered that the higher the maximum likelihood (%) and the minimum likelihood (%), the higher the performance of the object detection model 300 including the model learning dictionary 320. On the other hand, if it becomes extremely low, either there is a problem with the object detection model 300 itself, or the learning of the model learning dictionary 320 for the detected object position on the screen is insufficient.

Note that this example shows the case where the detection target is one person, but if there are multiple detection targets or there are multiple objects of classes other than people, the likelihood is calculated for each detection target. It may be possible to calculate a degree distribution and its statistical results, an IOU distribution and its statistical results, a class identification distribution and its statistical results.

Using FIGS. 10 and 11 showing the verification results calculated by the performance indexing device 10 in object detection according to the first embodiment of the present invention, the model input is resized into three types in which one person exists as the detected object. Detailed detection accuracy and detection performance of the YOLO model 360 (object detection model 300) shown in FIGS. 3A and 3B, which consists of 128 input pixels in the horizontal direction and 128 pixels in the vertical direction in the image 210. An example of the verification method used when performing verification, issue analysis, and factor analysis is shown below.

An example of the verification method described in this example is the implementation of an electronic circuit as a means of operating the YOLO model 360 in order to miniaturize, save power, and reduce costs of cameras for detecting objects. The image size input to YOLO Model 360 may not be the original recommended size due to limitations in area or power consumption, limitations in memory capacity, or limitations in the performance of arithmetic processors such as the DSP (digital signal processor) installed. This is a verification result assuming a case where the input image size of the YOLO model 360 has to be made smaller than that of the currently used YOLO model 360, and does not always occur with the various recommended variations of the YOLO model 360.

As mentioned above, in the

likelihood distributions

541, 542, and 543 shown in FIG. 10, between the black levels with high likelihoods, the gray or white levels have a strong likelihood in a particular grid-like pattern. It can be confirmed that there is a region where the value is low. Therefore, as explained in FIG. 7A, even when the same object is detected, if the position of the detected object in the image fluctuates, a phenomenon occurs in which the likelihood, which is one of the detection results, varies greatly. Conceivable. Here, the specific grid pattern seen in the

likelihood distributions

541 and 542 is characterized by a pattern of about 8 pixels square, and the specific grid pattern seen in the likelihood distribution 543 is characterized by a pattern of about 16 pixels square. It is characterized by a pattern of One of the reasons why the characteristics of these patterns are different is that depending on the size of the detected object, it is either detected by the second detection layer 362 of the YOLO model 360 shown in FIG. 3A or detected by the third detection layer 363. The difference is thought to be due to the nature of the situation. It is considered that the

likelihood distributions

541 and 542 on the side where the size of the person to be detected is smaller are mainly output from the detection results of the third detection layer 363. Furthermore, it is considered that the likelihood distribution 540 on the side where the person size is large is mainly the result of the detection of the second detection layer 362. On the other hand, the reason why the likelihood decreases in a specific grid pattern is that the grid cell (7x7 example is used in FIGS. 3A and 3B) in step 314 of calculating the conditional class probability shown in FIG. (shown). When using a one-stage DNN model such as YOLO, which is said to have a high processing speed, the region can be set to any size in order to detect the position of the object and identify the class (classification) at the same time. In order to calculate the conditional class probability Pr(Classi|Object) 315 by dividing it into grid cells of Since both are multiplied when calculating the score) 317, the position of the person in the image is determined by the Confidence score 317 (corresponding to the likelihood) on the boundary depending on the grid cell structure of the conditional class probability. This is thought to be linked to the phenomenon in which the likelihood of a particular grid pattern decreases in response to fluctuations in the lattice pattern. If there is a specific grid-like pattern as shown in the verification results of this example, there is a high possibility that there is a problem with the object detection model 300 itself or the algorithm. It becomes more likely that potential problems of 300 itself can be identified and solutions can be formulated. Furthermore, it is possible to accurately determine whether the model learning dictionary 320 has incomplete versatility or robustness with respect to various variation conditions.

Furthermore, a more detailed verification will be performed using FIG. Considering the original characteristics of YOLO, it is known that detection accuracy and performance deteriorate as the size of a person becomes smaller or as the distance between the person increases. However, improvements have been reported in the new version of YOLO. It is assumed that the YOLO model 360 used in this example employs an unimproved version. First, when comparing the numerical values of likelihoods (%) 571, 572, and 573 of the conventional method, which are the likelihoods calculated by the first conventional performance indexing device 30, as the person size increases, 70. It changes to 12%, 49.27%, and 94.45%. Here, the likelihood (%) 572 of the conventional method, which is the standard size, is 49.27%, which is much lower than the likelihood (%) 571 of the conventional method when the size is reduced by 30%. For this reason, simply checking this result may lead to an erroneous conclusion that the learning of the model learning dictionary 320 for a person of the reference size is insufficient, and unnecessary additional learning may be performed. On the other hand, the likelihood (%) 571 of the conventional method when reduced by 30% is 70.12%, which is considered a passing score, and the fact that additional learning is not performed in the first place reduces the versatility of the model learning dictionary 320. It is also conceivable that the robustness enhancement will be insufficient.

On the other hand, the results calculated by the performance indexing device 10 in object detection according to the first embodiment of the present invention will be verified. The likelihood histograms 551, 552, and 553 indicate at what level the likelihoods of the

likelihood distributions

541, 542, and 543 in FIG. 10 exist. It can be considered that the performance is better when the occurrence frequency is concentrated on the right end where the likelihood is high. Also, it can be considered that the less variation there is, the more stable it is. As far as I checked the likelihood histograms 551, 552, and 553, unlike the likelihoods (%) 571, 572, and 573 of the conventional method, the likelihoods (%) are distributed in descending order of person size. I understand that. When checking the

statistical results

561, 562, and 563, which are the results of statistical analysis of the

likelihood distributions

541, 542, and 543 and the likelihood histograms 551, 552, and 553 shown in FIG. Unlike the results confirmed with the likelihood (%) of 571, 572, and 573, the average likelihood (%) becomes closer to the original ideal as the person size increases, 60.85% < 71. It can be seen that the rates increase in the order of 82% < 89.98%. Therefore, the results of likelihood (%) 571, 572, and 573 of the conventional method have the problem that the detection results are blurred due to fluctuations in the position of the person in the image depending on the specific grid pattern. was confirmed.

Furthermore, by utilizing the learning reinforcement necessary item extraction means 530 provided in the robustness verification means 500, the development goal of the model learning dictionary 320 was to increase the average likelihood (%) to 70% or more, for example. In this case, by setting the average likelihood (%) threshold to 70%, in the case of a size reduced by 30%, the likelihood (%) result of the conventional method seemed to be achieved by chance. However, in reality, it is less than the threshold, which is more than 9% short, so it becomes possible to find out that reinforcement through additional learning is necessary for the person who has been reduced by 30%. For example, if you set the threshold for the standard deviation (%) of the likelihood to 10% and check the standard deviation (%) of the likelihood that is 10% or more, you will find that the standard deviation (%) that exceeds 10% The person and the person reduced by 30% are extracted as targets. As a result, it can be confirmed that it is necessary to strengthen the model learning dictionary 320 with respect to fluctuations in the position on the screen for objects corresponding to a person of standard size and a person reduced by 30%. Furthermore, by referring to other verification results such as

likelihood distributions

541 and 542 and

histograms

561 and 562, it is possible that a potential decrease in likelihood depending on the above-mentioned DNN configuration and algorithm has occurred. This makes it possible to notice the need for improvement or reinforcement due to the nature of the problem. Similarly, maximum likelihood (%) and minimum likelihood (%) can be used as materials for various judgments. For example, if the minimum likelihood threshold is set to 30%, the standard size that will be 30% or less Regarding a person and a person reduced by 30%, if the object position stops at that position, there is a risk that it may become undetectable, so it is also possible to extract these potential issues and problems in advance. . Furthermore, it is possible to link this to the formulation of methods for improving the object detection model 300, model preprocessing means 200, and model postprocessing means 400.

As an example, in FIG. 12, a model input image 526 of 128 pixels in the horizontal direction and 128 pixels in the vertical direction in which a person is located far away (at the top of the screen) is used as a reference image, and the position of the model preprocessing means 200 is shifted. For a total of 64 x 64 model input images 210 whose positions have been shifted by 1 pixel step (S = 1) by the function 220 by 64 times (N = 64) in the horizontal direction and 64 times (M = 64) in the vertical direction. The results of calculating the likelihood distribution 544 by the robustness verification means 500 including the probability statistical calculation means 520 and the learning reinforcement necessary item extraction means 530 are shown. The likelihood distribution 544 changes from white (equivalent to 0% likelihood) according to the level of likelihood (%) for fluctuations in the position (in pixels) of the person on the screen, according to the gray scale bar 521 from white to black. It is displayed in shades of black (likelihood 100%). By expanding the verification range of fluctuations due to the position on the screen for one person, the upper side of the likelihood distribution 544, that is, the area 527 surrounded by the dotted line, is an area where the white level is stronger and the likelihood is lower than other areas. I understand that there is something. In this example, an area 527 surrounded by a dotted line can be considered to indicate a case where a person exists in an area 528 surrounded by a dotted line that extends to the lower right side of the center of the person in the model input image 526. The above-mentioned specific grid pattern can also be observed, but the area 527 surrounded by the dotted line has a particularly low likelihood, so the learning reinforcement necessary item extraction means 530 concentrates it and reduces the likelihood. If it is a region, it can be extracted. Therefore, if the person in the model input image is located in the area 528 surrounded by the dotted line, it can be confirmed that the object detection ability is low, and it can be realized that the model learning dictionary 320 needs to be strengthened. . Therefore, the model learning dictionary 320 can be efficiently strengthened by the dictionary learning means 600 of Embodiment 2, which will be described later. Leads to reinforcement.

Note that the verification method in this example shows the case where the detection target is one person, but if there are multiple detection targets or if there are multiple objects of classes other than people, the verification method for each detection target is Then, a model learning dictionary is created based on the detected objects and judgment conditions extracted using the learning reinforcement necessary item extraction means 530 for the likelihood distribution and its statistical results, the IOU distribution and its statistical results, and the class identification distribution and its statistical results. 320 reinforcement targets may be specified. Furthermore, problems for the object detection model 300 may be extracted. Furthermore, the versatility and robustness of the model learning dictionary 320 may be enhanced by a dictionary learning means 600, which will be described later, with reference to the extracted information 531.

Although this example describes a case in which specific restrictions are applied to YOLO, the object detection model 300 may be applied to a DNN model such as the same one-stage SDD. Furthermore, the present invention may be applied to a two-stage DNN model such as EfficientDet, which processes object position detection and class identification in two stages. Furthermore, it may be applied to object detection models and machine learning models that do not use neural networks.

Performance indexing device 10 in object detection of the present invention using the image processing means 100, model pre-processing means 200, object detection model 300, model post-processing means 400, and robustness verification means 500 described in Embodiment 1 so far As a result, the following usefulness and effects can be expected.

According to an embodiment, the object detection model is created by checking the likelihood distribution 540 for the position of each detected object with respect to the plurality of model input images 210 processed by the position shift function 220 of the model preprocessing means 200. This makes it possible to extract features whose likelihoods fluctuate due to fluctuations in the position of the detected object on the screen due to the potential problems that the neural network itself including the DNN model in the object detection model has. This makes it possible to accurately identify issues related to accuracy and performance during inference. Furthermore, since it is possible to effectively formulate methods and methods for solving problems, it is possible to improve the detection accuracy and detection performance of the object detection model.

According to an embodiment, the robustness verification means 500 further calculates a likelihood distribution 540 indicating the dispersion due to the position shift of each detected object, an average likelihood 501 that is the average value of the valid area of the likelihood, and a likelihood a histogram 550, a standard deviation of likelihood 502 which is the standard deviation of the valid area of likelihood, a maximum likelihood 503 which is the maximum value of the valid area of likelihood, and a minimum value which is the minimum value of the valid area of likelihood. By providing a probability statistical calculation means 520 that calculates either or both of the likelihood 504 and the IOU value 505 for the likelihood, the detected object position on the screen can be This makes it possible to extract features whose likelihoods fluctuate due to fluctuations in It becomes possible to extract Furthermore, since methods and methods for solving problems can be formulated more effectively, detection accuracy and detection performance of the object detection model 300 can be further improved. Furthermore, when combined with various machining parameters 510 other than position shift, the DNN model can be used to eliminate weaknesses and enhancement policies in versatility and robustness against various fluctuation conditions caused by the model learning dictionary 320 created by deep learning etc. This makes it possible to accurately understand problems that may exist within the neural network itself. Therefore, it is possible to apply learning image data and supervised data that are effective through deep learning, etc., thereby making it possible to enhance the versatility and robustness of the model learning dictionary 320.

According to an embodiment, if there is position information 621 including a detection frame that is a correct answer and class identification information 622 that is a correct answer, the robustness verification means 500 further detects variations due to position shifts for each detected object. IOU distribution and class identification accuracy rate distribution showing the average IOU value, average class identification accuracy rate, histogram of IOU value, histogram of class identification accuracy rate, standard deviation of IOU value, standard of class identification accuracy rate The object detection model has Due to potential problems, it becomes possible to extract features in which position information including the detection frame and class identification information fluctuate due to fluctuations in the position of the detected object on the screen. Therefore, it is possible to more accurately extract problems related to accuracy and performance during inference that the neural network itself including the DNN model in the object detection model has latently. Furthermore, since it is possible to more effectively formulate methods and methods for solving problems, it is possible to further improve the detection accuracy and detection performance of the object detection model. Furthermore, when combined with various machining parameters 510 other than position shift, the DNN model can be used to eliminate weaknesses and enhancement policies in versatility and robustness against various fluctuation conditions caused by the model learning dictionary 320 created by deep learning etc. This makes it possible to accurately understand problems that may exist within the neural network itself. Therefore, it is possible to apply learning image data and supervised data that are more effective in deep learning, etc., thereby making it possible to enhance the versatility and robustness of the model learning dictionary.

According to an embodiment, the model preprocessing means 200 further generates an enlarged or reduced image using L (arbitrary integer) types of arbitrary magnification as various processing parameters 510, and then performs the above-mentioned position shift. By generating the image, the robustness verification means 500 including the probability statistical calculation means 520 calculates the likelihood distribution 540 for the position of each detected object for each L size, the average likelihood 501 of the valid area of the likelihood, and the likelihood distribution 540 for the position of each detected object for each L size. It becomes possible to check the histogram 550, the standard deviation 502 of the likelihood, the maximum likelihood 503, the minimum likelihood 504, and the IOU value 505. Furthermore, check the distribution, histogram, standard deviation, maximum value, and minimum value of IOU values for the position of each detected object for each of the L types of sizes, and the distribution, histogram, standard deviation, maximum value, and minimum value of the class identification accuracy rate. becomes possible. Therefore, it is possible to improve the DNN model with respect to the detected object size and to enhance the versatility and robustness of the model learning dictionary 320.

According to an embodiment, the model post-processing means 400 further includes an individual identification means 410, thereby eliminating abnormal data and correcting position information and likelihood information including a detection frame for each detected object to suitable information. Therefore, the likelihood distribution 540 for the position of each detected object, the average likelihood 501 of the effective area of the likelihood, the histogram 550 of the likelihood, the standard deviation of the likelihood 502, the maximum likelihood 503, the minimum likelihood 504, It becomes possible to calculate the IOU value 505 more accurately. Furthermore, it is possible to more accurately check the distribution, histogram, standard deviation, maximum value, and minimum value of IOU values for the position of each detected object, and the distribution, histogram, standard deviation, maximum value, and minimum value of the class identification accuracy rate. It becomes possible. Therefore, it is possible to improve the DNN model and enhance the versatility and robustness of the model learning dictionary 320 more accurately.

According to an embodiment, if there is position information 621 and class identification information 622 including the correct detection frame for each detected object, the model post-processing means uses the individual identification means 410 to eliminate abnormal data. Since the position information including the detection frame and the likelihood information can be corrected to the optimal information for each detected object, the likelihood distribution 540 for the position of each detected object, the average likelihood 501 of the effective area of the likelihood, and the likelihood It becomes possible to accurately calculate the degree histogram 550, the standard deviation 502 of the likelihood, the maximum likelihood 503, the minimum likelihood 504, and the IOU value 505 by comparing them with the correct data. Furthermore, the distribution, histogram, standard deviation, maximum value, and minimum value of the IOU value for each detected object position and the distribution, histogram, standard deviation, maximum value, and minimum value of the class identification accuracy rate are compared with the correct data for accuracy. It is possible to check. Therefore, it is possible to improve the DNN model and enhance the versatility and robustness of the model learning dictionary 320 more accurately.

According to an embodiment, the model post-processing means 400 further associates each output result with the various machining parameters 510 for each detected object and outputs the results to the robustness verification means. Separately, due to potential problems with object detection models, it becomes possible to extract features whose likelihoods fluctuate due to fluctuations in the position of detected objects on the screen. Therefore, it is possible to more accurately extract problems related to accuracy and performance during inference that the neural network itself including the DNN model in the object detection model has latently.

According to an embodiment, the robustness verification means 500 further extracts a position or region that is equal to or less than an arbitrary threshold value in the likelihood distribution 540 for each detected object, and extracts the average likelihood 501 for each of the various processing parameters 510. extraction of detected objects for which the standard deviation 502 of the likelihood is equal to or greater than an arbitrary threshold; extraction of detected objects for which the maximum likelihood 503 is equal to or less than an arbitrary threshold; It has a learning reinforcement necessary item extraction means 530 that includes either or all of the following: extracting a detected object whose likelihood 504 is equal to or less than an arbitrary threshold; and extracting a detected object whose IOU value 505 is equal to or less than an arbitrary threshold. By separating the weak points and reinforcement policies of the generality and robustness against various fluctuation conditions caused by the model learning dictionary 320 created by deep learning etc. from the potential problems of the neural network itself including the DNN model, and further It becomes possible to understand accurately. Therefore, it is possible to apply learning image data and supervised data that are effective through deep learning, etc., thereby making it possible to enhance the versatility and robustness of the model learning dictionary 320.

According to an embodiment, the robustness verification means 500 further extracts a position or region where the IOU distribution is equal to or less than an arbitrary threshold value in the IOU distribution for each detected object, and determines the class identification accuracy rate for each detected object, for each of the various processing parameters 510. Extraction of a position or region in the distribution where the value is below an arbitrary threshold value, extraction of detected objects whose average IOU value is below an arbitrary threshold value, extraction of detected objects whose average class identification accuracy rate is below an arbitrary threshold value, and IOU Extraction of detected objects whose standard deviation of values is greater than or equal to an arbitrary threshold, extraction of detected objects whose standard deviation of class classification accuracy rate is greater than or equal to an arbitrary threshold, and extraction of detected objects whose maximum IOU value is less than or equal to an arbitrary threshold. Extraction, Extraction of detected objects whose maximum IOU value is below an arbitrary threshold, Extraction of detected objects whose minimum IOU value is below an arbitrary threshold, Detection whose minimum class identification accuracy rate is below an arbitrary threshold The model learning dictionary 320 is created by deep learning or the like based on the position information including the detection frame and the class identification information by having the learning enhancement necessary item extraction means 530 that includes any or all of the object extraction functions. It becomes possible to more accurately understand the weaknesses in generality and robustness against various fluctuation conditions caused by this, as well as the strengthening policy, by separating them from the potential problems of the neural network itself, including the DNN model. Therefore, it is possible to apply learning image data and supervised data that are effective through deep learning, etc., thereby making it possible to enhance the versatility and robustness of the model learning dictionary 320.

According to an embodiment, the probability statistical calculation means 520 of the robustness verification means 500 and the learning reinforcement necessary item extraction means 530 further perform probability statistical calculation based on the likelihood, IOU value, and class identification correct answer rate. In this case, by providing a function that excludes images that are missing pixels related to the target detected object at an arbitrary rate from the calculation target, the object in the image that serves as the reference for verification is provided. Even if the effective range of the object to be detected is missing depending on the position of the object and the position of the object after processing various processing parameters 510 of the model preprocessing means 200, the performance and characteristics of the object detection model 300 are accurate. It becomes possible to verify the versatility and robustness of the learning dictionary 320. Therefore, it is possible to improve the DNN model with respect to the detected object size and to enhance the versatility and robustness of the model learning dictionary 320.

Furthermore, some examples will be described regarding variations in the processing by the model preprocessing means 200 and the verification method by the robustness verification means 505 using various processing parameters 510.

According to an embodiment, when processing the plurality of model input images 210 input to the object detection model 300, the model preprocessing means 200 further includes P (arbitrary integer) types of contrast correction as various processing parameters 510. It may be characterized by generating an image with the brightness level changed to an arbitrary value using a curve or a gradation conversion curve. Furthermore, after changing the gradation, the image is shifted horizontally by N (any integer) times and vertically by M (any integer) times in S (any decimal) pixel steps. The image forming apparatus may include a position shift function 220 that generates a total of N×M×P tone-converted and position-shifted model input images 210. Further, it may be provided with a function of cutting out an arbitrary area. Note that when changing the brightness level using a contrast correction curve or a gradation conversion curve, the function may be implemented by being executed by the image processing processor 290.

As an example, FIG. 13 shows a standard brightness level image 262 obtained by applying a gradation conversion curve 265 (P=2) that maintains the state to a standard brightness level image taken on a normal sunny day. and brightness processed as a result of applying a gradation conversion curve 264 (P=1) that simulates rainy or cloudy weather conditions with low illuminance, dawn, dusk, and night time periods, and crushed blacks. A low brightness level image 261 with a lower level and a gradation conversion curve 266 (P=3) that simulates clear weather conditions with high illuminance, backlighting, overexposure, and a shooting studio illuminated with strong light. A case is shown in which three types of gradation-converted images, such as a high-luminance-level image 263 with a higher luminance level processed as a result of application, are generated. For each of the

images

261, 262, and 263, as shown in FIG. 8, N×M position-shifted images are generated in S pixel steps, and a total of 3×N×M multiple model input images 210 are generated. It may also be something that processes.

A plurality of model input images 210 processed by the position shift function 220 and tone conversion function 260 of the model preprocessing means 200 as shown in FIGS. 8 and 13 are combined with the object detection model 300 shown in FIG. After the processing means 400 calculates the position information 401 including the second detection frame and the second likelihood information 402 for each of the plurality of model input images 210, the general purpose of the object detection model 300 is calculated based on the various processing parameters 510. 10 and 11 for P (arbitrary integer) types of contrast correction curves or gradation conversion curves that are various processing parameters 510. The probability statistical calculation means 520 as described above generates a likelihood distribution 540 indicating the variation due to the position shift of one person, an average likelihood 501 which is the average value of the effective area of the likelihood, and a likelihood histogram 550. , the standard deviation of likelihood 502 which is the standard deviation of the valid region of likelihood, the maximum likelihood 503 which is the maximum value of the valid region of likelihood, and the minimum likelihood 504 which is the minimum value of the valid region of likelihood. It may also be something that calculates.

Furthermore, if the position information 621 including the correct detection frame and the correct class identification information 622 exist, the IOU value 505 may be calculated.

Furthermore, the distribution, histogram, standard deviation, maximum value, minimum value of the IOU value for each detected object position for each P (arbitrary integer) type of contrast correction curve or gradation conversion curve, and the distribution of the class identification accuracy rate. , histogram, standard deviation, maximum value, and minimum value may be calculated.

Furthermore, the learning reinforcement necessary item extraction means 530 of the robustness verification means 500 described above may be provided.

Since the model preprocessing means 200 is equipped with a gradation conversion function 260, it is possible to improve the DNN model for the brightness levels of the detected object and background that change depending on weather conditions, shooting time, and illuminance conditions of the shooting environment, and to use a general-purpose model learning dictionary. This makes it possible to enhance performance and robustness.

According to an embodiment, when processing the plurality of model input images 210 input to the object detection model 300, the model preprocessing means 200 further uses Q (arbitrary integer) types of aspect ratios as various processing parameters 510. may be used to generate an image with a changed aspect ratio. After changing the aspect ratio, use a position shift of N (any integer) horizontally and M (any integer) vertically in S (any decimal) pixel steps. , a position shifting function 220 that generates a model input image 210 with a total of N×M×Q aspect ratio changes and position shifts. Further, it may be provided with a function of cutting out an arbitrary area. Note that when changing the aspect ratio using Q types of aspect ratios, the function may be realized by executing the affine transformation function 291 or the projective transformation function 292 in the image processing processor 290.

As an example, in FIG. 14, one person in the reference model input image 252 (Q=2) is reduced by 30% in the vertical direction so that it has an aspect ratio that simulates a child of a certain age or a plump person. A case in which images are generated with three different aspect ratios: a model input image 251 (Q = 1) that has been reduced, and a model input image 252 that has been reduced by 30% in the horizontal direction to have an aspect ratio that simulates a slender person. show. For each of the

images

251, 252, and 253, as shown in FIG. 8, N×M position-shifted images are generated in S pixel steps, and a total of 3×N×M multiple model input images 210 are generated. It may also be something that processes.

A plurality of model input images 210 processed by the position shift function 220 and aspect ratio change function 250 of the model preprocessing means 200 as shown in FIGS. 8 and 14 are combined with the object detection model 300 shown in FIG. After the processing means 400 calculates the position information 401 including the second detection frame and the second likelihood information 402 for each of the plurality of model input images 210, the general purpose of the object detection model 300 is calculated based on the various processing parameters 510. The input data is input to the robustness verification means 500 that verifies the stability and robustness, and the probability statistical calculations as explained in FIGS. The means 520 generates a likelihood distribution 540 showing the variation due to the position shift of one person, an average likelihood 501 which is the average value of the effective area of the likelihood, a histogram 550 of the likelihood, and an effective area of the likelihood. Even if it calculates the standard deviation of likelihood 502 which is the standard deviation, the maximum likelihood 503 which is the maximum value of the valid area of likelihood, and the minimum likelihood 504 which is the minimum value of the valid area of likelihood. good.

Furthermore, the distribution, histogram, standard deviation, maximum value, and minimum value of IOU values for the position of each detected object for each aspect ratio of Q (arbitrary integer) types, and the distribution, histogram, standard deviation, and maximum value of the class identification accuracy rate. , the minimum value may be calculated.

By equipping the model preprocessing means 200 with the aspect ratio changing function 250, it is possible to improve the DNN model for various aspect ratios (ratios) of the detected object and to enhance the versatility and robustness of the model learning dictionary. Become.

According to an embodiment, the model preprocessing means 200 further processes R (arbitrary integer) types of angles as various processing parameters 510 when processing the plurality of model input images 210 input to the object detection model 300. It may also be characterized in that it is used to generate an image with a changed rotation angle. After changing the rotation angle, the image is moved in S (any decimal) pixel steps using a position shift of N (any integer) horizontally and M (any integer) vertically. , a total of N×M×R rotation angle changes and a position shift function 220 that generates a position-shifted model input image 210 may be provided. Further, it may be provided with a function of cutting out an arbitrary area. Note that when changing the rotation angle using R types of angles, the function may be realized by executing the affine transformation function 291 or the projective transformation function 292 in the image processing processor 290.

As an example, in FIG. 15, a model input image 242 (R=2) serving as a reference angle for one person is rotated 45 degrees to the left to simulate differences in the mounting position of the camera, etc., and the pose of the person. A case is shown in which images are generated with three different rotation angles: an image 241 (R=1) and a model input image 243 rotated by 45° to the right to simulate the difference in the mounting position of a camera or the pose of a person. For each of the

images

241, 242, and 243, as shown in FIG. 8, N×M position-shifted images are generated in S pixel steps, and a total of 3×N×M multiple model input images 210 are generated. It may also be something that processes.

A plurality of model input images 210 processed by the position shift function 220 and rotation function 240 of the model preprocessing means 200 as shown in FIGS. 8 and 15 are combined with the object detection model 300 shown in FIG. 400, the position information 401 including the second detection frame and the second likelihood information 402 are calculated for each of the plurality of model input images 210, and then the versatility and the object detection model 300 are calculated based on various processing parameters 510. The robustness is input to the robustness verification means 500 for verifying robustness, and the probability statistical calculation means 520 as explained in FIGS. , a likelihood distribution 540 showing the variation due to the position shift of one person, an average likelihood 501 which is the average value of the effective area of the likelihood, a histogram 550 of the likelihood, and a standard deviation of the effective area of the likelihood. It may be possible to calculate a standard deviation 502 of a certain likelihood, a maximum likelihood 503 which is the maximum value of the valid region of likelihood, and a minimum likelihood 504 which is the minimum value of the valid region of likelihood.

Furthermore, the distribution, histogram, standard deviation, maximum value, and minimum value of IOU values for the position of each detected object for each R (arbitrary integer) type of angle, and the distribution, histogram, standard deviation, maximum value, and class identification accuracy rate, The minimum value may be calculated.

By equipping the model preprocessing means 200 with the rotation function 240, it becomes possible to improve the DNN model for various rotation angles of the detected object and to enhance the versatility and robustness of the model learning dictionary.

According to an embodiment, the model preprocessing means 200 processes the plurality of model input images 210 inputted to the object detection model 300 by processing 281 to 281 in FIGS. 8, 9, 14, and 15. Calculate the average brightness level of the valid image in the blank space where no valid image exists due to the position shift process, resize process, aspect ratio change process, or rotation process shown in 288, and make the average brightness level uniform. It may also include a margin padding function 280 for pasting to generate an image. Alternatively, the blank space may be interpolated using the effective image area existing in the output image of the image processing means 100. Furthermore, the blank space may be filled with images that do not affect learning or inference.

Since the model preprocessing means 200 is equipped with the margin padding function 280, the influence of features including margins on the inference accuracy of the object detection model 300 can be reduced. It becomes possible to more accurately calculate the average likelihood 501 of the effective region of likelihood, the histogram 550 of likelihood, the standard deviation 502 of likelihood, the maximum likelihood 503, the minimum likelihood 504, and the IOU value 505. Furthermore, it is possible to more accurately check the distribution, histogram, standard deviation, maximum value, and minimum value of IOU values for the position of each detected object, and the distribution, histogram, standard deviation, maximum value, and minimum value of the class identification accuracy rate. It becomes possible. Therefore, it is possible to more accurately improve the DNN model and strengthen the versatility and robustness of the model learning dictionary.

According to an embodiment, the various processing parameters 510 used for processing by the model preprocessing means 200 include a resizing function 230 involving the position shift function 220 described above, a rotation function 240, an aspect ratio changing function 250, and a gradation function. It is also possible to perform a plurality of processes in which the conversion functions 260 are intertwined with each other. Further, the probability statistical calculation means 520 and the learning reinforcement necessary item extraction means 530 of the robustness verification means 500 may be used as a method for analyzing the interdependence of the plurality of various processing parameters 510. Furthermore, although the explanation was omitted in the first embodiment, various types of It becomes possible to improve the DNN model with respect to distortion of the detected object and background, and to enhance the versatility and robustness of the model learning dictionary.

Using the performance indexing device 10 for object detection according to an embodiment, the detection performance, detection accuracy, and versatility and robustness of the object detection model 300 and model learning dictionary 320, including variations and imperfections, are evaluated. Based on the results of verifying the problem, we will improve the object detection model 300 and repeatedly perform deep learning using the dictionary learning means 600 described later in Embodiment 2 to improve the object detection model 300 and solve and strengthen it. It becomes possible to realize object detection that is highly versatile and robust even under various conditions.

(Embodiment 2)
FIG. 16 is a block diagram showing a performance indexing device 20 for object detection in an image according to Embodiment 2 of the present invention. Here, the image processing means 100, the image output control means 110, the display and data storage means 120, the model pre-processing means 200, the object detection model 300, the model post-processing means 400, the robustness verification means 500, and the equipment provided therefor. Each means, each function, each process, each step, and each device, each method, each program, etc. for realizing them are the same as those in Embodiment 1, so they will be explained in the text of Embodiment 2. omitted. In addition, each means, each function, each process, each step, each apparatus, each method, each program, etc. of the other embodiments described in Embodiment 1 may be used and implemented.

Note that each means, each function, and each process described in Embodiment 2 of the present invention described later may be replaced with a step, and each device may be replaced with a method. Moreover, each means and each device described in Embodiment 2 of the present invention may be realized by a program operated by a computer.

First, an example of the dictionary learning means 600, which is deep learning for creating the model learning dictionary 320, which is one of the components of the object detection model 300, will be described.

First, learning material data that is considered appropriate for the purpose of use is extracted from the learning material database storage means 610 in which material data (image data) for deep learning is stored. The material data for learning stored in the learning material database storage means 610 may be one that utilizes a large-scale open source dataset such as COCO (Common Object in Context) or Pascal VOC Dataset. In addition, there are cases in which, for example, image processing means 100 displays a necessary image according to the purpose of use using image output control means 110, and image data stored in data storage means 120 is utilized.

Next, the annotation unit 620 adds class identification information and groundtruth BBox, which is a correct answer frame, to the learning material data extracted from the learning material database storage unit 610 to create supervised data. Here, open source datasets such as COCO and Pascal VOC Dataset may be used directly as supervised data without using the annotation means 620 if the data has already been annotated.

Next, the supervised data is augmented by an Augment means 630 as a learning image 631 to enhance versatility and robustness. Here, the Augment means 630 is, for example, a means for shifting an image to an arbitrary position in the horizontal and vertical directions, a means for enlarging or reducing an image to an arbitrary magnification, a means for rotating an image to an arbitrary angle, and a means for changing the aspect ratio. It is equipped with a means for changing the ratio and a dewarping means for performing distortion correction, cylindrical conversion, etc., and the image is padded by combining various means depending on the purpose of use.

Next, the training image 631 padded by the Augment means 630 is input to the deep learning means 640 to calculate the weighting coefficients of the DNN model 310, and the calculated weighting coefficients are converted into the ONNX format for example. A learning dictionary 320 is created. Note that the model learning dictionary 320 may be created by converting into a format other than ONNX format. Here, for example, when YOLO is applied to the DNN model 310, the deep learning means 640 is realized by an open source learning environment called darknet and an arithmetic processor (including a personal computer and a supercomputer). The darknet has learning parameters called hyperparameters, and it is possible to set appropriate hyperparameters depending on the usage and purpose, and to strengthen versatility and robustness in conjunction with the augment means 630. is also possible. By reflecting the model learning dictionary 320 created by the deep learning means 640 on the object detection model 300, it becomes possible to detect the position of the object in the image and identify the class. Note that the deep learning means 640 may be configured by an electronic circuit. Note that a learning environment configured using a programming language may be used depending on the DNN model 310 to be applied.

Next, an example of the performance indexing device 20 in object detection for analyzing the robustness and reinforcement policy of the model learning dictionary 320 of a model that detects the position of an object in an image and identifies its class will be described.

Validation material data for verifying detection accuracy, detection performance, versatility, and robustness required for the purpose of use is extracted from the aforementioned learning material database storage means 610. The image data for validation stored in the learning material database storage means 610 is obtained by utilizing a large-scale open source validation image dataset such as COCO (Common Object in Context) or Pascal VOC Dataset. Anything is fine. In addition, images for verifying the detection accuracy, detection performance, versatility, and robustness necessary for the purpose of use are displayed and sent to the data storage means 120 from the image processing means 100 using the image output control means 110, for example. Stored image data may also be used.

Next, the annotation unit 620 adds class identification information and groundtruth BBox, which is a correct answer frame, to the validation material data extracted from the learning material database storage unit 610 to create validation data 623. Here, open source datasets such as COCO and Pascal VOC Dataset may be used directly as validation data 623 without using the annotation means 620 if the data has already been annotated.

Next, the validation data 623 is transferred to a second mAP calculation means 650 equipped with a model post-processing means 400 having inference (prediction) ability equivalent to that of the object detection model 300 and the individual identification means 410 described in the first embodiment. Calculation of IOU value 653 by comparing groundtruth BBox, which is the correct answer frame, and PredictedBBox (predicted BBox) calculated as a result of inference (prediction), and calculation of all prediction results for all validation data 623. Calculation of Precision 654, which indicates the percentage of the IOU values 653 that were correctly predicted above an arbitrary threshold value, and among the actual correct results, the IOU value 653 was above an arbitrary threshold value and the BBox in a position close to the correct result was predicted. Calculation of Recall 655 indicating the ratio, AP (Average Precision) value 651 for each class as an index for comparing the accuracy and performance of object detection mentioned above, and mAP (mean Average Precision) value 652 averaged over all classes. It may also be something that calculates. (For example, see Non-Patent Document 2) Here, for example, when YOLO is applied to the DNN model 310, the second mAP calculation means 650 uses an open source inference environment called darknet and an arithmetic processor (personal computer or super It is desirable that the object detection model 300 has the same inference (prediction) performance as the object detection model 300. Further, the IOU value is 653, Precision 654, Recall 655, AP value 651, and mAP value 652 calculation means may be provided.

These, the image processing means 100, the model preprocessing means 200, the object detection model 300, the model postprocessing means 400, the robustness verification means 500, the learning material database storage means 610, the annotation means 620, and the second A series of means for generating the IOU value 653, Precision 654, Recall 655, AP value 651, and mAP value 652 by the mAP calculation means 650 of This is a performance indexing device 20 in object detection according to the present invention for analyzing a reinforcement policy.

According to an embodiment, the individual identification means 410 of the model post-processing means 400 as described in FIGS. 6A and 6B of the first embodiment is provided with the second mAP calculation means of the second embodiment, so that abnormal data can be Since the position information including the detection frame and the likelihood information can be corrected to the optimal information for each exclusion and detection object, the likelihood distribution 540 for the position of each detection object and the average likelihood 501 of the effective area of the likelihood It becomes possible to accurately calculate the likelihood histogram 550, the standard deviation 502 of the likelihood, the maximum likelihood 503, the minimum likelihood 504, and the IOU value 505 by comparing them with the correct data. Therefore, it is possible to more accurately improve the DNN model and strengthen the versatility and robustness of the model learning dictionary. Therefore, it is possible to more accurately calculate the IOU value 653, Precision 654, Recall 655, AP value 651, and mAP value 652, which are indicators of overall and average inference accuracy and performance using validation data. The accuracy of indexing the overall object detection model 300 and model learning dictionary 320 is improved.

According to an embodiment, the robustness verification means 500 extracts a position or region where the likelihood distribution for each detected object is below an arbitrary threshold value, and extracts a position or region where the average likelihood 501 is below an arbitrary threshold value, for each of the various processing parameters 510. Extraction of detected objects for which the standard deviation 502 of the likelihood is greater than or equal to an arbitrary threshold value, extraction of detected objects for which the maximum likelihood 503 is less than or equal to an arbitrary threshold value, and extraction of detected objects for which the standard deviation 502 of the likelihood is equal to or less than an arbitrary threshold value, By having the learning enhancement necessary item extraction means 530 that includes any or all of the detection objects that are below the threshold of It becomes possible to more accurately understand weak points in robustness and strengthening policies for neural networks by separating them from potential problems that the neural network itself, including the DNN model, has. Therefore, it is possible to apply learning image data and supervised data that are more effective in deep learning, etc., thereby making it possible to enhance the versatility and robustness of the model learning dictionary 320.

Furthermore, the distribution, histogram, standard deviation, maximum value, and minimum value of the IOU value for the position of each detected object, and the distribution, histogram, standard deviation, maximum value, and minimum value of the class identification accuracy rate are calculated using arbitrary thresholds. By providing the learning enhancement necessary item extraction means 530 that performs extraction, it becomes possible to further enhance the versatility and robustness of the model learning dictionary 320.

According to an embodiment, the likelihood distribution 540, average likelihood 501, likelihood histogram 550, likelihood standard deviation 502, maximum likelihood 503, minimum likelihood 504, and IOU calculated by the probability statistical calculation means 520. As a result of analysis based on the value 505, etc., if it is determined that the performance of the model learning dictionary 320 is insufficient, a learning image is prepared based on the result of the learning reinforcement required item extraction means 530, and a built-in or external dictionary is used. It may be characterized by re-learning by the learning means 600. By relearning the model learning dictionary 320, various processing parameters 510 other than position shift in an arbitrary range near the detected object (position such as left, right, top, bottom, and depth of the object in the screen, object size, contrast, gradation, aspect ratio, rotation, etc.), the neural network itself, including the DNN model, will be able to overcome weaknesses in versatility and robustness against various fluctuating conditions and enhancement policies caused by the model learning dictionary 320 created by deep learning etc. It will be possible to separate them from potential issues and accurately understand them. Therefore, it is possible to apply learning image data and supervised data that are effective through deep learning, etc., thereby making it possible to enhance the versatility and robustness of the model learning dictionary 320.

Furthermore, probability statistical calculation means 520 calculates the distribution, histogram, standard deviation, maximum value, and minimum value of IOU values for the position of each detected object, and the distribution, histogram, standard deviation, maximum value, and minimum value of the class identification accuracy rate. By providing the model learning dictionary 320, it is possible to further enhance the versatility and robustness of the model learning dictionary 320.

Using the performance indexing device 20 in object detection of a certain embodiment, the detection performance, detection accuracy, and versatility and robustness of the object detection model 300 and model learning dictionary 320, including variations and imperfections, are evaluated. Based on the results of verifying the problem, the object detection model 300 is improved, and the dictionary learning means 600 repeatedly performs deep learning to solve and strengthen the object detection model, resulting in higher detection ability and a general-purpose model that can be used under various fluctuating conditions. This makes it possible to realize object detection with high performance and robustness.

(summary)
FIG. 18 is a diagram illustrating a summary of the object detection model performance indexing device of the present invention. As shown in FIG. 18, the performance indexing device, method, and program for the object detection model of the present invention apply to image data generated by an image processing means that acquires an image including a detection target and processes it appropriately. , has a model preprocessing means that processes multiple images using various processing parameters such as size change and position shift, and inputs the multiple processed images to an object detection model equipped with a trained model learning dictionary. After the model post-processing means calculates the position information and likelihood information of the detection frame for each detected object from the inference information obtained, the robustness verification means calculates the average likelihood with respect to the object position fluctuation for each of the various processing parameters Performance indicators such as degree and standard deviation of likelihood are calculated. Furthermore, based on the results of the performance indexing, the dictionary learning means performs robust reinforcement of the model learning dictionary.

Although the performance indexing device and the like according to one or more aspects have been described above based on the embodiment, the present invention is not limited to this embodiment. Unless departing from the spirit of the present disclosure, various modifications that can be thought of by those skilled in the art to this embodiment, and forms constructed by combining components of different embodiments are also within the scope of one or more embodiments. may be included within.

For example, in the above embodiments, each component is configured with dedicated hardware, but it may also be realized by executing a software program suitable for each component. Each component may be realized by a program execution unit such as a CPU or a processor reading and executing a software program recorded on a recording medium such as a hard disk or a semiconductor memory. Here, the software that implements the performance indexing device and the like of the above embodiment is the following program.

That is, this program is a program that causes a computer to execute a performance indexing method.

The present invention is useful in the technical field of identifying the position and class of an object in an image using an object detection model. Among these, it is particularly useful in the technical field of reducing the size, power consumption, and cost of cameras and the like for detecting objects.

10, 20 Performance indexing device 30 First performance indexing device 40 Second performance indexing device 100 Image processing means 101 Lens 102 Image sensor 103, 290 Image processing processor 110 Image output control means 120 Display and data storage means 200 Model preprocessing means 201, 202, 203, 204, 205, 206, 210, 221, 222, 223, 224, 231, 232, 233, 241, 242, 243, 251, 252, 253, 261, 262, 263, 311, 440, 470, 526 Model input images 207, 208, 209, 211, 212, 213, 401, 451, 452, 490, 491 Position information including second detection frame 214, 215, 216, 217, 218, 219, 453, 454, 492, 493 Likelihood in second likelihood information 220 Position shift function 230 Resize function 240 Rotation function 250 Aspect ratio change function 260 Tone conversion function 264, 265, 266 Tone conversion curve 270 Dewarp function 280 Margin padding function 281, 282, 283, 284, 285, 286, 287, 288 Margin portion 291 Affine transformation function 292 Projective transformation function 293 Distortion correction table 300 Object detection model 301, 441, 442, 443, 444, 471 , 472, 473, 474 Position information including the first detection frame 302 First likelihood information 310 DNN model 312 Step of estimating multiple Bounding BBoxes and reliability 313 Confidence
314 Step of calculating conditional class probability 315 Conditional class probability 316 Final detection step 317 Confidence score
318 Detection frame of position information including the first detection frame 320 Model learning dictionary 330 Artificial neuron model 340 Neural network 350 Activation function 351 Sigmoid function 352 ReLU
353 Leaky ReLU
360 YOLO model 361 First detection layer 362 Second detection layer 363 Third detection layer 364, 365

Upsampling layer

366, 367

Skip connection

370, 371, 372, 373, 374, 375, 376, 377, 378, 379 , 380, 381, 382, 383, 384, 385, 386, 387

Convolution layer

390, 391, 392, 393, 394, 395 Pooling layer 400 Model post-processing means 402 Second likelihood information 403 Detection result 410 Individual identification means 420, 427, 494, 495, 505, 653 IOU value 422 Area of Union
423 Area of Intersection
424 People 425 ground truth BBox
426 Predicted BBox
445, 446, 447, 448, 475, 476, 477, 478 Likelihood in the

first likelihood information

480, 481, 621 Position information including the detection frame that is the

correct answer

482, 483, 622 Class identification that is the correct answer Information 500 Robustness verification means 501 Average likelihood 502 Standard deviation of likelihood 503 Maximum likelihood 504

Minimum likelihood

510, 511, 512, 513 Various processing parameters 520 Probability statistical calculation means 521 White to black shading bars 522, 523, 524, 525

Likelihood

527, 528 Region 530 Learning reinforcement necessary item extraction means 531

Extraction information

540, 541, 542, 543, 544 Likelihood distribution 550, 551, 552, 553

Likelihood histogram

561, 562, 563 Statistical result 571 , 572,573 Likelihood of conventional method 600 Dictionary learning means 610 Learning material database storage means 620 Annotation means 623 Validation data 630 Augment means 631 Learning image 640 Deep learning means 650 Second mAP calculation means 651 AP (Average Precision ) value 652 mAP (mean average precision) value 654 Precision
655 Recall
660 First mAP calculation means S430, S460 Input step S431 Setting step S432, S435, S462 Comparison step S433 Deletion step S434 Mutual IOU value calculation step S436 Maximum likelihood determination step S437, S464 Output step S461 IOU value calculation with correct answer frame Step S463 Class identification determination step

Claims

A performance indexing device for an object detection model, comprising:
an image processing means for acquiring and appropriately processing images;
model preprocessing means for processing the image acquired by the image processing means into a plurality of images according to various processing parameters;
an object detection model including a model learning dictionary that infers object positions and likelihoods based on the input of the plurality of images processed by the model preprocessing means;
Based on the inference result of the object detection model, position information and first likelihood information including the first detection frame are determined for each detected object in the plurality of images, and a second detection frame having an appropriate value is determined. model post-processing means for correcting the included position information and second likelihood information;
Robustness verification that verifies the robustness of the object detection model based on the position information including the second detection frame that is the output result of the model post-processing means, the second likelihood information, and the various processing parameters. comprising means;
Performance indexing device.
When processing the plurality of images to be input to the object detection model, the model preprocessing means processes N times (N is N in the horizontal direction) in S pixel steps (S is an arbitrary decimal number) as the various processing parameters. , an arbitrary integer), and generate a total of N×M position shifted images using M position shifts in the vertical direction (M is an arbitrary integer).
The performance indexing device according to claim 1.
When processing the plurality of images to be input to the object detection model, the model preprocessing means enlarges or reduces the images using L types of arbitrary magnifications (L is an arbitrary integer) as the various processing parameters. After generating an image, the image is processed in S pixel steps (S is an arbitrary decimal number) N times in the horizontal direction (N is an arbitrary integer) and M times in the vertical direction (M is an arbitrary integer). generate a total of N×M×L position-shifted images using the position shift of
The performance indexing device according to claim 1.
When processing the plurality of images to be input to the object detection model, the model preprocessing means may process P types (P is any integer) of contrast correction curves or gradation conversion curves as the various processing parameters. to generate an image with the brightness level changed to an arbitrary value, using
The performance indexing device according to claim 1.
The model preprocessing means, when processing the plurality of images to be input to the object detection model, uses Q types (Q is an arbitrary integer) of aspect ratios as the various processing parameters to determine the aspect ratio. generate a modified image,
The performance indexing device according to claim 1.
The model preprocessing means changes the rotation angle using R types of angles (R is any integer) as the various processing parameters when processing the plurality of images input to the object detection model. generate an image,
The performance indexing device according to claim 1.
The model preprocessing means, when processing the plurality of images to be input to the object detection model, pastes the average brightness level of the valid images into a blank area where no valid images are generated due to the processing, and processes the images. generate,
The performance indexing device according to claim 1.
The model post-processing means detects zero or a plurality of detected objects, including undetectable and pseudo-detected objects, for each of one or more detected objects of the output results of the object detection model existing in one of the plurality of images. For the position information including the first detection frame and the first likelihood information, an arbitrary threshold T (T is an arbitrary decimal number) for the first likelihood information and a mutual For each detected object, the maximum likelihood of comprising individual identification means for correcting position information including a second detection frame and the second likelihood information;
The performance indexing device according to claim 1.
The model post-processing means includes:
If there is position information and class identification information including a correct detection frame for each detection object, the present invention has a function of correcting the position information including the correct detection frame according to the contents of the various processing parameters. ,
Each of one or more detected objects of the output results of the object detection model existing in one image of the plurality of images includes zero or more first detection frames including undetectable and false detection. For the position information and the first likelihood information, an arbitrary threshold T (T is an arbitrary decimal number) for the first likelihood information, the position information including the correct detection frame, and the first likelihood information. An arbitrary threshold value U (U is an arbitrary decimal number) for the IOU (Intersection over Union) value, which is an index showing how much the region of position information including one detection frame overlaps, is used to determine the maximum value for each detected object. comprising individual identification means for correcting position information including the second likelihood detection frame and the second likelihood information;
The performance indexing device according to claim 1.
The model post-processing means individually links the various processing parameters used in the processing of the plurality of images by the model pre-processing means and the output results of the individual identification means for each detected object, and Output to robustness verification means,
The performance indexing device according to claim 8.
The model preprocessing means performs at least one of (i) and (ii),
In (i), the model preprocessing means, when processing the plurality of images to be input to the object detection model, uses S pixel steps (S is an arbitrary decimal number) in the horizontal direction as the various processing parameters. Generate a total of N×M position shifted images by using position shifts N times (N is any integer) in the vertical direction and M times (M is any integer) in the vertical direction,
In the above (ii), the model preprocessing means processes L types (L is any integer) of arbitrary magnifications as the various processing parameters when processing the plurality of images to be input to the written object detection model. is used to generate an enlarged or reduced image, and then the image is scaled in S pixel steps (S is any decimal number) by N horizontal steps (N is any integer) and M steps vertically (M is an arbitrary integer) to generate a total of N×M×L position-shifted images,
The robustness verification means determines the various processing parameters based on the position information including the second detection frame, which is an output result of the model post-processing means, and the likelihood in the second likelihood information. Separately, a likelihood distribution indicating the variation accompanying the position shift for each detection object, an average likelihood that is the average value of the effective area of the likelihood, a histogram of the likelihood, and an effective area of the likelihood. The standard deviation of the likelihood, which is the standard deviation, the maximum likelihood, which is the maximum value of the valid area of the likelihood, the minimum likelihood, which is the minimum value of the valid area of the likelihood, and the IOU value for the likelihood. comprising probability statistical calculation means for calculating at least one of
The performance indexing device according to any one of claims 2 to 10.
The model preprocessing means performs at least one of (i) and (ii),
In (i), the model preprocessing means, when processing the plurality of images to be input to the object detection model, uses S pixel steps (S is an arbitrary decimal number) in the horizontal direction as the various processing parameters. Generate a total of N×M position shifted images by using position shifts N times (N is any integer) in the vertical direction and M times (M is any integer) in the vertical direction,
In the above (ii), the model preprocessing means processes L types (L is any integer) of arbitrary magnifications as the various processing parameters when processing the plurality of images to be input to the written object detection model. is used to generate an enlarged or reduced image, and then the image is scaled in S pixel steps (S is any decimal number) by N horizontal steps (N is any integer) and M steps vertically (M is an arbitrary integer) to generate a total of N×M×L position-shifted images,
If there is position information including a correct detection frame and correct class identification information for each detected object, the robustness verification means detects the second detection which is the output result of the model post-processing means. A class identification accuracy rate calculated from the IOU value of the position information including the frame and the position information including the correct detection frame, the class identification information in the second likelihood information, and the correct class identification information. Based on the above, for each of the various processing parameters, an IOU distribution and a class identification accuracy rate distribution showing the variation due to the position shift for each detected object with respect to the IOU value and the class identification accuracy rate, and the IOU value and the class identification accuracy rate distribution. The average IOU value, which is the average value of the effective area with the class identification correct answer rate, the average class identification correct answer rate, a histogram of the IOU value, a histogram of the class identification correct answer rate, and the IOU value and the class identification correct answer rate. The standard deviation of the IOU value, which is the standard deviation of the effective area, and the standard deviation of the class identification correct answer rate, and the maximum IOU value and the maximum class identification correct answer rate, which are the maximum values of the effective area of the IOU value and the class identification correct answer rate. and a probability statistical calculation means for calculating at least one of the minimum IOU value and the minimum class identification accuracy rate, which are the minimum values in the effective area of the IOU value and the class identification accuracy rate,
The performance indexing device according to any one of claims 2 to 10.
The robustness verification means extracts, for each of the various processing parameters, a position or region where the likelihood distribution for each of the detected objects is below an arbitrary threshold, and the detected object where the average likelihood is below an arbitrary threshold. extraction of the detected object for which the standard deviation of the likelihood is greater than or equal to an arbitrary threshold; extraction of the detected object for which the maximum likelihood is less than or equal to an arbitrary threshold; comprising a learning reinforcement necessary item extraction means that performs at least one of the following: extracting the detected object whose IOU value is equal to or less than an arbitrary threshold;
The performance indexing device according to claim 11.
The robustness verification means extracts, for each of the various processing parameters, a position or region that is equal to or less than an arbitrary threshold value in the IOU distribution for each detected object, and a position or region that is equal to or less than an arbitrary threshold value in the class identification accuracy rate distribution. extraction of a region, extraction of the detected object for which the average IOU value is below an arbitrary threshold, extraction of the detected object for which the average class classification accuracy rate is below an arbitrary threshold, and the standard deviation of the IOU value is extraction of the detected object whose standard deviation of the class classification accuracy rate is greater than or equal to an arbitrary threshold; and extraction of the detected object whose maximum IOU value is equal to or less than an arbitrary threshold. extraction of the detected object for which the maximum class identification accuracy rate is below an arbitrary threshold; extraction of the detected object for which the minimum IOU value is below an arbitrary threshold; comprising a learning reinforcement necessary item extraction means that performs at least one of the following: extracting the detected object whose value is equal to or less than a threshold;
The performance indexing device according to claim 12.
The probability statistical calculation means of the robustness verification means and the learning reinforcement necessary item extraction means perform a probability statistical calculation based on the likelihood, the IOU value, and the class identification correct answer rate. It has a function to exclude from the calculation target images in which pixels related to the detected object are missing at an arbitrary rate.
The performance indexing device according to claim 14.
As a result of analysis based on the output of the probability statistical calculation means, if it is determined that the performance of the model learning dictionary is insufficient, a learning image is prepared based on the result of the learning reinforcement necessary item extraction means and the built-in or relearning the model learning dictionary using an external dictionary learning means;
The performance indexing device according to claim 13.
The object detection model is a neural network including a model learning dictionary created by deep learning.
The performance indexing device according to claim 1.
an image processing step of acquiring and appropriately processing the image;
a model preprocessing step of processing the image obtained in the image processing step into a plurality of images according to various processing parameters;
an object detection model including a model learning dictionary that infers object positions and likelihoods with respect to the input of the plurality of images processed in the model preprocessing step;
Based on the inference result of the object detection model, for each detected object in the plurality of images, position information including the first detection frame and first likelihood information are set to include a second detection frame having appropriate values. a model post-processing step for correcting position information and second likelihood information;
a robustness verification step of verifying the robustness of the object detection model based on the position information including the second detection frame that is the output result of the model post-processing step, the second likelihood information, and the various processing parameters; including
Performance index method.
A program for causing a computer to execute the performance indexing method according to claim 18.