WO2021085258A1 - Image processing device, image processing device control method, identifier generation method, identification method, identification device, identifier generation device, and identifier - Google Patents

Image processing device, image processing device control method, identifier generation method, identification method, identification device, identifier generation device, and identifier

Info

Publication number
WO2021085258A1
WO2021085258A1 · PCT/JP2020/039496 · JP2020039496W
Authority
WO
WIPO (PCT)
Prior art keywords
data
learning
image
data set
information
Prior art date
Application number
PCT/JP2020/039496
Other languages
French (fr)
Japanese (ja)
Inventor
泰 吉正
彰大 田谷
河村 英孝
Original Assignee
キヤノン株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by キヤノン株式会社
Publication of WO2021085258A1 publication Critical patent/WO2021085258A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis

Definitions

  • The present invention relates to an image processing device, a method of controlling the image processing device, a method of generating a classifier for identifying identification target information in data, an identification method using a classifier generated by that generation method, an identification device, a classifier generation device, and a classifier.
  • Segmentation is a process that specifies, for each region, the class (classification) to which its pixels belong, and is used for diagnosis with medical images, infrastructure inspection, various particle analyses, and the like.
  • Patent Document 1 describes a technique for distinguishing whether an abnormal shadow of interest (hereinafter, the target abnormal shadow) is benign or malignant by acquiring its region and feature amounts from a medical image.
  • This technique extracts regions of interest from the medical image using a plurality of mutually different position coordinates and performs learning so that differential diagnosis can be carried out with high accuracy even when there are variations arising from the doctor's work.
  • Increasing the data used for learning to give diversity in this way is called “Data Augmentation”, and is a technique often used to improve the accuracy of inference results.
  • The image processing apparatus for solving the above problems is an image processing apparatus that acquires information of a specific region in an image based on inference. It has an information acquisition means for acquiring the information of the specific region, inferred by inputting into a trained model each piece of information of a plurality of regions of interest extracted from the image based on a predetermined inference condition. The plurality of regions of interest include a first region of interest and a second region of interest, and the first region of interest and the second region of interest each have a region that overlaps the other and a region that does not.
  • The control method of the image processing device is a control method of an image processing device that acquires information of a specific region in an image based on inference. It has an information acquisition step of acquiring the information of the specific region, inferred by inputting into a trained model each piece of information of a plurality of regions of interest extracted from the image based on a predetermined inference condition. The plurality of regions of interest include a first region of interest and a second region of interest, and the first region of interest and the second region of interest each have a region that overlaps the other and a region that does not.
  • Another invention is a method for generating a classifier for identifying identification target information in data. The method includes a first learning step of learning using a first learning data set from an initial data set including a plurality of learning data created from the data, and a second learning step of updating the information contained in the classifier by learning using the information contained in the classifier generated in the first learning step and a second learning data set from the initial data set.
  • The method for generating a classifier is characterized in that the amount of the identification target information included in the first learning data set is larger than the amount of the identification target information included in the second learning data set.
  • Yet another generation method is a method for generating a classifier for estimating identification target information in data. For a training data set group having a first training data set containing learning data composed of input data and teacher data for that input data, and a second training data set containing a larger number of the learning data than the first training data set, the method includes a generation step of generating the classifier using the training data set group.
  • The method is characterized in that the amount of the identification target information contained in the input data included in the first training data set is larger than the amount contained in the input data included in the second training data set.
  • Yet another generator is a device that generates a classifier for estimating identification target information in data. For a training data set group having a first training data set containing learning data composed of input data and teacher data for that input data, and a second training data set containing a larger number of the learning data than the first training data set, the device includes a generation means for generating the classifier using the training data set group.
  • The device is characterized in that the amount of the identification target information contained in the input data included in the first training data set is larger than the amount contained in the input data included in the second training data set.
  • With the image processing apparatus, since each piece of information of the plurality of regions of interest in the image is input to the trained model to perform inference, the inference accuracy of the information of the specific region in the image can be improved.
  • The image processing apparatus 1-100 acquires information of a specific region in an image based on inference. Specifically, it has an information acquisition means 1-50 that acquires the information of the specific region (1-520) inferred by inputting into the trained model 1-47 each piece of information of the plurality of regions of interest (1-540 to 1-542) extracted from the image (1-500) based on a predetermined inference condition.
  • the plurality of attention regions include a first attention region (for example, 1-540) and a second attention region (for example, 1-541).
  • a trained model is used to extract a specific region (1-520) in image 1-500 by inference.
  • The trained model is obtained by training with teacher data consisting of images in which the specific region is known.
  • each of the information of the plurality of areas of interest extracted from the image is input to the above-mentioned trained model.
  • the plurality of areas of interest are selected so as to have a region that overlaps with each other and a region that does not overlap with each other.
  • As a result, not only can a plurality of inference results be obtained for a certain region A in the image (an area where the regions of interest overlap one another), but inference results for the region around region A can also be obtained.
  • the image in the present embodiment is, for example, an image including an image of a first material and an image of a second material different from the first material.
  • the information in the specific region includes at least one of the position of the image of the second material in the image and the size of the image of the second material.
  • Preferably, the size of the first region of interest and the size of the second region of interest are the same, because this makes them easy to input to the trained model.
  • the information of the region of interest includes information on at least one of the position and size of the region extracted from the image.
  • the image processing device may further have a reception unit 1-41 that accepts the setting of inference conditions.
  • The reception unit may be one that receives an instruction issued by the user operating the operation unit 1-140, one that receives an automatic instruction from the image processing device, or some other form.
  • the information acquisition means 1-50 may have a model acquisition unit 1-42 for acquiring the trained model 1-47.
  • the model acquisition unit has a generation unit (not shown) that generates a trained model, and the trained model may be acquired from the generation unit or may be acquired from the data server 1-120.
  • The information acquisition means may have an extraction unit 1-43 that extracts a plurality of regions of interest from the image based on the inference conditions received by the reception unit. Further, the information acquisition means may have an information acquisition unit 1-45 that acquires a plurality of inference results by inputting each of the plurality of regions of interest extracted by the extraction unit into the trained model, and acquires the information of the specific region based on the plurality of inference results.
  • the extraction unit may extract a plurality of areas of interest using random numbers, may extract areas of interest regularly from end to end of the image, or may use both methods.
  • The inference conditions include, for example, at least one of the number of inferences performed on average for each pixel of the image, the threshold value for the ratio of the number of times a pixel is inferred to belong to the specific region to the number of times it is inferred, and the size of the region of interest.
  • When the image includes a plurality of specific regions and the areas of those regions have a distribution, the image processing apparatus according to the present embodiment is preferably used. It is particularly suitable when the ratio of the maximum value to the minimum value of the areas of the plurality of specific regions is 50 or more, and especially when the ratio is 100 or more.
  • The image processing apparatus may further have a display control unit that, based on the information of the specific region, causes the display unit to display the specific region in the image in a display mode different from that of the rest of the image. For example, as shown in FIG. 4, the specific regions 1-520 can be displayed in black and the other regions in white. A means other than changing the color may also be used to change the display mode.
  • the control method of the image processing device is the control method of the image processing device that acquires the information of the specific region in the image based on the inference.
  • The plurality of regions of interest include a first region of interest and a second region of interest, and the first region of interest and the second region of interest each have a region that overlaps the other and a region that does not.
  • The image processing apparatus performs inference processing using the trained model.
  • the user sets inference conditions, and the image processing device extracts a plurality of regions of interest from the inference image based on the inference conditions.
  • the image processing device makes inferences using a common trained model for each of the plurality of areas of interest, and calculates the final inference result based on each inference result.
  • the inference result refers to, for example, an object detection result or a segmentation result.
  • the image processing system 1-190 includes an image capturing device 1-110 for capturing an image, a data server 1-120 for storing the captured image, and an image processing device 1-100 for performing image processing. Further, it has a display unit 1-130 for displaying the acquired input image and the image processing result, and an operation unit 1-140 for inputting an instruction from the user.
  • the image processing device 1-100 acquires an input image and performs image processing on the region of interest reflected in the input image.
  • the input image is, for example, an image obtained by subjecting image data acquired by the image capturing apparatus 1-110 to image processing or the like to obtain an image suitable for analysis. Further, the input image in the present embodiment is an inference image.
  • the image processing device 1-100 is, for example, a computer, and performs image processing according to the present embodiment.
  • the image processing device 1-100 has at least a CPU 1-31, a communication IF 1-32, a ROM 1-33, a RAM 1-34, a storage unit 1-35, and a common bus 1-36.
  • the CPU 1-31 integrally controls the operation of each component of the image processing device 1-100.
  • the image processing device 1-100 may also control the operation of the image capturing device 1-110 by controlling the CPU 1-31.
  • the data server 1-120 holds an image captured by the image capturing device 1-110.
  • Communication IF (Interface) 1-32 is realized by, for example, a LAN card. Communication between the external device (for example, data server 1-120) and the image processing device 1-100 is performed by the communication IF1-32.
  • the ROM 1-33 is realized by a non-volatile memory or the like, stores a control program executed by the CPU 1-31, and provides a work area when the program is executed by the CPU 1-31.
  • RAM (Random Access Memory) 1-34 is realized by a volatile memory or the like, and temporarily stores various information.
  • the storage unit 1-35 is realized by, for example, an HDD (Hard Disk Drive) or the like. Then, the storage unit 1-35 stores various application software including an operating system (OS: Operating System), a device driver of a peripheral device, and a program for performing image processing according to the present embodiment described later.
  • the operation unit 1-140 is realized by, for example, a keyboard, a mouse, or the like, and inputs an instruction from the user into the device.
  • the display unit 1-130 is realized by, for example, a display or the like, and displays various information toward the user.
  • the operation unit 1-140 and the display unit 1-130 provide a function as a GUI (Graphical User Interface) under the control of the CPU 1-31.
  • the display unit 1-130 may be a touch panel monitor that accepts operation input, and the operation unit 1-140 may be a stylus pen.
  • Each of the above components is communicably connected to each other by common bus 1-36.
  • the imaging apparatus 1-110 is, for example, a scanning electron microscope (SEM), a transmission electron microscope (TEM: Transmission Electron Microscope), or an optical microscope.
  • the image capturing device 1-110 may also be a device having an image capturing function such as a digital camera or a smartphone.
  • the image capturing device 1-110 transmits the acquired image to the data server 1-120.
  • An imaging control unit (not shown) that controls the imaging apparatus 1-110 may be included in the image processing apparatus 1-100.
  • The main body that executes the program may be one or more CPUs, and the ROM that stores the program may also be one or more memories. Further, another processor such as a GPU (Graphics Processing Unit) may be used instead of the CPU or in combination with the CPU. That is, the functions of the respective parts shown in FIG. 2 are realized by at least one processor (hardware) executing a program stored in at least one memory communicably connected to that processor.
  • The image processing device 1-100 has, as functional configurations, a reception unit 1-41, a model acquisition unit 1-42, an extraction unit 1-43, an inference unit 1-44, an information acquisition unit 1-45, and a display control unit 1-46.
  • the image processing device 1-100 is communicably connected to the data server 1-120 and the display unit 1-130.
  • Reception unit 1-41 receives the inference condition input from the user via operation unit 1-140. That is, the operation unit 1-140 corresponds to an example of a reception means that accepts the setting of the inference condition.
  • the inference condition includes at least one of information on the number of inferences (described later), a threshold value, and a patch size.
  • the model acquisition unit 1-42 acquires the trained model 1-47 constructed in advance and the inference image from the data server 1-120.
  • the extraction unit 1-43 extracts a plurality of regions of interest from the inference image based on the inference conditions received by the reception unit 1-41. That is, it corresponds to an example of an extraction means for extracting a plurality of regions of interest from an image for inference.
  • the area of interest refers to a part cut out from the inference image.
  • the inference unit 1-44 makes inferences for each of the plurality of areas of interest using the trained model 1-47 acquired by the model acquisition unit 1-42. That is, it corresponds to an example of an inference means that makes an inference using a common trained model for each of a plurality of areas of interest.
  • the information acquisition unit 1-45 calculates the final inference result based on the inference result performed by the inference unit 1-44. That is, it corresponds to an example of a calculation means for calculating the final inference result based on a plurality of inference results.
  • the display control unit 1-46 outputs the information regarding the inference result acquired in each process to the display unit 1-130, and causes the display unit 1-130 to display the result of each process.
  • each part of the image processing device 1-100 may be realized as an independent device.
  • the image processing device 1-100 may be a workstation.
  • The functions of each part may be realized as software that operates on a computer, and the software that realizes the functions of each part may run on a server via a network such as a cloud. In the present embodiment, each part is realized by software running on a computer installed in a local environment.
  • FIG. 3 is a diagram showing a processing procedure of processing executed by the image processing apparatus 1-100 of the present embodiment.
  • This embodiment is realized by the CPU 1-31 executing a program that realizes the functions of each part stored in the ROM 1-33.
  • an example in which the image to be processed is a TEM image will be described.
  • the TEM image is acquired as a two-dimensional shading image.
  • carbon black in the coating film of the melamine / alkyd resin paint will be described as an example of the object to be processed included in the image to be processed.
  • the reception unit 1-41 receives the inference condition input by the user in the operation unit 1-140.
  • the inference condition in the present embodiment includes at least one of information regarding the number of inferences, a threshold value, and a patch size.
  • the information regarding the number of inferences is information such as the average number of inferences and the number of extractions of each pixel, which will be described later.
  • step S1-202 the model acquisition unit 1-42 acquires the trained model constructed in advance and the inference image.
  • The inference image is acquired from the data server 1-120. If the patch size is set in step S1-201, a trained model trained with the same patch size is acquired.
  • the patch size is the number of pixels in the vertical and horizontal directions of the cropped image when a part of the target image is cropped.
  • a pair of a TEM image, which is an image to be processed, and a teacher image is prepared.
  • The teacher image is an image obtained by processing the image to be processed with an appropriate image processing method. For example, it is an image binarized into a region to be detected and a region not to be detected, in which the region to be detected is filled and the region not to be detected is left unfilled.
  • the trained model 1-47 is generated by performing machine learning according to a predetermined algorithm using the image to be processed and the teacher image.
  • U-Net is used as a predetermined algorithm.
  • As the learning method for U-Net, a known technique can be used.
  • The algorithm is not limited to U-Net; SVM (Support Vector Machine), DNN (Deep Neural Network), CNN (Convolutional Neural Network), FCN (Fully Convolutional Network), SegNet, and other algorithms used for semantic segmentation that classifies classes in 1-pixel units can also be used, and GAN (Generative Adversarial Networks) may also be used.
  • step S1-203 the extraction unit 1-43 extracts a plurality of regions of interest from the inference image.
  • FIG. 4 shows an example in which the region of interest 1-540, the region of interest 1-541, and the region of interest 1-542 are extracted with respect to the position coordinates 1-530, the position coordinates 1-531, and the position coordinates 1-532.
  • the inference image in this embodiment is composed of a plurality of pixels whose positions can be specified by two-dimensional Cartesian coordinates (x, y).
  • A set of random numbers (x_i, y_i) satisfying 0 ≤ x_i < x_size and 0 ≤ y_i < y_size is generated.
  • The region of interest is set with (x_i, y_i) as its upper-left coordinate.
  • the size of the area of interest should be equal to the patch size.
  • the user sets the average number of inferences for each pixel in the operation unit 1-140.
  • the average number of inferences is the average number of extractions for each pixel when performing extraction. When extracting, it can be obtained by recording the number of times of extraction for each pixel.
  • When (x_i, y_i) is located near the edge of the image and the size of the region of interest would become smaller than the patch size, the periphery of the image may be filled with pixel values of 0, a so-called padding process, so that the size of the region of interest is adjusted to be the same as the patch size.
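  • A minimal Python/NumPy sketch of this random extraction and zero-padding, assuming a grayscale 2D image; the function name and the NumPy conventions are assumptions, not part of the original description:
```python
import numpy as np

def extract_random_patches(image, patch_size, avg_inferences, rng=None):
    """Extract regions of interest at random upper-left coordinates (x_i, y_i),
    zero-padding the bottom/right border so every patch keeps the full patch size."""
    rng = np.random.default_rng() if rng is None else rng
    y_size, x_size = image.shape
    padded = np.pad(image, ((0, patch_size), (0, patch_size)), mode="constant")
    counts = np.zeros((y_size, x_size), dtype=np.int64)  # extractions recorded per pixel
    patches, origins = [], []
    # Keep drawing random upper-left coordinates until the mean number of
    # extractions per pixel reaches the user-set average number of inferences.
    while counts.mean() < avg_inferences:
        x_i = int(rng.integers(0, x_size))
        y_i = int(rng.integers(0, y_size))
        patches.append(padded[y_i:y_i + patch_size, x_i:x_i + patch_size])
        origins.append((x_i, y_i))
        counts[y_i:y_i + patch_size, x_i:x_i + patch_size] += 1
    return patches, origins, counts
```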
  • step S1-204 the inference unit 1-44 makes an inference using the trained model 1-47 for each of the plurality of areas of interest extracted in step S1-203.
  • step S1-205 the information acquisition unit 1-45 calculates and acquires the final inference result based on the inference result in step S1-204.
  • Specifically, the number of times each pixel was inferred and the number of times it was determined to be carbon black are recorded, and a pixel for which (number of times determined to be carbon black) / (number of times inferred) is equal to or greater than the threshold value is finally determined to be carbon black.
  • The threshold value may be set by the user via the operation unit 1-140. If the inference is regression rather than classification, a further threshold is set in addition to the above threshold; results above that further threshold are first classified as carbon black, and the final judgment processing is then performed.
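  • A minimal sketch of this per-pixel aggregation, assuming binary (0/1) patch predictions from the trained model and the same NumPy conventions as the extraction sketch above:
```python
import numpy as np

def aggregate_votes(pred_patches, origins, image_shape, threshold):
    """For every pixel, record how often it was inferred and how often it was
    judged to be carbon black, then apply the ratio threshold."""
    y_size, x_size = image_shape
    inferred = np.zeros(image_shape, dtype=np.int64)
    positive = np.zeros(image_shape, dtype=np.int64)
    for pred, (x_i, y_i) in zip(pred_patches, origins):
        h = min(pred.shape[0], y_size - y_i)   # clip off the zero-padded border
        w = min(pred.shape[1], x_size - x_i)
        inferred[y_i:y_i + h, x_i:x_i + w] += 1
        positive[y_i:y_i + h, x_i:x_i + w] += pred[:h, :w]
    ratio = np.zeros(image_shape, dtype=float)
    np.divide(positive, inferred, out=ratio, where=inferred > 0)
    return ratio >= threshold   # True where the pixel is finally judged carbon black
```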
  • step S1-206 the display control unit 1-46 causes the display unit 1-130 to display the final inference result.
  • The display control unit 1-46 performs control to transmit the final inference result to the display unit 1-130 connected to the image processing device 1-100 and to display it on the display unit 1-130.
  • it is determined for each pixel whether or not it is carbon black, and the pixel determined to be carbon black is displayed with a brightness of 255, and the pixel determined to be not carbon black is displayed with a brightness of 0.
  • The inference accuracy can be evaluated with IoU (Intersection over Union), which is computed from the numbers of TP (True Positive), FP (False Positive), and FN (False Negative) pixels.
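  • A minimal sketch of the IoU computation from these counts for binary masks, assuming equation (1-1) takes the standard form IoU = TP / (TP + FP + FN):
```python
import numpy as np

def iou(pred_mask, true_mask):
    """IoU between a predicted and a ground-truth binary mask (True = carbon black)."""
    tp = np.logical_and(pred_mask, true_mask).sum()
    fp = np.logical_and(pred_mask, ~true_mask).sum()
    fn = np.logical_and(~pred_mask, true_mask).sum()
    union = tp + fp + fn
    return tp / union if union > 0 else 1.0
```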
  • The image processing apparatus 1-100 in the present embodiment can improve the inference accuracy by performing inference using a common trained model for each of a plurality of regions of interest. Further, since the user can set the threshold value, the inference accuracy can be controlled according to the purpose: lower the threshold to reduce missed detections, or raise it to reduce false positives. Inference suited to the purpose can thus be performed while using the same trained model.
  • the reception unit 1-41 receives the inference condition input by the user in the operation unit 1-140.
  • the inference condition in the present embodiment includes at least one of information regarding the number of inferences, a threshold value, and a patch size.
  • the information regarding the number of inferences is information such as the number of times the reference coordinates are set, which will be described later.
  • In step S1-203, the extraction unit 1-43 extracts a plurality of regions of interest from the inference image 1-501. FIG. 6 shows an example in which the regions of interest 1-550 to 1-558 are extracted with the reference coordinates 1-560 at (x_1, y_1) as a reference.
  • A plurality of reference coordinates (x_j, y_j) (j = 1, 2, ..., N) are set, where (x_j, y_j) is a set of random numbers satisfying 0 ≤ x_j < p_x and 0 ≤ y_j < p_y (p_x and p_y being the horizontal and vertical patch sizes).
  • The upper-left coordinates of the other regions of interest are (x_j + p_x × m, y_j + p_y × n), where m is an integer from 1 to x_size/p_x - 1 and n is an integer from 1 to y_size/p_y - 1.
  • the user sets the reference coordinate setting number of times in the operation unit 1-140.
  • The reference-coordinate setting number is the number of times the upper-left reference coordinate (x_j, y_j) is set using random numbers during extraction.
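  • A minimal sketch of this grid-style extraction from a random reference coordinate, assuming a square patch (p_x = p_y = patch_size) and zero-padding at the border:
```python
import numpy as np

def extract_grid_patches(image, patch_size, num_reference_sets, rng=None):
    """Tile the image with patches whose upper-left corners are offset from a
    randomly chosen reference coordinate (x_j, y_j) by multiples of the patch size."""
    rng = np.random.default_rng() if rng is None else rng
    y_size, x_size = image.shape
    p = patch_size
    padded = np.pad(image, ((0, p), (0, p)), mode="constant")
    patches, origins = [], []
    for _ in range(num_reference_sets):
        x_j = int(rng.integers(0, p))          # 0 <= x_j < p_x
        y_j = int(rng.integers(0, p))          # 0 <= y_j < p_y
        for y0 in range(y_j, y_size, p):
            for x0 in range(x_j, x_size, p):
                patches.append(padded[y0:y0 + p, x0:x0 + p])
                origins.append((x0, y0))
    return patches, origins
```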
  • the reception unit 1-41 receives the inference condition input by the user in the operation unit 1-140.
  • the inference condition in the present embodiment includes at least one of information regarding the number of inferences, a threshold value, and a patch size.
  • the information regarding the number of inferences is information such as the number of times the reference coordinates are set, which will be described later.
  • In step S1-203, the extraction unit 1-43 extracts a plurality of regions of interest from the inference image 1-502. FIG. 8 shows an example in which the regions of interest 1-560 to 1-568 are extracted with the reference coordinates 1-660 at (x_1, y_1) as a reference.
  • A plurality of reference coordinates (x_j, y_j) (j = 1, 2, ..., N) are set, where (x_j, y_j) is a set of random numbers satisfying 0 ≤ x_j < p_x and 0 ≤ y_j < p_y.
  • The upper-left coordinates of the other regions of interest are (x_j + p_x × m, y_j + p_y × n), where m is an integer from 1 to x_size/p_x - 1 and n is an integer from 1 to y_size/p_y - 1.
  • the user sets the reference coordinate setting number of times in the operation unit 1-140.
  • The reference-coordinate setting number is the number of times the upper-left reference coordinate (x_j, y_j) is set using random numbers during extraction.
  • mIoU is defined by equation (1-2).
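  • Presumably, mIoU here is the mean of the IoU values over the evaluation images: mIoU = (1/N) Σ_i IoU_i.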
  • the patch size was 128 ⁇ 128 and the threshold was 0.2.
  • the reception unit 1-41 receives the inference condition input by the user in the operation unit 1-140.
  • the inference condition in the present embodiment includes at least one of information regarding the number of inferences, a threshold value, and a patch size.
  • the information regarding the number of inferences is information such as pitch, which will be described later.
  • In step S1-203, the extraction unit 1-43 extracts a plurality of regions of interest from the inference image. FIG. 10 shows an example in which the regions of interest 1-570 to 1-572 are extracted with the reference coordinates 1-580 at (x_1, y_1) as a reference.
  • a plurality of areas of interest are extracted by shifting the areas of interest by the pitch vertically or horizontally.
  • The upper-left coordinates of the region of interest 1-571 and the region of interest 1-572 are (x_1 + pitch_x, y_1) and (x_1 + 2 × pitch_x, y_1), respectively.
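  • A minimal sketch of this pitch-based extraction, again assuming a square patch and zero-padding at the border; neighbouring regions overlap whenever the pitch is smaller than the patch size:
```python
import numpy as np

def extract_pitch_patches(image, patch_size, pitch_x, pitch_y):
    """Slide the region of interest by pitch_x horizontally and pitch_y vertically."""
    y_size, x_size = image.shape
    p = patch_size
    padded = np.pad(image, ((0, p), (0, p)), mode="constant")
    patches, origins = [], []
    for y0 in range(0, y_size, pitch_y):
        for x0 in range(0, x_size, pitch_x):
            patches.append(padded[y0:y0 + p, x0:x0 + p])
            origins.append((x0, y0))
    return patches, origins
```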
  • mIoU was used as in the first and second embodiments.
  • evaluation was performed using mIoU.
  • c 3.
  • the trained model (discriminator) that has been learned by the following first learning step and the second learning step can be used.
  • the first learning step is a step in which learning is performed using the first learning data set among the initial data sets including a plurality of learning data created from the data including the identification target information.
  • the second learning process is performed by learning using the information contained in the trained model generated by learning in the first learning process and the second training data set of the initial data sets.
  • the amount of identification target information included in the first learning data set is larger than the amount of identification target information included in the second learning data set.
  • Alternatively, a trained model (classifier) generated through the following padding step and generation step can be used.
  • In the padding step, for a training data set group having a first training data set containing learning data composed of input data and teacher data for that input data, and a second training data set containing a larger number of learning data than the first training data set, the learning data are inflated so that the number of learning data contained in the first training data set becomes equal to or larger than the number contained in the second training data set.
  • The generation step generates the trained model using the training data set group containing the learning data inflated in the padding step.
  • the amount of identification target information contained in the input data included in the first learning data set is larger than the amount of identification target information contained in the input data included in the second learning data set. Since the contents of the third embodiment will be described later, they will be omitted here.
  • When the first embodiment, the second embodiment, and the third embodiment are combined, the first learning step, the second learning step, the padding step, and the generation step are performed when generating the trained model of the first embodiment.
  • The image processing device and the image processing system in each of the above-described embodiments may be realized as a single device, or may take a form in which devices including a plurality of information acquisition devices are combined so as to be able to communicate with each other and execute the above-described processing; both forms are included in the embodiments of the present invention.
  • the above-mentioned processing may be executed by a common server device or a group of servers.
  • the common server device corresponds to the image processing device according to the embodiment
  • the server group corresponds to the image processing system according to the embodiment.
  • the image processing device and the plurality of devices constituting the image processing system need not be present in the same facility or in the same country as long as they can communicate at a predetermined communication rate.
  • The present invention can take an embodiment as, for example, a system, an apparatus, a method, a program, a recording medium (storage medium), or the like. Specifically, it may be applied to a system composed of a plurality of devices (for example, a host computer, an interface device, an imaging device, a Web application, etc.), or to an apparatus consisting of a single device.
  • A recording medium (or storage medium) on which a software program code (computer program) that realizes the functions of the above-described embodiments is recorded is supplied to the system or device.
  • The storage medium is a computer-readable storage medium.
  • The computer (or CPU or GPU) of the system or device reads out and executes the program code stored in the recording medium.
  • The program code itself read from the recording medium realizes the functions of the above-described embodiments, and the recording medium on which the program code is recorded constitutes the present invention.
  • << Second Embodiment >> (Background of the second embodiment)
  • Identification techniques using image processing, voice processing, text processing, and the like are known.
  • the discrimination accuracy is improved by using deep learning, but various efforts are being made to further improve the discrimination accuracy.
  • Japanese Unexamined Patent Publication No. 2019-118670 (Reference 2-1) describes a diagnostic support device that supports diagnosis of a diseased area by using deep learning. This technique makes it possible to perform highly accurate diagnosis by normalizing the color brightness of an image in advance and separating the diseased part and the non-diseased part.
  • Hereinafter, information to be identified in one piece of data is referred to as identification target information.
  • However, it was found that identification is difficult with the methods described in Document 2-1 and Document 2-2. Further, when there is a large difference in the amount of identification target information between data, it has been difficult with conventional methods to construct a classifier that can accurately identify the identification target information regardless of its amount.
  • The object of the second embodiment is to provide a method for generating a classifier that can accurately identify the identification target information even when there are multiple pieces of identification target information in one piece of data or when the identification target information is difficult to distinguish from other information.
  • Another object is to provide an identification method and an identification device that use the classifier generated by this classifier generation method.
  • The method of generating the classifier according to the present embodiment includes a first learning step of learning using a first learning data set from an initial data set that includes a plurality of learning data created from the data. It further includes a second learning step of updating the information contained in the classifier by learning using the information contained in the classifier generated in the first learning step and a second learning data set from the initial data set. At that time, the amount of identification target information included in the first learning data set is larger than the amount included in the second learning data set. In this way, the classifier is trained in two steps, starting with the data set having a large amount of identification target information. As a result, the parameters of image conversion with a large degree of conversion can be learned first and then changed gradually, so that the identification target information can be identified accurately.
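  • A minimal sketch of this two-step generation procedure; the amount_fn helper and the train_fn training routine are assumptions standing in for the actual learning algorithm (for example, U-Net training):
```python
def generate_classifier(initial_dataset, threshold, amount_fn, train_fn):
    """initial_dataset: list of (input_image, teacher_image) pairs.
    amount_fn(teacher_image): amount of identification target information (e.g. pixel count).
    train_fn(dataset, init_weights): returns classifier weights after training."""
    # First learning step: only data with a large amount of identification target information.
    first_set = [pair for pair in initial_dataset if amount_fn(pair[1]) >= threshold]
    weights = train_fn(first_set, init_weights=None)
    # Second learning step: update the information contained in the classifier
    # (its weights and biases) using the second learning data set.
    second_set = initial_dataset          # here, the whole initial data set
    weights = train_fn(second_set, init_weights=weights)
    return weights
```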
  • Data is a representation of information that is formalized for transmission, interpretation, or processing and can be reinterpreted as information. Examples of data include image data, voice data, text data, and the like.
  • the identification target information is information to be identified in the data.
  • When the data is image data, for example, at least one piece of information on the position, area, and distribution of the identification target region in the image data is the identification target information.
  • the classifier generated by the generation method according to the present embodiment can estimate and extract the identification target area in the image data, which is difficult to extract visually by the user.
  • When the data is voice data, at least one of the frequency and intensity of the identification target sound in the voice data is the identification target information.
  • the classifier generated by the generation method according to the present embodiment can estimate and extract the sound to be identified in the sound data including noise, which is difficult for the user to extract.
  • When the sound data is the voice data of a plurality of speakers, the voice data of at least one speaker can be used as the identification target information.
  • When the data is text data, at least one of the identification target characters and character strings in the text data is the identification target information.
  • the classifier generated by the generation method according to the present embodiment can estimate and extract a character string to be identified in text data, which is difficult for the user to extract.
  • the amount of identification target information contained in the training data set is the value obtained by dividing the total amount of identification target information contained in the training data set by the number of training data contained in the training data set (average value).
  • the learning data is a pair of input data and teacher data, and the learning data set includes a plurality of learning data.
  • the amount of identification target information included in the learning data set is, for example, the area of the identification target area in the image.
  • the area of the identification target area in the image can be calculated from the number of pixels.
  • When the data is voice data, it is the length of the identification target information in the data delimited by breaks in the audio.
  • The initial data set may be a collection of data in which the input data and the teacher data are delimited by breaks in the audio, and the data may be sorted in descending order of the difference between the input data signal and the teacher data signal.
  • Likewise, the initial data set may be a collection of data in which the input data and the teacher data are delimited by breaks in the text, and the data may be sorted in descending order of the difference between the input data text and the teacher data text.
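  • A minimal sketch of how the average amount of identification target information per learning datum could be computed for image data; the pixel-counting helper and the labelling colour are assumptions for illustration:
```python
import numpy as np

def target_pixel_count(teacher_image, label_colour=(0, 255, 0)):
    """Area, in pixels, of the identification target region in one teacher image."""
    return int(np.all(teacher_image == np.array(label_colour), axis=-1).sum())

def dataset_info_amount(dataset, amount_fn=target_pixel_count):
    """Total identification target amount divided by the number of learning data."""
    return sum(amount_fn(teacher) for _, teacher in dataset) / len(dataset)
```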
  • FIG. 12 is a diagram showing an example of the device configuration of the learning system (identifier generation system) according to the second embodiment.
  • the learning system 2-190 composed of the learning device (identifier generator) 2-100 and each device connected to the learning device 2-100 will be described in detail.
  • the learning system 2-190 includes a learning device 2-100 for learning, a data acquisition device 2-110 for acquiring data, and a data server 2-120 for storing the acquired data.
  • the learning system 2-190 is a data processing device 2-130 that processes data to create teacher data, a display unit 2-140 that displays the acquired input data and the learning result, and instructions from the user. It has an operation unit 2-150 for inputting.
  • the learning device 2-100 acquires a pair (learning data) of the input data and the teacher data created by processing the input data with the data processing device 2-130.
  • the learning data set including the plurality of learning data created in this way is the initial data set.
  • the training data set is acquired from the initial data set and training is performed.
  • the data acquisition device 2-110 in the present embodiment is a transmission electron microscope (TEM: Transmission Electron Microscope), and the input data is a TEM image.
  • the learning device 2-100 is, for example, a computer, and performs learning according to the present embodiment.
  • the learning device 2-100 has at least a CPU 2-31, a communication IF2-32, a ROM 2-33, a RAM 2-34, a storage unit 2-35, and a common bus 2-36.
  • the CPU 2-31 integrally controls the operation of each component of the learning device 2-100. By controlling the CPU 2-31, the learning device 2-100 may also control the operations of the data acquisition device 2-110 and the data processing device 2-130.
  • the data server 2-120 holds the data acquired by the data acquisition device 2-110.
  • the data processing device 2-130 processes the input data stored in the database so that it can be used for learning.
  • Communication IF (Interface) 2-32 is realized by, for example, a LAN card.
  • the communication IF2-32 controls communication between the external device (for example, the data server 2-120) and the learning device 2-100.
  • the ROM 2-33 is realized by a non-volatile memory or the like, stores a control program executed by the CPU 2-31, and provides a work area when the program is executed by the CPU 2-31.
  • RAM (Random Access Memory) 2-34 is realized by a volatile memory or the like, and temporarily stores various information.
  • the storage unit 2-35 is realized by, for example, an HDD (Hard Disk Drive) or the like, and includes an operating system (OS: Operating System), a device driver of a peripheral device, and a program for performing learning according to the present embodiment described later. Stores various application software.
  • the operation unit 2-150 is realized by, for example, a keyboard or a mouse, and inputs an instruction from the user into the device.
  • the display unit 2-140 is realized by, for example, a display or the like, and displays various information toward the user.
  • the operation unit 2-150 and the display unit 2-140 provide a function as a GUI (Graphical User Interface) under the control of the CPU 2-31.
  • the display unit 2-140 may be a touch panel monitor that accepts operation input, and the operation unit 2-150 may be a stylus pen.
  • Each component of the learning device 2-100 is communicably connected to the others by a common bus 2-36.
  • the data acquisition device 2-110 is, for example, a scanning electron microscope (SEM), a transmission electron microscope (TEM), an optical microscope, a digital camera, a smartphone, or the like.
  • the data acquisition device 2-110 transmits the acquired data to the data server 2-120.
  • a data acquisition control unit (not shown) that controls the data acquisition device 2-110 may be included in the learning device 2-100.
  • FIG. 13 is a diagram showing an example of the functional configuration of the learning system according to the second embodiment.
  • the main body that executes the program may be one or more CPUs, and the ROM that stores the program may also be one or more memories.
  • Another processor such as a GPU (Graphics Processing Unit) may be used instead of the CPU or in combination with the CPU. That is, the functions of the respective parts shown in FIG. 13 are realized by at least one processor (hardware) executing a program stored in at least one memory communicably connected to that processor.
  • The learning device 2-100 has at least a reception unit 2-41, an acquisition unit 2-42, a selection unit 2-43, a learning unit 2-44, a classifier 2-45, a display control unit 2-48, and a display unit 2-140.
  • the learning device 2-100 is communicably connected to the data server 2-120 and the display unit 2-140.
  • Reception unit 2-41 accepts data set selection conditions (described later) via operation unit 2-150.
  • Acquisition unit 2-42 acquires the initial data set from the data server 2-120.
  • the selection unit 2-43 processes the initial data set acquired by the acquisition unit 2-42, and selects the first learning data set and the second learning data set.
  • The learning unit 2-44 sequentially executes learning using the first learning data set and the second learning data set acquired by the selection unit 2-43. That is, it performs at least a first learning using the first learning data set, and a second learning that updates the information contained in the classifier by learning using the information contained in the classifier generated in the first learning and the second learning data set.
  • the information included in the classifier generated in the first learning is stored in the information storage unit in the classifier.
  • each part of the learning device 2-100 may be realized as an independent device.
  • the learning device 2-100 may be a workstation.
  • The functions of each part may be realized as software that operates on a computer, and the software that realizes the functions of each part may run on a server via a network such as a cloud. In the present embodiment, each part is realized by software running on a computer installed in a local environment.
  • FIG. 14 is a flow chart showing an example of a method for generating a classifier according to the second embodiment.
  • This embodiment is realized by the CPU 2-31 executing a program that realizes the functions of each part stored in the ROM 2-33.
  • the image to be processed will be described as a TEM image.
  • the TEM image is acquired as a two-dimensional shading image.
  • carbon black in the coating film of the melamine / alkyd resin paint will be described as identification target information.
  • An initial data set of 1000 pairs (2000 images of size 128 × 128) was used.
  • The data were divided 8:2 between learning and evaluation.
  • the learning data set includes the learning data.
  • the learning data is composed of input data and teacher data for the input data.
  • the teacher data is the image data with the identification target information attached. For example, the identification target area is shown in the image data.
  • the correct image is an image obtained by processing the identification target information in the identification target image by using an appropriate image processing method. For example, an image obtained by binarizing the identification target information and other information, or an image filled with the identification target information.
  • the carbon black in the TEM image will be described using an image filled with a luminance value (0,255,0).
  • the reception unit 2-41 receives the data set selection condition via the operation unit 2-150.
  • the dataset selection criteria are entered by the user.
  • the data set selection condition includes at least a method of dividing the initial data set, information on the data set used for training among the divided data sets, and a learning order.
  • As the method of dividing the data set, a method of dividing by a threshold value of the amount of identification target information is used.
  • Here, the amount of identification target information is defined by the number of pixels filled with the luminance value (0, 255, 0), and the threshold value is set to 5000 pixels.
  • step S2-202 the acquisition unit 2-42 acquires the initial data set from the data server 2-120.
  • step S2-203 the selection unit 2-43 processes the initial data set acquired by the acquisition unit 2-42, and selects the first learning data set and the second learning data set.
  • step S2-203 the carbon black in the melamine / alkyd resin is used as the identification target information.
  • the data sets are sorted in order from the one with the largest amount of identification target information. That is, the data sets are sorted in descending order of the number of pixels filled with the luminance value (0,255,0).
  • the data set is divided according to the threshold value received by the reception unit 2-41.
  • the learning process is determined according to the information of the data set used for learning received by the reception unit 2-41 and the learning order.
  • a data set containing images having an amount of identification target information of 5000 pixels or more is referred to as a first training data set
  • a data set containing images having an amount of identification target information of 0 pixels or more is referred to as a second learning data set.
  • Since the second learning data set includes the first learning data set, it is possible to generate a classifier with higher discrimination accuracy.
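  • A minimal sketch of this sorting and threshold-based division, continuing the pixel-count helper sketched earlier; the 5000-pixel default follows the example in the text, and the function name is an assumption:
```python
def select_training_sets(initial_dataset, amount_fn, threshold=5000):
    """Sort by descending amount of identification target information, then split
    at the threshold; amount_fn is, for example, the pixel-count helper above."""
    ordered = sorted(initial_dataset, key=lambda pair: amount_fn(pair[1]), reverse=True)
    first_set = [pair for pair in ordered if amount_fn(pair[1]) >= threshold]
    second_set = ordered     # threshold 0 pixels: includes the first training data set
    return first_set, second_set
```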
  • step S2-204 the learning unit 2-44 executes learning using the first learning data set selected by the selection unit.
  • learning refers to generating a classifier by performing machine learning according to a predetermined algorithm using a learning data set.
  • U-Net is used as a predetermined algorithm. Since the learning method of U-Net is a well-known technique, detailed description thereof will be omitted in the present embodiment.
  • The algorithm is not limited to U-Net; SVM (Support Vector Machine), DNN (Deep Neural Network), CNN (Convolutional Neural Network), FCN (Fully Convolutional Network), SegNet, and other algorithms used for semantic segmentation that classifies classes in 1-pixel units can also be used, and GAN (Generative Adversarial Networks) may also be used.
  • Padding (inflating) in the present embodiment means generating new data used for learning and increasing the amount of data by performing, for example, at least one of rotation, inversion, luminance conversion, distortion addition, enlargement, and reduction. Inflating data can also be rephrased as data augmentation. Further, when the input data is audio data, new data used for learning can be generated and the amount of data increased by adding to the input data a sound that combines sounds of one or more frequencies.
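  • A minimal sketch of such padding (data augmentation) for one image pair, combining rotation, inversion, and a luminance shift; the transform choices and value ranges are assumptions for illustration:
```python
import numpy as np

def augment_pair(input_image, teacher_image, rng=None):
    """Create one new learning datum; geometric transforms are applied identically
    to the input and teacher images so that the labels stay aligned."""
    rng = np.random.default_rng() if rng is None else rng
    k = int(rng.integers(0, 4))                     # rotation by 0/90/180/270 degrees
    x, t = np.rot90(input_image, k), np.rot90(teacher_image, k)
    if rng.random() < 0.5:                          # horizontal inversion
        x, t = np.fliplr(x), np.fliplr(t)
    shift = int(rng.integers(-20, 21))              # simple luminance conversion
    x = np.clip(x.astype(np.int16) + shift, 0, 255).astype(np.uint8)
    return x, t
```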
  • The initial data set is divided into a learning data set and an evaluation data set in advance.
  • step S2-205 the information generated in step S2-204 is stored in the information storage unit 2-46 of the classifier.
  • step S2-206 learning is performed using the information contained in the classifier saved in step S2-205 and the second learning data set.
  • the information contained in the classifier refers to the structure, weight, bias, and the like of the model.
  • the weight and bias are parameters when calculating the output from the input. For example, in the case of a neural network, when x in the equation (2-1) is input, w is the weight and b is the bias.
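  • Presumably, equation (2-1) has the standard affine form y = wx + b, where the output y is computed from the input x with weight w and bias b.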
  • the model structure is not changed, and training is performed so as to optimize the weights and biases for the second training data set.
  • the display control unit 2-48 displays the learning result on the display unit 2-140.
  • the display control unit 2-48 controls to transmit the learning result to the display unit 2-140 connected to the learning device 2-100 and display the learning result on the display unit 2-140.
  • the progress of learning can be confirmed by displaying the input image, the correct answer image, and the image subjected to the inference processing using the generated discriminator side by side. Further, in order to confirm the progress of learning in more detail, the value of IoU (described later) may be displayed.
  • IoU (Intersection over Union) is computed from the numbers of TP (True Positive), FP (False Positive), and FN (False Negative) pixels.
  • the learning device 2-100 in the present embodiment sequentially learns from the data set having a large amount of identification target information. Therefore, it is possible to first learn the parameters of image conversion having a large degree of conversion and gradually change the parameters, so that the identification target information can be accurately identified.
  • FIG. 15 is a diagram showing an example of the functional configuration of the learning system (identifier generation system) according to the second embodiment.
  • The learning device (classifier generation device) 2-200 includes a reception unit 2-41, an acquisition unit 2-42, a selection unit 2-43, a learning unit 2-44, a classifier 2-45, a display control unit 2-48, a data expansion unit 2-49, and a display unit 2-140.
  • the data expansion unit 2-49 expands the initial data set acquired by the acquisition unit 2-42. That is, the data expansion unit 2-49 can increase the number of images of input data.
  • FIG. 16 is a flow chart showing an example of a method for generating a classifier according to the second embodiment.
  • the reception unit 2-41 receives the data set selection condition via the operation unit 2-150.
  • the dataset selection criteria are entered by the user.
  • The data set selection conditions include at least the number of data expansions per image in the initial data set, the patch size, the method of dividing the data set, information on the data sets used for training among the divided data sets, and the learning order.
  • the patch size is the number of vertical and horizontal pixels of the selected image when a part of the image is selected.
  • the method of classifying the data set shall be based on the threshold value of the amount of identification target information.
  • The amount of identification target information is defined by the number of pixels filled with the luminance value (0, 255, 0). Further, two threshold values are set: 5000 pixels and 1000 pixels.
  • step S2-302 the acquisition unit 2-42 acquires the initial data set from the data server 2-120.
  • In step S2-303, the data expansion unit 2-49 expands the initial data set acquired by the acquisition unit 2-42.
  • Here, 2000 input data are generated by cutting out 100 images of patch size 128 × 128 from each image of the initial data set, which consists of 20 pairs (40 images) of size 1280 × 960.
  • The data were divided 8:2 between learning and evaluation.
  • FIG. 17 is a diagram showing an example of the data expansion processing procedure according to the second embodiment.
  • the process of step S2-303 will be described with reference to FIG.
  • the carbon black in the melamine / alkyd resin is used as the identification target information.
  • The data expansion unit 2-49 expands the data by extracting a plurality of regions of interest from the initial data set.
  • FIG. 17 shows an example in which the area of interest 2-540, the area of interest 2-541, and the area of interest 2-542 are extracted for each of the position coordinates 2-530, the position coordinates 2-531, and the position coordinates 2-532.
  • the input image in this embodiment is composed of a plurality of pixels whose positions can be specified by two-dimensional Cartesian coordinates (x, y). Assuming that the number of pixels in the horizontal direction and the vertical direction of the image is x_size and y_size, respectively, 0 ⁇ x ⁇ x_size and 0 ⁇ y ⁇ y_size hold.
  • The size of the region of interest is set equal to the patch size. Further, when (x_i, y_i) is located near the edge of the image and the size of the region of interest would become smaller than the patch size, the periphery of the image may be filled with pixel values of 0, a so-called padding process, so that the size of the region of interest is adjusted to be the same as the patch size.
  • step S2-304 the selection unit 2-43 processes the initial data set acquired by the acquisition unit 2-42, and selects the first learning data set and the second learning data set.
  • The process of step S2-304 will be described.
  • the data sets are sorted in order from the one with the largest amount of identification target information. That is, the data sets are sorted in descending order of the number of pixels filled with the luminance value (0,255,0).
  • the data set is divided according to the threshold value received by the reception unit 2-41.
  • the learning procedure is determined according to the information on the data sets used for learning and the learning order received by the reception unit 2-41.
  • a data set containing images whose amount of identification target information is 5000 pixels or more is referred to as the first training data set,
  • and a data set containing images whose amount of identification target information is 1000 pixels or more is referred to as the second learning data set.
  • a data set containing images whose amount of identification target information is 0 pixels or more may be used as a third learning data set for further learning.
  • accordingly, the second training data set includes the first training data set,
  • and the third training data set includes the first training data set and the second training data set. This makes it possible to generate a classifier with higher discrimination accuracy.
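  • The nested, threshold-based selection above could be sketched as follows; the green (0,255,0) labelling convention is taken from this embodiment, while the function names are assumptions.

```python
import numpy as np

def target_amount(label):
    """Amount of identification target information: number of pixels with value (0, 255, 0)."""
    return int(np.all(label == (0, 255, 0), axis=-1).sum())

def split_by_thresholds(pairs, thresholds=(5000, 1000, 0)):
    """Sort patch pairs by target amount and build nested data sets per threshold.

    With thresholds (5000, 1000, 0), set 1 is contained in set 2, which is contained
    in set 3, matching the nesting described above."""
    pairs = sorted(pairs, key=lambda p: target_amount(p[1]), reverse=True)
    return [[p for p in pairs if target_amount(p[1]) >= t] for t in thresholds]
```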
  • in step S2-305, the learning unit 2-44 executes learning using the first learning data set selected by the selection unit.
  • in step S2-306, the information generated in step S2-305 is stored in the information storage unit 2-46 of the classifier.
  • in step S2-307, the information contained in the classifier is updated by learning using the information stored in the information storage unit in step S2-306 and the second learning data set.
  • the information contained in the classifier refers to the structure, weights, biases, and the like of the model.
  • further learning may be performed using the third learning data set and the information contained in the discriminator generated by learning with the second learning data set.
  • the number of data sets may be larger; when the number of learning steps is n (n is an integer of 2 or more), the amount of identification target information preferably decreases monotonically as n increases. That is, the slope obtained when plotting the amount of identification target information against the number of learning steps is preferably negative.
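  • The staged learning from data sets with decreasing amounts of identification target information could be sketched as below; `train_one_stage` is a hypothetical training routine that updates the model in place, not an API defined by the patent.

```python
def staged_training(model, data_sets, train_one_stage):
    """Train sequentially, starting from the data set with the largest amount of
    identification target information; each stage updates the weights of the
    classifier produced by the previous stage."""
    for stage, data_set in enumerate(data_sets, start=1):
        # train_one_stage is assumed to update the model in place (structure, weights, bias)
        train_one_stage(model, data_set)
        print(f"stage {stage}: trained on {len(data_set)} samples")
    return model
```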
  • the display control unit 2-48 causes the display unit 2-140 to display the learning result.
  • the learning device 2-100 in the present embodiment can accurately identify the identification target information by sequentially learning from the data set having a large amount of identification target information.
  • the data set is automatically selected, and the learning process is repeated until the evaluation value reaches the target value.
  • the red blood cell part in the image is filled with the brightness value (255,0,0)
  • the white blood cell part is filled with the brightness value (0,255,0)
  • the platelet part is filled with the brightness value (0,0,255).
  • an initial data set consisting of 2000 images of size 128 × 128, forming 1000 pairs, was used.
  • the data were split 8:2 between learning and evaluation.
  • FIG. 18 is a diagram showing an example of input data according to the 2-3 embodiment.
  • FIG. 19 is a diagram showing an example of the functional configuration of the learning system (classifier generation system) according to the 2-3 embodiment.
  • the learning device (classifier generation device) 2-300 has, as its functional configuration, at least a reception unit 2-41, an acquisition unit 2-42, a selection unit 2-43, a learning unit 2-44, a discriminator 2-45, an evaluation unit 2-50, a display control unit 2-48, and a display unit 2-140.
  • the evaluation unit 2-50 makes inferences using the classifier stored in the discriminator 2-45, ends learning when the value of IoUavg is higher than the target value, and repeats the learning process when the value of IoUavg is lower than the target value.
  • FIG. 20 is a flow chart showing an example of the classifier generation method according to the 2-3 embodiment.
  • the reception unit 2-41 receives the data set selection condition via the operation unit 2-150.
  • the dataset selection criteria are entered by the user.
  • the selection condition includes at least the target value of IoU, the upper limit learning time, and the initial value of the class width.
  • the method of classifying the initial data set is to classify the initial data set according to the amount of identification target information.
  • the initial value of the class width is 1000.
  • in step S2-402, the acquisition unit 2-42 acquires the initial data set from the data server 2-120.
  • in step S2-403, the selection unit 2-43 processes the initial data set acquired by the acquisition unit 2-42 and selects the first learning data set and the second learning data set.
  • the selection unit 2-43 divides the initial data set into classes according to the initial value of the class width received by the reception unit 2-41.
  • the data belonging to the class with the largest amount of identification target information are set as the first learning data set,
  • and the data belonging to the class with the largest amount and the class with the second largest amount of identification target information are combined and used as the second training data set.
  • in step S2-404, the learning unit 2-44 executes learning using the first learning data set selected by the selection unit.
  • in step S2-405, the information generated in step S2-404 is stored in the information storage unit 2-46.
  • in step S2-406, the information contained in the classifier is updated by learning using the information stored in the information storage unit in step S2-405 and the second learning data set.
  • the information contained in the classifier refers to the structure, weights, biases, and the like of the model.
  • in step S2-407, the evaluation unit 2-50 makes an inference using the classifier 2-45, ends learning when the value of IoUavg is higher than the target value, and repeats the learning process when the value of IoUavg is lower than the target value.
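  • A minimal sketch of this repeat-until-target loop is given below; the helper functions (`train_one_stage`, `evaluate_iou_avg`) and the cumulative class schedule are assumptions used only for illustration.

```python
def train_until_target(model, classes, train_one_stage, evaluate_iou_avg,
                       target_iou=0.4, max_rounds=10):
    """Repeat staged training until the average IoU on the evaluation set reaches
    the target value (or the round limit is hit).

    `classes` is a list of data sets ordered from the class with the largest amount
    of identification target information to the smallest; each round trains on a
    cumulative union of classes, as in steps S2-403 to S2-407."""
    iou_avg = 0.0
    for _ in range(max_rounds):
        cumulative = []
        for cls in classes:
            cumulative = cumulative + cls          # 1st set, then 1st + 2nd, ...
            train_one_stage(model, cumulative)
        iou_avg = evaluate_iou_avg(model)
        if iou_avg >= target_iou:
            break                                  # target reached, end learning
    return model, iou_avg
```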
  • the display control unit 2-48 causes the display unit 2-140 to display the learning result.
  • mIoU is defined by equation (2-3).
  • the IoUavg values obtained were 0.08 and 0.45, respectively.
  • as described above, it is possible to provide a classifier generation method that can accurately identify the identification target information. Further, according to the present invention, it is possible to provide an identification method and an identification device using the classifier generated by this generation method, which can accurately identify the identification target information.
  • the learning device and the learning system in each of the above-described embodiments may be realized as a single device, or may be a form in which devices including a plurality of information processing devices are combined so as to be able to communicate with each other to execute the above-mentioned processing. Both are included in the embodiments of the present invention.
  • the above-mentioned processing may be executed by a common server device or a group of servers.
  • the common server device corresponds to the learning device according to the embodiment
  • the server group corresponds to the learning system according to the embodiment.
  • the learning device and the plurality of devices constituting the learning system need not be present in the same facility or in the same country as long as they can communicate at a predetermined communication rate.
  • the present invention can take the form of, for example, a system, an apparatus, a method, a program, or a recording medium (storage medium). Specifically, it may be applied to a system composed of a plurality of devices (for example, a host computer, an interface device, an imaging device, a Web application, etc.), or to an apparatus composed of a single device.
  • a recording medium (or storage medium) in which a software program code (computer program) that realizes the functions of the above-described embodiment is recorded is supplied to the system or device.
  • the storage medium is a computer-readable storage medium.
  • the computer (or CPU or GPU) of the system or device reads and executes the program code stored in the recording medium.
  • the program code itself read from the recording medium realizes the function of the above-described embodiment, and the recording medium on which the program code is recorded constitutes the present invention.
  • the above document 2-1 describes a diagnostic support device that supports diagnosis of a diseased area by using deep learning. This technique performs highly accurate diagnosis by normalizing the color brightness of an image in advance and separating the diseased part and the non-diseased part.
  • the above-mentioned Document 2-2 discloses a technique for accurately identifying a nodule from a nodule candidate image by connecting a plurality of classifiers and learning while removing a sample that is clearly normal. Connecting a plurality of classifiers in this way is called a cascade type classifier, and is a technique often used to improve the discrimination accuracy.
  • the method of generating the classifier according to the present embodiment is a method of generating a classifier for estimating identification target information in data. Specifically, it has at least a padding step (S3-102) in which the training data of the training data set group are padded (inflated), and a generation step (S3-103) in which a classifier is generated by performing training using the padded training data set group (FIG. 21).
  • the training data set group includes at least the first training data set and the second training data set.
  • the first and second training data sets include training data.
  • the learning data is composed of input data and teacher data for the input data.
  • the second training data set contains a larger number of training data than the first training data set.
  • the amount of identification target information contained in the input data included in the first learning data set is larger than the amount of identification target information contained in the input data included in the second learning data set.
  • the present inventors have found that if the first learning data set and the second learning data set are used for training without going through the padding step, the identification target information cannot be accurately identified. This is considered to be because the amount of identification target information in the input data included in the second learning data set is small. That is, it was found that when learning is performed with input data containing little identification target information, inferences that omit the identification target information tend to be made even when the inference data includes it. Therefore, the training data of the first training data set, whose input data have a large amount of identification target information, are inflated so that the number of training data included in the first training data set becomes equal to or greater than the number of training data included in the second training data set. By doing so, the amount of input data having a large amount of identification target information increases, and the identification target information can be accurately identified.
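  • An illustrative sketch of this padding (inflation) step, under the assumption that `augment` is some rotation/inversion/brightness-conversion routine and that data sets are lists of (input, teacher) pairs, is shown below; it is not the patent's implementation.

```python
import random

def inflate_to_balance(first_set, second_set, augment):
    """Pad (inflate) the first training data set until it has at least as many
    training data as the second training data set."""
    inflated = list(first_set)
    while len(inflated) < len(second_set):
        inp, teacher = random.choice(first_set)
        inflated.append(augment(inp, teacher))   # augment returns a new (input, teacher) pair
    return inflated
```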
  • a learning data set group reception step (S3-101) may also be provided.
  • the data is an expression of information, which is formalized to be suitable for transmission, interpretation or processing, and can be reinterpreted as information.
  • Examples of data include image data, sound data (voice data, etc.), text data, and the like.
  • examples of the input data include input image data, sound input data, and input text data.
  • the identification target information is information to be identified in the data.
  • when the data is image data, at least one piece of information on the position, area, and distribution of the identification target area in the image data is the identification target information.
  • the classifier generated by the generation method according to the present embodiment can estimate and extract the identification target area in the image data, which is difficult to extract visually by the user.
  • the amount of identification target information can be the number of pixels included in the identification target area.
  • when the data is sound data, at least one of the frequency and intensity of the sound to be identified (identification target sound) in the sound data is the identification target information.
  • the classifier generated by the generation method according to the present embodiment can estimate and extract the sound to be identified in the noise-containing sound data, which is difficult for the user to extract.
  • when the sound data is voice data of a plurality of speakers, the voice data of at least one speaker can be used as the identification target information.
  • when the data is text data, information on the identification target characters or character strings in the text data, or their number, is the identification target information.
  • the classifier generated by the generation method according to the present embodiment can estimate and extract a character string to be identified in text data, which is difficult for the user to extract.
  • the learning data in the present embodiment is learning data for generating a discriminator, and is composed of input data and teacher data for the input data.
  • for example, when the input data is image data (input image data), the teacher data is the image data with the identification target information attached, for example image data in which the identification target area is indicated.
  • the amount of identification target information contained in the input data is, for example, the ratio of the identification target area to the image data when the input data is image data. That is, a large amount of identification target information means that, for example, when the input data is image data, the ratio of the identification target region to the image data is large. Further, when the input data is sound data, a large amount of identification target information means that the intensity of the identification target sound in the sound data is large, or the sound data is voice data of a plurality of speakers. In the case of, it means that the number of speakers to be extracted is large.
  • a large amount of identification target information means, for example, a large number of characters or character strings to be identified in the text data.
  • the training data set in the present embodiment includes the above-mentioned training data.
  • the number of training data contained in the second training data set is larger than the number of training data contained in the first training data set.
  • the learning data set group in the present embodiment includes at least a first learning data set and a second learning data set.
  • the training data set group may include three or more training data sets.
  • data padding means generating new input data and increasing the number of input image data by performing, for example, at least one of rotation, inversion, luminance conversion, distortion addition, enlargement, and reduction. Inflating data in this way can also be called data augmentation.
  • when the input data is sound, new sound input data can be generated and the data inflated by adding a sound that combines one or more types of sounds to the input data.
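  • For the sound case, a minimal sketch of inflation by mixing is given below; the mixing gains and function name are assumptions, and waveforms are assumed to be NumPy arrays.

```python
import numpy as np

def inflate_sound(voice, noises, rng=None):
    """Generate a new sound input by adding one or more noise waveforms
    (scaled at random) to the original voice waveform."""
    rng = rng or np.random.default_rng()
    mixed = voice.astype(np.float32).copy()
    for noise in noises:
        gain = rng.uniform(0.1, 0.5)              # mixing ratio range is an assumption
        n = min(len(mixed), len(noise))
        mixed[:n] += gain * noise[:n].astype(np.float32)
    return mixed
```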
  • the classifier generation device is a device for generating a classifier for estimating identification target information in data. Specifically, it has at least an inflating unit 3-22 that inflates the learning data of the learning data set group and a generation unit 3-23 that generates a classifier by performing learning using the inflated learning data set group (FIG. 22).
  • the training data set group includes at least the first training data set and the second training data set.
  • the first and second training data sets include training data.
  • the learning data is composed of input data and teacher data for the input data.
  • the second training data set contains a larger number of training data than the first training data set.
  • the amount of identification target information contained in the input data included in the first learning data set is larger than the amount of identification target information contained in the input data included in the second learning data set.
  • the generation device can be configured such that the acquisition unit 3-21 acquires the learning data set group by operating the operation unit 3-150. Further, the generator according to the present embodiment can be configured to send and receive data to and from the data server 3-120.
  • the classifier according to the present embodiment is generated by the generation method and the generation device according to the present embodiment.
  • the discriminator generated by the generation method and the generation device according to the present embodiment can accurately infer the identification target information included in the input inference data.
  • the information processing apparatus includes the above-mentioned classifier, and has an inference unit that infers the identification target information included in the data for inference using the classifier.
  • the information processing method includes the above-mentioned classifier and has an inference step of inferring the identification target information included in the inference data using the classifier.
  • the area identification system 3-190 has a data input device 3-110 that captures images for learning, a data server 3-120 that stores the captured images, a data processing device 3-130 with which the user identifies regions of the image and colors the identified regions, and a classifier learning device 3-100 that trains the classifier. It further has a display unit 3-140 for displaying the learning result and the frequency distribution, and an operation unit 3-150 with which the user inputs operation instructions to the classifier learning device.
  • the classifier learning device 3-100 acquires a learning input image and a learning correct answer image at the time of learning, learns them, and outputs a learned model.
  • inference can be performed using the classifier generated by the classifier learning device 3-100.
  • an inference input image is acquired, an identification area in the input image is extracted using the generated trained model, the entire area or its boundary is colored with a certain color, and the image is output as an inferred image.
  • the classifier learning device 3-100 has at least a CPU 3-31, a communication IF 3-32, a ROM 3-33, a RAM 3-34, a storage unit 3-35, and a common bus 3-36.
  • the CPU 3-31 integrally controls the operation of each component of the classifier learning device 3-100.
  • the classifier learning device 3-100 may also control the operation of the data input device 3-110.
  • the data server 3-120 holds an image taken by the data input device 3-110.
  • Communication IF (Interface) 3-32 is realized by, for example, a LAN card.
  • the communication IF3-32 controls communication between the external device (for example, the data server 3-120) and the classifier learning device 3-100.
  • the ROM 3-33 is realized by a non-volatile memory or the like, stores a control program executed by the CPU 3-31, and provides a work area when the program is executed by the CPU 3-31.
  • the RAM (Random Access Memory) 3-34 is realized by a volatile memory or the like, and temporarily stores various information.
  • the storage unit 3-35 is realized by, for example, an HDD (Hard Disk Drive). It stores various application software, including an operating system (OS), device drivers of peripheral devices, and the program for identifying areas according to the present embodiment described later.
  • the operation unit 3-150 is realized by, for example, a keyboard, a mouse, or the like, and inputs an instruction from the user into the device.
  • the display unit 3-140 is realized by, for example, a display or the like, and displays various information toward the user.
  • the operation unit 3-150 and the display unit 3-140 provide a function as a GUI (Graphical User Interface) under the control of the CPU 3-31.
  • the display unit 3-140 may be a touch panel monitor that accepts operation input, and the operation unit 3-150 may be a stylus pen.
  • Each of the above components is communicably connected to each other by a common bus 3-36.
  • the data input device 3-110 is, for example, a scanning electron microscope (SEM), a transmission electron microscope (TEM: Transmission Electron Microscope), an optical microscope, a digital camera, a smartphone, or the like.
  • the data input device 3-110 transmits the acquired image to the data server 3-120.
  • An imaging control unit (not shown) that controls the data input device 3-110 may be included in the classifier learning device 3-100.
  • the functional configuration of the area identification system including the classifier learning device 3-100 according to the present embodiment will be described with reference to FIG. 24.
  • the main body that executes the program may be one or more CPUs, and the ROM that stores the program may also be one or more memories.
  • another processor such as GPU (Graphics Processing Unit) may be used instead of the CPU or in combination with the CPU. That is, the functions of the respective parts shown in FIG. 24 are realized by executing a program stored in at least one or more memories in which at least one or more processors (hardware) are communicably connected to the processors.
  • the classifier learning device 3-100 has, as its functional configuration, a reception unit 3-41, an acquisition unit 3-42, a frequency distribution calculation unit 3-44, a data expansion unit 3-45, a learning unit 3-46, a storage unit 3-47, and a display control unit 3-48. It may further have an extraction unit 3-43.
  • the classifier learning device 3-100 is communicably connected to the data server 3-120 and the display unit 3-140.
  • the reception unit 3-41 receives the data expansion conditions input by the user via the operation unit 3-150. That is, the operation unit 3-150 corresponds to an example of a reception means for receiving settings such as the expansion conditions and the patch size (described later).
  • the expansion condition includes at least one of a frequency distribution (described later), a number of bins, a bin width, and an augmentation method (described later).
  • a bin is one of the mutually disjoint intervals (classes) of a frequency distribution (histogram).
  • the acquisition unit 3-42 acquires a plurality of learning data (which can also be called a learning data pair) composed of a learning input image and a learning correct answer image from the data server 3-120.
  • when the extraction unit 3-43 is provided, it extracts a plurality of small area (data block) pairs from each of the learning input image and the learning correct answer image based on the patch size received by the reception unit 3-41.
  • the frequency distribution calculation unit 3-44 calculates the area or the number of pixels of the extraction region for each of the learning correct answer images or, if extracted data block groups exist, for each data block extracted from the learning correct answer images. Further, using the number of bins and the bin width received by the reception unit 3-41, it creates a frequency distribution with the calculated area or number of pixels as the characteristic value.
  • the data expansion unit 3-45 expands the data of the learning input image and the learning correct answer image based on the created frequency distribution and the instruction to execute the augmentation received by the reception unit 3-41.
  • Learning unit 3-46 learns based on the above teacher data and creates a learned model.
  • the storage unit 3-47 stores the trained model.
  • the display control unit 3-48 uses the display unit 3-140 to output information on the frequency distribution and the learning result.
  • the start command of the inference operation input from the user is received via the operation unit 3-150.
  • Acquisition unit 3-42 acquires an inference image from the data server 3-120.
  • the inference unit (not shown) makes inferences using the trained model 3-49. Subsequently, the display control unit 3-48 outputs the inference result using the display unit 3-140.
  • each part of the classifier learning device 3-100 may be realized as an independent device.
  • the classifier learning device 3-100 may be a workstation.
  • the functions of each part may be realized as software that operates on a computer, and the software that realizes the functions of each part may be realized on a server via a network such as a cloud.
  • each part is realized by software running on a computer installed in a local environment.
  • FIG. 25 is a diagram showing a processing procedure of processing executed by the classifier learning device 3-100 of the present embodiment. This embodiment is realized by the CPU 3-31 executing a program that realizes the functions of each part stored in the ROM 3-33.
  • the image to be processed will be described as a TEM image.
  • the TEM image is acquired as a two-dimensional shading image.
  • the identification target included in the image will be described as an example of the processing target object included in the processing target image.
  • in step S3-201, the reception unit 3-41 receives the data expansion conditions input by the user via the operation unit 3-150.
  • the data expansion condition in the present embodiment includes at least one of the number of bins, the width of the bins, and the augmentation method regarding the frequency distribution to be created.
  • in step S3-202, the acquisition unit 3-42 acquires a learning data pair consisting of a learning input image and a learning correct answer image from the data server 3-120.
  • the learning correct answer image can be the same image as the learning input image, except that the entire extraction region or its boundary is colored.
  • in step S3-202b, small area (data block) pairs are extracted from the learning input image and the learning correct answer image according to the patch size.
  • the patch size is the number of pixels in the vertical and horizontal directions of the cropped image when a part of the target image is cropped.
  • Each pair of extracted data blocks is extracted from the same coordinates on the image.
  • in step S3-203, the frequency distribution calculation unit 3-44 calculates the area of the extraction region for each of the learning correct answer images (or, when the extraction unit 3-43 is provided, for each data block group extracted from the learning correct answer images),
  • and creates a frequency distribution using this area value as the characteristic value.
  • in step S3-204, the data expansion unit 3-45 expands the data of the learning input images and the learning correct answer images based on the frequency distribution and the instruction to execute the augmentation received by the reception unit 3-41.
  • specifically, a technique called augmentation, such as inversion, enlargement, reduction, distortion addition, and brightness change, is used to increase the learning input images and learning correct answer images so that the generated pairs are counted in the same frequency distribution.
  • in this way, teacher data is generated in which the frequency of bins containing a large identification target area is higher than the frequency of bins containing a smaller identification target area.
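  • A sketch of histogram-guided inflation in this spirit is given below; the linear target-frequency profile, the assumption that an augmented copy stays in the same bin (true for area-preserving transforms such as flips and rotations), and the function names are all assumptions.

```python
import numpy as np

def inflate_by_histogram(pairs, areas, num_bins, augment, rng=None):
    """Inflate learning pairs bin by bin so that bins holding a larger identification
    target area end up with a higher frequency than bins holding a smaller one."""
    rng = rng or np.random.default_rng()
    counts, edges = np.histogram(areas, bins=num_bins)
    bin_idx = np.clip(np.digitize(areas, edges[:-1]) - 1, 0, num_bins - 1)
    base = int(counts.max())
    out = list(pairs)
    for b in range(num_bins):
        members = [p for p, i in zip(pairs, bin_idx) if i == b]
        if not members:
            continue
        target = base + base * (b + 1) // num_bins   # target grows with the bin's area
        for _ in range(max(target - int(counts[b]), 0)):
            inp, lbl = members[rng.integers(len(members))]
            out.append(augment(inp, lbl))            # augmented copy counted in the same bin
    return out
```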
  • augmentation performs, for example, rotation, inversion, enlargement, and reduction, and each process can be carried out as follows. A blank (white) image having a size 10 times the patch size in both length and width is prepared in advance, and the image to be transformed is placed at its center. Next, an affine transformation is applied to each coordinate according to Eq. (3-1) and Table 1. In the equation, x and y indicate the coordinates before conversion, and x' and y' indicate the coordinates after conversion. In a normal case, the rotation angle θ may be set between 30° and 330°. Further, a and d are the enlargement/reduction ratios in the vertical and horizontal directions, respectively, and are usually set between 0.1 and 10. Finally, the center is cut out at the patch size to obtain the image after augmentation.
  • for distortion addition, an arbitrary value is added to the x-coordinate to translate it, and this value is varied according to the y-coordinate.
  • the maximum of this arbitrary value is usually preferably between 20% and 60% of the patch size in the x direction.
  • gamma correction can be used as an example of changing the brightness.
  • the gamma value at this time is usually 1.2 or more or 1 / 1.2 or less.
  • linear interpolation processing may be performed on the augmentation image. This makes it possible to smooth out a mosaic-like jagged image.
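  • A minimal sketch of the canvas-placement, affine transform, center crop, and gamma correction described above is given below, assuming a grayscale square patch and OpenCV for the warp; Eq. (3-1) and Table 1 are not reproduced here, so the composed rotation/scaling matrix and the parameter values are illustrative assumptions only.

```python
import numpy as np
import cv2  # OpenCV is an assumption; any affine-capable library would do

def augment_patch(patch, angle_deg=45.0, scale_x=1.2, scale_y=0.8, gamma=1.2):
    """Place the patch at the centre of a white canvas 10x the patch size, apply a
    rotation/scaling affine transform, crop the centre back to the patch size, and
    finally apply gamma correction (parameter values are examples)."""
    p = patch.shape[0]                                    # assumes a square grayscale patch
    canvas = np.full((10 * p, 10 * p), 255, dtype=patch.dtype)
    off = (10 * p - p) // 2
    canvas[off:off + p, off:off + p] = patch

    theta = np.deg2rad(angle_deg)
    cx = cy = 10 * p / 2.0
    # affine matrix combining rotation by theta and axis-wise scaling about the canvas centre
    a, b = scale_x * np.cos(theta), -scale_x * np.sin(theta)
    c, d = scale_y * np.sin(theta),  scale_y * np.cos(theta)
    M = np.array([[a, b, cx - a * cx - b * cy],
                  [c, d, cy - c * cx - d * cy]], dtype=np.float32)
    warped = cv2.warpAffine(canvas, M, (10 * p, 10 * p),
                            flags=cv2.INTER_LINEAR, borderValue=255)

    out = warped[off:off + p, off:off + p]                 # centre crop back to patch size
    out = np.clip(255.0 * (out / 255.0) ** gamma, 0, 255)  # gamma correction
    return out.astype(patch.dtype)
```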
  • the learning unit 3-46 generates a trained model 3-49 by performing machine learning according to a predetermined algorithm using the learning teacher data.
  • as a predetermined algorithm, for example, U-Net, SVM (Support Vector Machine), DNN (Deep Neural Network), CNN (Convolutional Neural Network), or the like may be used.
  • FCN (Fully Convolutional Network), SegNet, and the like can also be used as algorithms for semantic segmentation, which classifies classes in units of one pixel.
  • an algorithm in which a so-called generative model such as GAN (Generative Adversarial Networks) is combined with the above algorithm may be used.
  • in step S3-206, the storage unit 3-47 stores the trained model.
  • in step S3-207, the display control unit 3-48 uses the display unit 3-140 to output information related to the frequency distribution and the learning.
  • at the time of inference, the same processing as in step S3-201 is performed (not shown), except that the information received by the reception unit is an inference start command instead of the data expansion conditions.
  • the same processing as in step S3-202 is performed (not shown), except that inference input data is acquired by the acquisition unit instead of the learning data pair.
  • the area is then inferred using the same algorithm as in the learning process, based on the trained model and the inference input data.
  • the same processing as in step S3-207 is performed (not shown), except that the inference result is output instead of the information related to the frequency distribution and the learning.
  • the inference accuracy can be improved by the above processing.
  • the data handled in the third embodiment can be audio data instead of images, and the input device can be a microphone. Further, by adapting the method to voice data, for example by using the amount of difference between the learning input data and the learning correct answer data instead of the area, it can be used for voice processing such as speaker identification and noise cancellation in voice data.
  • for noise cancellation, the same method can identify which voice components in the entire audio are unnecessary, that is, noise.
  • using this classifier, the voice can be made clear by eliminating the noise from the entire audio.
  • the processing content is the same as that of the third embodiment except that the data expansion method is to increase / decrease the volume, frequency, and speed.
  • the classifier learning device and the area identification system in each of the above-described embodiments may be realized as a single device, or as a mode in which devices including a plurality of information processing devices are combined so as to be able to communicate with each other to execute the above-described processing. Also, both are included in the embodiments of the present invention.
  • the above-mentioned processing may be executed by a common server device or a group of servers.
  • the common server device corresponds to the classifier learning device according to the embodiment
  • the server group corresponds to the area identification system according to the embodiment.
  • the classifier learning device and the plurality of devices constituting the area identification system need only be able to communicate at a predetermined communication rate, and do not need to exist in the same facility or in the same country.
  • the present invention can take the form of, for example, a system, an apparatus, a method, a program, or a recording medium (storage medium). Specifically, it may be applied to a system composed of a plurality of devices (for example, a host computer, an interface device, an imaging device, a Web application, etc.), or to an apparatus composed of a single device.
  • a recording medium (or storage medium) in which a software program code (computer program) that realizes the functions of the above-described embodiment is recorded is supplied to the system or device.
  • the storage medium is a computer-readable storage medium.
  • the computer (or CPU or GPU) of the system or device reads and executes the program code stored in the recording medium.
  • the program code itself read from the recording medium realizes the function of the above-described embodiment, and the recording medium on which the program code is recorded constitutes the present invention.
  • in Example 1, the classifier learning device of the embodiment of the present invention was used to grasp the amount of magenta pigment in cross-sectional TEM images of a color toner.
  • for toner preparation, a pulverized toner containing a magenta pigment was obtained according to a conventional method. As methods for obtaining the pulverized toner, the methods described in JP-A-2010-140062 and JP-A-2003-233215 can be used.
  • FIG. 26 shows an example of a learning input image cut out from the TEM image of the toner.
  • FIG. 26 shows an example of a colored correct answer image for learning.
  • the patch size was 128 × 128, and 100 small areas (data blocks) at the same positions were cut from each of the learning input images and learning correct answer images, giving a total of 1800 pairs.
  • data expansion such as rotation, inversion, enlargement, reduction, distortion addition, and brightness change was performed under the conditions shown in Table 2, and learning was performed to create a classifier.
  • in Example 2, learning for measuring the amount of magenta pigment in the toner was performed using the same TEM images as in Example 1, except for the data expansion conditions, and a classifier was created. As shown in Table 1, the data expansion conditions were set such that the larger the number of pixels in the target area, the higher the frequency.
  • in Comparative Example 1, a discriminator was created by learning for measuring the amount of magenta pigment in the toner using the same TEM images as in Example 1, except that the data was not expanded.
  • in Example 3, the classifier learning device of the embodiment of the present invention was used to identify automobile regions in order to measure the number of cars in the city from aerial photographs.
  • the images used were four aerial photographs of Potsdam City obtained from https://gdo152.llnl.gov/cowc/ (as of October 2019).
  • data expansion such as rotation, inversion, enlargement, reduction, distortion addition, and brightness change was performed under the conditions shown in Table 3, and learning was performed to create a classifier.
  • in Comparative Example 2, a classifier was created by learning to measure the number of cars in the city from the same aerial photographs as in Example 3, except that the data was not expanded.
  • IoU (Intersection over Union) is calculated as IoU = TP / (TP + FP + FN), where TP, FP, and FN denote True Positive, False Positive, and False Negative, respectively.
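  • As a minimal sketch consistent with this definition (not taken from the patent), IoU and the IoUavg used for evaluation could be computed from boolean masks as follows; the function names and the use of NumPy are assumptions.

```python
import numpy as np

def iou(pred_mask, true_mask):
    """IoU = TP / (TP + FP + FN) for a pair of boolean masks."""
    tp = np.logical_and(pred_mask, true_mask).sum()
    fp = np.logical_and(pred_mask, ~true_mask).sum()
    fn = np.logical_and(~pred_mask, true_mask).sum()
    return tp / (tp + fp + fn) if (tp + fp + fn) > 0 else 1.0

def iou_avg(pred_masks, true_masks):
    """Average IoU over all evaluation images (the IoUavg used in the examples)."""
    return float(np.mean([iou(p, t) for p, t in zip(pred_masks, true_masks)]))
```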
  • the IoU values of the examples according to the embodiments of the present invention are larger than the IoU values of the comparative examples, confirming that the identification accuracy is improved. That is, with the classifiers generated by the generation method according to the above embodiment, the number of input data having many pixels in the magenta pigment region or the automobile region increases, so it was found that the identification target information (the magenta pigment region and the automobile region) can be identified with high accuracy.
  • a classifier with high inference accuracy can be generated.
  • the fourth embodiment of the present invention is a combination of the first embodiment, the second embodiment, and the third embodiment of the present invention.
  • the effect of further improving the identification accuracy can be obtained.
  • An example of the fourth embodiment will be described in detail with reference to the drawings. The description of the configuration, function, and operation similar to those of the first to third embodiments will be omitted, and the differences from the above embodiments will be mainly described.
  • the image to be processed is a TEM image
  • the TEM image is acquired as a two-dimensional shading image.
  • carbon black in the coating film of the melamine / alkyd resin paint will be described as an example of the object to be identified.
  • the initial data set contained 50 images forming 25 pairs, each image having a size of 1280 × 960.
  • of these, 20 pairs (40 images) were used for learning and 5 pairs (10 images) were used for evaluation.
  • 2000 input data were generated by cutting out 100 patches of size 128 × 128 from each of the learning images.
  • the ratio of the maximum area to the minimum area of the identification objects in one cropped image was 30 to 120, and the amount of identification objects in one cropped image ranged from 0 pixels to 16384 pixels.
  • as the evaluation value, IoUavg, obtained by calculating the IoU value for each evaluation image and averaging the results, was used.
  • the learning processing is the same as in the 3-1 embodiment, and the inference processing is the same as in the 1-1 embodiment.
  • the data of the low-magnification images were expanded to the same magnification as the high-magnification images, and then, as in the 2-1 embodiment, the learning was performed in two stages.
  • the inference processing part is the same as that of the first embodiment.
  • Table 6 shows a list of processing contents, identification targets, and evaluation values of each of the above embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

An image processing device that acquires information about a specific region of an image on the basis of inference. The image processing device has an information acquisition means that acquires the information about the specific region as inferred by inputting information about a plurality of regions of interest that have been extracted from the image on the basis of prescribed inference conditions into a trained model. The plurality of regions of interest include a first region of interest and a second region of interest. The first region of interest and the second region of interest each have a region that overlaps the other and a region that does not overlap the other.

Description

Image processing device, image processing device control method, classifier generation method, identification method, identification device, classifier generation device, and classifier
The present invention relates to an image processing device, a control method of the image processing device, a method of generating a classifier for identifying identification target information in data, an identification method using a classifier generated by the generation method, an identification device, a classifier generation device, and a classifier.
In recent years, many attempts have been made to process images using deep learning to obtain useful information. The main types of processing include image classification, object detection, and segmentation. Segmentation is a process of specifying, for each region, the class (classification) to which its pixels belong, and is used for diagnosis using medical images, infrastructure inspection, various particle analyses, and the like.
Patent Document 1 describes a technique for distinguishing between benign and malignant target abnormal shadows by acquiring the region and feature amounts of a target abnormal shadow from a medical image. This technique extracts regions of interest from a site of interest in the medical image using a plurality of mutually different position coordinates and performs learning, so that differential diagnosis can be performed with high accuracy even if there are variations due to the work of the doctor. Increasing the data used for learning to give it diversity in this way is called data augmentation, and is a technique often used to improve the accuracy of inference results.
JP-A-2019-30584
However, even when the learning data was augmented, the accuracy of the inference results was sometimes still insufficient.
The image processing apparatus according to the present invention for solving the above problems is an image processing apparatus that acquires information of a specific region in an image based on inference, and has an information acquisition means for acquiring the information of the specific region inferred by inputting, into a trained model, each of the pieces of information of a plurality of regions of interest extracted from the image based on a predetermined inference condition, wherein the plurality of regions of interest include a first region of interest and a second region of interest, and the first region of interest and the second region of interest each have a region that overlaps the other and a region that does not overlap the other.
The control method of the image processing apparatus according to the present invention is a control method of an image processing apparatus that acquires information of a specific region in an image based on inference, and has an information acquisition step of acquiring the information of the specific region inferred by inputting, into a trained model, each of the pieces of information of a plurality of regions of interest extracted from the image based on a predetermined inference condition, wherein the plurality of regions of interest include a first region of interest and a second region of interest, and the first region of interest and the second region of interest each have a region that overlaps the other and a region that does not overlap the other.
Another aspect of the present invention is a method of generating a classifier for identifying identification target information in data, the method having a first learning step of learning using a first learning data set out of an initial data set including a plurality of learning data created from the data, and a second learning step of updating the information contained in the classifier by learning using the information contained in the first classifier generated in the first learning step and a second learning data set out of the initial data set, wherein the amount of the identification target information included in the first learning data set is larger than the amount of the identification target information included in the second learning data set.
Yet another generation method according to the present invention is a method of generating a classifier for estimating identification target information in data. For a learning data set group having a first learning data set including learning data composed of input data and teacher data for the input data, and a second learning data set including a larger number of the learning data than the first learning data set, the method has a padding step of padding the learning data so that the number of the learning data included in the first learning data set becomes equal to or greater than the number of the learning data included in the second learning data set, and a generation step of generating the classifier using the learning data set group having the padded learning data, wherein the amount of the identification target information contained in the input data included in the first learning data set is larger than the amount of the identification target information contained in the input data included in the second learning data set.
Yet another generation device according to the present invention is a device for generating a classifier for estimating identification target information in data. For a learning data set group having a first learning data set including learning data composed of input data and teacher data for the input data, and a second learning data set including a larger number of the learning data than the first learning data set, the device has a padding means for padding the learning data so that the number of the learning data included in the first learning data set becomes equal to or greater than the number of the learning data included in the second learning data set, and a generation means for generating the classifier using the learning data set group having the padded learning data, wherein the amount of the identification target information contained in the input data included in the first learning data set is larger than the amount of the identification target information contained in the input data included in the second learning data set.
According to the image processing apparatus of the present invention, since each of the pieces of information of a plurality of regions of interest in an image is input to a trained model for inference, the inference accuracy regarding the information of a specific region in the image can be improved.
FIG. 1 shows an example of the configuration of the image processing system according to the 1-1 embodiment of the present invention.
FIG. 2 shows an example of the configuration of the image processing system according to the 1-1 embodiment of the present invention.
FIG. 3 shows an example of the processing procedure performed by the image processing apparatus 100 according to the 1-1 embodiment of the present invention.
FIG. 4 illustrates an example of the region-of-interest extraction processing according to the 1-1 embodiment of the present invention.
FIG. 5 shows an example of the effect of the image processing system according to the 1-1 embodiment of the present invention.
FIG. 6 illustrates an example of the region-of-interest extraction processing according to the 1-2 embodiment of the present invention.
FIG. 7 shows an example of the effect of the image processing system according to the 1-2 embodiment of the present invention.
FIG. 8 illustrates an example of the region-of-interest extraction processing according to the 1-3 embodiment of the present invention.
FIG. 9 shows an example of the effect of the image processing system according to the 1-3 embodiment of the present invention.
FIG. 10 illustrates an example of the region-of-interest extraction processing according to the 1-4 embodiment of the present invention.
FIG. 11 shows an example of the effect of the image processing system according to the 1-4 embodiment of the present invention.
FIG. 12 shows an example of the device configuration of the learning system according to the 2-1 embodiment.
FIG. 13 shows an example of the functional configuration of the learning system according to the 2-1 embodiment.
FIG. 14 is a flow chart showing an example of the classifier generation method according to the 2-1 embodiment.
FIG. 15 shows an example of the functional configuration of the learning system according to the 2-2 embodiment.
FIG. 16 is a flow chart showing an example of the classifier generation method according to the 2-2 embodiment.
FIG. 17 shows an example of the data expansion processing procedure according to the 2-2 embodiment.
FIG. 18 shows an example of input data according to the 2-3 embodiment.
FIG. 19 shows an example of the functional configuration of the learning system according to the 2-3 embodiment.
FIG. 20 is a flow chart showing an example of the classifier generation method according to the 2-3 embodiment.
FIG. 21 illustrates the flow of the classifier generation method according to the third embodiment of the present invention.
FIG. 22 illustrates the configuration of the classifier generation device according to the third embodiment of the present invention.
FIG. 23 shows an example of the configuration of a generation system including the generation device according to the 3-1 embodiment of the present invention.
FIG. 24 shows an example of the functional configuration of the generation device according to the 3-1 embodiment of the present invention.
FIG. 25 shows an example of the flow of the generation device 100 according to the 3-1 embodiment of the present invention.
FIG. 26 shows an example of the learning data according to Example 1 of the present invention.
FIG. 27 is an enlarged view of an example of the learning data according to Example 3 of the present invention.
Hereinafter, embodiments will be described in detail by way of example with reference to the drawings. However, the components described in these embodiments are merely examples, and the technical scope of the present invention is determined by the claims and is not limited by the following individual embodiments.
<< First Embodiment >>
The image processing apparatus according to the first embodiment of the present invention will be described with reference to FIGS. 2 and 4. The image processing apparatus 1-100 according to the present embodiment acquires information of a specific region in an image based on inference. Specifically, it has an information acquisition means 1-50 that acquires the information of the specific region (1-520) inferred by inputting, into a trained model 1-47, each of the pieces of information of a plurality of regions of interest (1-540 to 1-542) extracted from the image (1-500) based on a predetermined inference condition. The plurality of regions of interest include a first region of interest (for example, 1-540) and a second region of interest (for example, 1-541).
 まず、画像1-500中の特定領域(1-520)を推論によって抽出するために学習済みモデルを用いる。学習済みモデルは、特定領域が既知の画像を教師データとして学習して得られる。そして、前述の学習済みモデルに対して、画像から抽出した複数の注目領域の情報の各々を入力する。この際、図4に示すように、複数の注目領域は、互いに重複する領域と、互いに重複しない領域とを有するように選択する。それによって、画像中のある領域A(注目領域同士が重複する領域)において複数の推論結果を得られるだけでなく、その領域Aの周辺の領域の推論結果をも得られる。これらの複数の推論結果を用いることによって、推論の精度が上がり、画像中の特定領域に関する情報を正しく得ることができると考えられる。 First, a trained model is used to extract a specific region (1-520) in image 1-500 by inference. The trained model is obtained by training an image whose specific region is known as teacher data. Then, each of the information of the plurality of areas of interest extracted from the image is input to the above-mentioned trained model. At this time, as shown in FIG. 4, the plurality of areas of interest are selected so as to have a region that overlaps with each other and a region that does not overlap with each other. As a result, not only a plurality of inference results can be obtained in a certain region A (area in which the regions of interest overlap each other) in the image, but also inference results in the region around the region A can be obtained. By using these plurality of inference results, it is considered that the accuracy of inference is improved and information on a specific region in the image can be obtained correctly.
The image in this embodiment is, for example, an image containing an image of a first material and an image of a second material different from the first material. In this case, the information on the specific region includes at least one of the position of the image of the second material in the image and the size of the image of the second material.
It is preferable that the size of the first region of interest and the size of the second region of interest are the same, because this makes them easier to input to the trained model.
In this embodiment, the information on a region of interest includes at least one of the position and the size of the region extracted from the image.
The image processing apparatus according to this embodiment may further include a reception unit 1-41 that accepts the setting of inference conditions. The reception unit may accept an instruction issued by the user operating the operation unit 1-140, may accept an automatic instruction from the image processing apparatus, or may accept instructions in some other way.
The information acquisition means 1-50 may include a model acquisition unit 1-42 that acquires the trained model 1-47. The model acquisition unit may include a generation unit (not shown) that generates the trained model and acquire the trained model from it, or it may acquire the trained model from the data server 1-120. The information acquisition means may also include an extraction unit 1-43 that extracts a plurality of regions of interest from the image based on the inference conditions accepted by the reception unit. The information acquisition means may further include an information acquisition unit 1-45 that obtains a plurality of inference results by inputting each of the plurality of regions of interest extracted by the extraction unit into the trained model, and acquires the information on the specific region based on the plurality of inference results.
The extraction unit may extract the plurality of regions of interest using random numbers, may extract regions of interest regularly from one end of the image to the other, or may use both methods.
The inference conditions include, for example, at least one of the number of inferences performed on average for each pixel of the image, a threshold on the ratio of the number of times a region of interest is inferred to be the specific region to the number of times it is inferred, and the size of the regions of interest.
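For illustration only, these inference conditions could be bundled into a small configuration object. The sketch below is a hypothetical Python representation; the class and field names are assumptions for illustration and are not part of this disclosure.

```python
from dataclasses import dataclass

@dataclass
class InferenceConditions:
    """Hypothetical container for the inference conditions described above."""
    avg_inferences_per_pixel: int = 30  # average number of inferences per pixel
    vote_threshold: float = 0.1         # (times judged as specific region) / (times inferred)
    patch_size: int = 128               # side length of each region of interest, in pixels

# Example: values corresponding to those used later in Embodiment 1-1
conditions = InferenceConditions(avg_inferences_per_pixel=30,
                                 vote_threshold=0.1,
                                 patch_size=128)
print(conditions)
```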
The image processing apparatus according to this embodiment is preferably used when the image contains a plurality of specific regions and the areas of the plurality of specific regions have a distribution. It is also preferably used when the ratio of the maximum value to the minimum value of the areas of the plurality of specific regions is 50 or more, and particularly when this ratio is 100 or more.
The image processing apparatus according to this embodiment may further include a display control unit that, based on the information on the specific region, causes a display unit to display the specific region in the image in a display mode different from that of the rest of the image. For example, as shown in FIG. 4, the specific regions 1-520 can be displayed in black and the other regions in white. The display modes may be differentiated by means other than changing the color.
As described above, the control method of the image processing apparatus according to the embodiment of the present invention is a control method of an image processing apparatus that acquires information on a specific region in an image based on inference. Specifically, it has an information acquisition step of acquiring information on the specific region inferred by inputting, into a trained model, each piece of information on a plurality of regions of interest extracted from the image based on a predetermined inference condition. The plurality of regions of interest include a first region of interest and a second region of interest, and the first region of interest and the second region of interest each have a region that overlaps the other and a region that does not overlap the other.
Hereinafter, embodiments will be described in detail by way of example with reference to the drawings. However, the components described in these embodiments are merely examples, and the technical scope of the present invention is not limited by the individual embodiments below.
(Embodiment 1-1)
(Overview)
The image processing apparatus according to Embodiment 1-1 of the present invention performs the processing of an inference process using a trained model. In the inference process, the user sets inference conditions, and the image processing apparatus extracts a plurality of regions of interest from the inference image based on those conditions. Next, the image processing apparatus performs inference on each of the plurality of regions of interest using a common trained model, and calculates the final inference result based on the individual inference results. Here, an inference result refers to, for example, an object detection result or a segmentation result. The following description deals with the case where an image of a resin in a transmission electron microscope (TEM) image is to be processed, but the scope of application of this embodiment is not limited by the detection target or the type of image acquisition method. A specific device configuration, functional configuration, and processing flow are described below.
(Device configuration)
An image processing system 1-190 composed of the image processing apparatus according to Embodiment 1-1 of the present invention and the devices connected to the image processing apparatus 1-100 will be described with reference to FIG. 1. The image processing system 1-190 includes an image capturing device 1-110 that captures images, a data server 1-120 that stores the captured images, and an image processing apparatus 1-100 that performs image processing. It further includes a display unit 1-130 that displays the acquired input image and the image processing results, and an operation unit 1-140 for inputting instructions from the user. The image processing apparatus 1-100 acquires an input image and performs image processing on the regions of interest appearing in the input image. The input image is, for example, an image obtained by subjecting image data acquired by the image capturing device 1-110 to image processing or the like so as to make it suitable for analysis. The input image in this embodiment serves as the inference image. Each part is described below. The image processing apparatus 1-100 is, for example, a computer and performs the image processing according to this embodiment. The image processing apparatus 1-100 has at least a CPU 1-31, a communication IF 1-32, a ROM 1-33, a RAM 1-34, a storage unit 1-35, and a common bus 1-36. The CPU 1-31 centrally controls the operation of each component of the image processing apparatus 1-100.
Under the control of the CPU 1-31, the image processing apparatus 1-100 may also control the operation of the image capturing device 1-110. The data server 1-120 holds the images captured by the image capturing device 1-110. The communication IF (Interface) 1-32 is realized by, for example, a LAN card, and handles communication between an external device (for example, the data server 1-120) and the image processing apparatus 1-100. The ROM 1-33 is realized by a non-volatile memory or the like, stores the control programs executed by the CPU 1-31, and provides a work area when the CPU 1-31 executes a program. The RAM (Random Access Memory) 1-34 is realized by a volatile memory or the like and temporarily stores various kinds of information. The storage unit 1-35 is realized by, for example, an HDD (Hard Disk Drive) or the like, and stores various application software including an operating system (OS), device drivers for peripheral devices, and a program for performing the image processing according to this embodiment, which is described later. The operation unit 1-140 is realized by, for example, a keyboard and a mouse, and inputs instructions from the user into the apparatus. The display unit 1-130 is realized by, for example, a display, and presents various kinds of information to the user. The operation unit 1-140 and the display unit 1-130 provide a GUI (Graphical User Interface) function under the control of the CPU 1-31. The display unit 1-130 may be a touch-panel monitor that accepts operation input, and the operation unit 1-140 may be a stylus pen. The above components are communicably connected to each other by the common bus 1-36.
The image capturing device 1-110 is, for example, a scanning electron microscope (SEM), a transmission electron microscope (TEM), or an optical microscope. The image capturing device 1-110 may also be another device with an image capturing function, such as a digital camera or a smartphone. The image capturing device 1-110 transmits the acquired images to the data server 1-120. An imaging control unit (not shown) that controls the image capturing device 1-110 may be included in the image processing apparatus 1-100.
(Functional configuration)
Next, the functional configuration of the image processing system including the image processing apparatus 1-100 according to this embodiment will be described with reference to FIG. 2. The functions of the units shown in FIG. 2 are realized by the CPU 1-31 executing the programs stored in the ROM 1-33. The programs may be executed by one or more CPUs, and the ROM storing the programs may likewise be one or more memories. Another processor such as a GPU (Graphics Processing Unit) may be used instead of or in combination with the CPU. That is, the functions of the units shown in FIG. 2 are realized by at least one processor (hardware) executing programs stored in at least one memory communicably connected to that processor.
The image processing apparatus 1-100 has, as its functional configuration, a reception unit 1-41, a model acquisition unit 1-42, an extraction unit 1-43, an inference unit 1-44, an information acquisition unit 1-45, and a display control unit 1-46. The image processing apparatus 1-100 is communicably connected to the data server 1-120 and the display unit 1-130.
The reception unit 1-41 receives the inference conditions input by the user via the operation unit 1-140. That is, the operation unit 1-140 corresponds to an example of a reception means that accepts the setting of inference conditions.
The inference conditions include at least one of information on the number of inferences (described later), a threshold, and a patch size. The model acquisition unit 1-42 acquires the trained model 1-47 constructed in advance and the inference image from the data server 1-120. The extraction unit 1-43 extracts a plurality of regions of interest from the inference image based on the inference conditions received by the reception unit 1-41; that is, it corresponds to an example of an extraction means that extracts a plurality of regions of interest from an inference image.
Here, a region of interest refers to a portion cut out from the inference image.
The inference unit 1-44 performs inference on each of the plurality of regions of interest using the trained model 1-47 acquired by the model acquisition unit 1-42; that is, it corresponds to an example of an inference means that performs inference on each of a plurality of regions of interest using a common trained model.
The information acquisition unit 1-45 calculates the final inference result based on the inference results obtained by the inference unit 1-44; that is, it corresponds to an example of a calculation means that calculates the final inference result based on a plurality of inference results.
The display control unit 1-46 outputs the information on the inference results obtained in each process to the display unit 1-130 and causes the display unit 1-130 to display the results of each process.
At least some of the units of the image processing apparatus 1-100 may be realized as independent devices. The image processing apparatus 1-100 may be a workstation. The functions of each unit may be realized as software running on a computer, and the software realizing those functions may run on a server over a network such as a cloud. In the embodiment described below, each unit is assumed to be realized by software running on a computer installed in a local environment.
(Processing flow)
Next, the image processing according to Embodiment 1-1 of the present invention will be described. FIG. 3 is a diagram showing the processing procedure executed by the image processing apparatus 1-100 of this embodiment. This embodiment is realized by the CPU 1-31 executing the programs, stored in the ROM 1-33, that implement the functions of the respective units. In this embodiment, an example in which the image to be processed is a TEM image is described. The TEM image is acquired as a two-dimensional grayscale image. Carbon black in a coating film of a melamine-alkyd resin paint is used as an example of the processing target contained in the image to be processed. In this embodiment, ten inference images were used, and the ratio of the maximum area to the minimum area of carbon black within a single image was 30 to 120. In the processing, steps S1-201 to S1-206 are performed for each inference image, but to avoid redundant explanation, the following describes the case where the processing is applied to a single inference image.
In step S1-201, the reception unit 1-41 receives the inference conditions input by the user via the operation unit 1-140. The inference conditions in this embodiment include at least one of information on the number of inferences, a threshold, and a patch size. The information on the number of inferences is, for example, the average number of inferences per pixel or the number of extractions, described later.
In step S1-202, the model acquisition unit 1-42 acquires the trained model constructed in advance and the inference image. The inference image is acquired from the data server 1-120. If a patch size has been set in step S1-201, a trained model trained with the same patch size is acquired. Here, the patch size is the number of pixels in the vertical and horizontal directions of the cropped image when a part of the target image is cut out.
Here, an example of how the trained model 1-47 is constructed will be described. Segmentation is used here as an example of the type of image processing, but the scope of application of this embodiment is not limited by the type of image processing. First, pairs of a TEM image, which is the image to be processed, and a teacher image are prepared; there may be a plurality of pairs. The teacher image is obtained by applying an appropriate image processing method to the image to be processed. For example, it may be an image in which the region to be detected and the other regions are binarized, or an image in which the region to be detected is filled in and the regions not to be detected are left unfilled.
Next, the trained model 1-47 is generated by performing machine learning according to a predetermined algorithm using the images to be processed and the teacher images. In this embodiment, U-Net is used as the predetermined algorithm, and a known technique can be used as the U-Net learning method. As the predetermined algorithm, for example, an SVM (Support Vector Machine), a DNN (Deep Neural Network), a CNN (Convolutional Neural Network), or the like may also be used. As algorithms for semantic segmentation, which classifies the image pixel by pixel, FCN (Fully Convolutional Network), SegNet, and the like can be used in addition to U-Net. Furthermore, an algorithm combining the above with a so-called generative model such as a GAN (Generative Adversarial Networks) may be used. If there are several kinds of processing to be executed, a separate learning model is built for each so that every kind of processing can be executed. Data augmentation may also be performed to increase the amount of data used for learning.
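As an illustration of the training step described above, the following is a minimal sketch of supervised learning for a pixel-wise binary segmentation model in PyTorch. The `model` argument stands in for any U-Net-like network; the function name, hyperparameters, and tensor shapes are assumptions made for illustration and are not part of this disclosure.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_segmentation_model(model, images, masks, epochs=10, lr=1e-3):
    """Train a binary segmentation model (e.g. a U-Net-like network).

    images: float tensor of shape (N, 1, H, W) -- processing-target image patches
    masks:  float tensor of shape (N, 1, H, W) -- binarized teacher images (1 = target region)
    """
    loader = DataLoader(TensorDataset(images, masks), batch_size=4, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()  # pixel-wise binary classification loss
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            # Data augmentation (e.g. random flips of x and y) could be inserted here.
            optimizer.zero_grad()
            loss = criterion(model(x), y)  # model outputs per-pixel logits
            loss.backward()
            optimizer.step()
    return model
```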
The processing of step S1-203 will be described with reference to FIG. 4. Here again, carbon black in a melamine-alkyd resin is the processing target. In step S1-203, the extraction unit 1-43 extracts a plurality of regions of interest from the inference image. FIG. 4 shows an example in which a region of interest 1-540, a region of interest 1-541, and a region of interest 1-542 are extracted for position coordinates 1-530, 1-531, and 1-532, respectively. The inference image in this embodiment is composed of a plurality of pixels whose positions can be specified by two-dimensional orthogonal coordinates (x, y). If the numbers of pixels in the horizontal and vertical directions of the image are x_size and y_size, respectively, then 0 ≤ x ≤ x_size and 0 ≤ y ≤ y_size hold. With the upper left corner of the image as the origin, the x-axis pointing right and the y-axis pointing down, a plurality of mutually different position coordinates (x_i, y_i) (i = 1, 2, ..., N) are set so as to satisfy 0 ≤ x_i ≤ x_size and 0 ≤ y_i ≤ y_size. In this embodiment, pairs of random numbers (x_i, y_i) satisfying these conditions are generated. Next, a region of interest is set with (x_i, y_i) as its upper left coordinate, and its size is made equal to the patch size. In this embodiment, the user sets the average number of inferences per pixel via the operation unit 1-140. The average number of inferences is the average number of times each pixel is extracted during extraction, and it can be obtained by recording, for each pixel, how many times it has been extracted. If (x_i, y_i) is located near an edge of the image and the extracted region would become smaller than the patch size, the region of interest is adjusted to the patch size by, for example, so-called padding, in which the area outside the image is filled with pixel values of 0.
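A hedged sketch of this random extraction of regions of interest, including the zero-padding near the image edges and the per-pixel extraction count used to realize the average number of inferences, might look as follows (NumPy; all names are illustrative and not part of the disclosure).

```python
import numpy as np

def extract_random_patches(image, patch_size, avg_inferences_per_pixel, rng=None):
    """Extract square regions of interest at random upper-left coordinates (Embodiment 1-1 style).

    image: 2-D grayscale array of shape (y_size, x_size).
    Patches extending beyond the image are zero-padded to patch_size x patch_size.
    Extraction stops once the mean per-pixel extraction count reaches avg_inferences_per_pixel.
    """
    if rng is None:
        rng = np.random.default_rng()
    y_size, x_size = image.shape
    counts = np.zeros_like(image, dtype=np.int64)  # how often each pixel has been extracted
    patches, origins = [], []
    while counts.mean() < avg_inferences_per_pixel:
        x0 = int(rng.integers(0, x_size))  # random upper-left corner (x_i, y_i)
        y0 = int(rng.integers(0, y_size))
        patch = np.zeros((patch_size, patch_size), dtype=image.dtype)  # padding with 0
        sub = image[y0:y0 + patch_size, x0:x0 + patch_size]
        patch[:sub.shape[0], :sub.shape[1]] = sub
        counts[y0:y0 + patch_size, x0:x0 + patch_size] += 1
        patches.append(patch)
        origins.append((x0, y0))
    return patches, origins, counts
```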
In step S1-204, the inference unit 1-44 performs inference on each of the plurality of regions of interest extracted in step S1-203 using the trained model 1-47.
In step S1-205, the information acquisition unit 1-45 calculates and acquires the final inference result based on the inference results of step S1-204. In this embodiment, the number of times each pixel is inferred and the number of times it is judged to be carbon black are recorded, and a pixel is finally judged to be carbon black when (number of times judged to be carbon black) / (number of times inferred) is equal to or greater than the threshold. The threshold may be made settable by the user via the operation unit 1-140. If the inference is a regression process rather than a classification, a new threshold is set separately from the above-mentioned threshold, the results are first classified (for example, values at or above this new threshold are treated as carbon black), and the final judgment process is then performed.
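The per-pixel vote described in this step could be implemented roughly as below. The function assumes that each patch prediction is a binary array aligned with its patch origin, which matches the description above; everything else (names, return encoding) is an illustrative assumption.

```python
import numpy as np

def aggregate_votes(pred_patches, origins, image_shape, threshold):
    """Combine per-patch predictions into a final per-pixel decision.

    pred_patches: list of binary arrays (1 = judged to be the target, e.g. carbon black)
    origins:      list of (x0, y0) upper-left coordinates of each patch
    threshold:    minimum ratio (positive votes / number of inferences) for a final positive
    """
    y_size, x_size = image_shape
    inferred = np.zeros(image_shape, dtype=np.int64)  # times each pixel was inferred
    positive = np.zeros(image_shape, dtype=np.int64)  # times it was judged as the target
    for pred, (x0, y0) in zip(pred_patches, origins):
        h = min(pred.shape[0], y_size - y0)
        w = min(pred.shape[1], x_size - x0)
        inferred[y0:y0 + h, x0:x0 + w] += 1
        positive[y0:y0 + h, x0:x0 + w] += pred[:h, :w]
    ratio = np.divide(positive, inferred,
                      out=np.zeros_like(positive, dtype=float), where=inferred > 0)
    # Pixels at or above the threshold are shown with luminance 255, others with 0.
    return (ratio >= threshold).astype(np.uint8) * 255
```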
In step S1-206, the display control unit 1-46 causes the display unit 1-130 to display the final inference result. In this case, the display control unit 1-46 transmits the final inference result to the display unit 1-130 connected to the image processing apparatus 1-100 and controls the display unit 1-130 so as to display it. In this embodiment, whether each pixel is carbon black is judged, and pixels judged to be carbon black are displayed with a luminance of 255 while pixels judged not to be carbon black are displayed with a luminance of 0.
The effect of the image processing apparatus according to Embodiment 1-1 will be described with reference to FIG. 5. In this embodiment, IoU (Intersection over Union) was used as an evaluation index to measure the effect. IoU is defined by equation (1-1).
IoU = TP / (TP + FP + FN)   ... (1-1)
Here, TP (True Positive) is the number of pixels that are carbon black and are judged to be carbon black, FP (False Positive) is the number of pixels that are not carbon black but are judged to be carbon black (the number of false detections), and FN (False Negative) is the number of pixels that are carbon black but are judged not to be carbon black (the number of missed detections).
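Using these definitions of TP, FP, and FN, equation (1-1) can be evaluated for two binary masks as in the following illustrative snippet.

```python
import numpy as np

def iou(pred, truth):
    """IoU of equation (1-1): TP / (TP + FP + FN) for binary masks (True = carbon black)."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()   # correctly detected target pixels
    fp = np.logical_and(pred, ~truth).sum()  # false detections
    fn = np.logical_and(~pred, truth).sum()  # missed target pixels
    denom = tp + fp + fn
    return tp / denom if denom else 0.0
```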
Images with x_size = 1280 and y_size = 960 were used. The patch size was 128 × 128 and the threshold was 0.1. As shown in the graph of FIG. 5, the conventional method gave IoU = 0.61, whereas the IoU increased as the average number of inferences increased, reaching IoU = 0.84 at an average of 30 inferences.
As described above, the image processing apparatus 1-100 in this embodiment can improve inference accuracy by performing inference on each of a plurality of regions of interest using a common trained model. In addition, because the user can set the threshold, the inference behavior can be controlled according to the purpose. For example, the threshold can be lowered to reduce missed detections or raised to reduce false detections, so that inference tailored to the purpose is possible while using the same trained model.
(Embodiment 1-2)
(Overview)
Next, an example of Embodiment 1-2 will be described in detail with reference to the drawings. Descriptions of configurations, functions, and operations that are the same as in the above embodiment are omitted, and mainly the differences from the above embodiment are described.
(Processing flow)
In step S1-201, the reception unit 1-41 receives the inference conditions input by the user via the operation unit 1-140. The inference conditions in this embodiment include at least one of information on the number of inferences, a threshold, and a patch size. The information on the number of inferences is, for example, the number of reference-coordinate settings, described later.
In Embodiment 1-1, all the regions of interest were determined using different random numbers; in Embodiment 1-2, only some regions of interest are determined using random numbers, and the remaining regions are determined mechanically from the coordinates of the regions of interest determined using random numbers. The processing of step S1-203 will be described with reference to FIG. 6. Each region of interest in FIG. 6 has a region that partially overlaps an adjacent region of interest. In step S1-203, the extraction unit 1-43 extracts a plurality of regions of interest from the inference image 1-501. FIG. 6 shows an example in which regions of interest 1-550 to 1-558 are extracted with the reference coordinate 1-560 (x_1, y_1) as the base. Let p_x be the number of pixels of a region of interest in the x-axis direction and p_y the number of pixels in the y-axis direction. A plurality of reference coordinates are denoted (x_j, y_j) (j = 1, 2, ..., N), where each (x_j, y_j) is a pair of random numbers satisfying 0 ≤ x_j ≤ p_x and 0 ≤ y_j ≤ p_y. The upper left coordinates of the other regions of interest are set to (x_j + p_x × m, y_j + p_y × n) (where m is an integer from 1 to x_size/p_x − 1 and n is an integer from 1 to y_size/p_y − 1). In this embodiment, the user sets the number of reference-coordinate settings via the operation unit 1-140. The number of reference-coordinate settings is the number of times the upper-left reference coordinate (x_j, y_j) is set using random numbers during extraction.
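One illustrative reading of this scheme is that each random base coordinate generates a full tiling of the image, and different base coordinates give mutually shifted (and therefore overlapping) tilings; a sketch under that assumption is shown below, with hypothetical names.

```python
import numpy as np

def grid_patch_origins(x_size, y_size, patch_size, n_base_coords, rng=None):
    """Upper-left coordinates for Embodiment 1-2 style extraction.

    For each of n_base_coords random base coordinates (x_j, y_j) with
    0 <= x_j < patch_size and 0 <= y_j < patch_size, the image is tiled with
    patches whose corners are (x_j + patch_size*m, y_j + patch_size*n);
    tilings from different base coordinates overlap one another.
    """
    if rng is None:
        rng = np.random.default_rng()
    origins = []
    for _ in range(n_base_coords):
        xj = int(rng.integers(0, patch_size))
        yj = int(rng.integers(0, patch_size))
        for m in range(x_size // patch_size):      # m = 0 covers the base coordinate itself
            for n in range(y_size // patch_size):
                origins.append((xj + patch_size * m, yj + patch_size * n))
    return origins
```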
The effect of the image processing apparatus according to Embodiment 1-2 will be described with reference to FIG. 7. As in Embodiment 1-1, the evaluation was performed using IoU. Images with x_size = 1280 and y_size = 960 were used, the patch size was 128 × 128, and the threshold was 0.2. As shown in the graph of FIG. 7, the conventional method gave IoU = 0.61, whereas the IoU increased as the number of inferences increased, reaching IoU = 0.86 at 30 reference-coordinate settings.
(Embodiment 1-3)
(Overview)
Next, an example of Embodiment 1-3 will be described in detail with reference to the drawings. Descriptions of configurations, functions, and operations that are the same as in the above embodiments are omitted, and mainly the differences from the above embodiments are described. In this embodiment, red blood cells, white blood cells, and platelets in blood in an optical microscope image are used as examples of the processing targets contained in the image to be processed.
(Processing flow)
In step S1-201, the reception unit 1-41 receives the inference conditions input by the user via the operation unit 1-140. The inference conditions in this embodiment include at least one of information on the number of inferences, a threshold, and a patch size. The information on the number of inferences is, for example, the number of reference-coordinate settings, described later.
In Embodiment 1-1, all the regions of interest were determined using different random numbers; in Embodiment 1-3, only some regions of interest are determined using random numbers, and the remaining regions are determined mechanically from the coordinates of the regions of interest determined using random numbers. The processing of step S1-203 will be described with reference to FIG. 8. Each region of interest in FIG. 8 has a region that partially overlaps an adjacent region of interest. In step S1-203, the extraction unit 1-43 extracts a plurality of regions of interest from the inference image 1-502. FIG. 8 shows an example in which regions of interest 1-560 to 1-568 are extracted with the reference coordinate 1-660 (x_1, y_1) as the base. Let p_x be the number of pixels of a region of interest in the x-axis direction and p_y the number of pixels in the y-axis direction. A plurality of reference coordinates are denoted (x_j, y_j) (j = 1, 2, ..., N), where each (x_j, y_j) is a pair of random numbers satisfying 0 ≤ x_j ≤ p_x and 0 ≤ y_j ≤ p_y. The upper left coordinates of the other regions of interest are set to (x_j + p_x × m, y_j + p_y × n) (where m is an integer from 1 to x_size/p_x − 1 and n is an integer from 1 to y_size/p_y − 1). In this embodiment, the user sets the number of reference-coordinate settings via the operation unit 1-140. The number of reference-coordinate settings is the number of times the upper-left reference coordinate (x_j, y_j) is set using random numbers during extraction.
The effect of the image processing apparatus according to Embodiment 1-3 will be described with reference to FIG. 9. In this embodiment, the evaluation was performed using mIoU, which is defined by equation (1-2).
mIoU = (1/c) × Σ_{k=1}^{c} IoU_k   ... (1-2)
Here, c is the number of classification classes; in this example, c = 3. Images with x_size = 1280 and y_size = 960 were used, the patch size was 128 × 128, and the threshold was 0.2. As shown in the graph of FIG. 9, the conventional method gave mIoU = 0.55, whereas the mIoU increased as the number of inferences increased, reaching mIoU = 0.77 at 30 reference-coordinate settings.
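Equation (1-2) can be evaluated as the class-wise average of IoU over the c classes, as in the following illustrative snippet (integer-valued label maps are assumed).

```python
import numpy as np

def mean_iou(pred, truth, num_classes):
    """mIoU of equation (1-2): the average of per-class IoU over num_classes classes."""
    ious = []
    for c in range(num_classes):
        p = (pred == c)
        t = (truth == c)
        union = np.logical_or(p, t).sum()
        inter = np.logical_and(p, t).sum()
        ious.append(inter / union if union else 0.0)
    return float(np.mean(ious))
```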
(Embodiment 1-4)
(Overview)
Next, an example of Embodiment 1-4 will be described in detail with reference to the drawings. Descriptions of configurations, functions, and operations that are the same as in the above embodiments are omitted, and mainly the differences from the above embodiments are described. In this embodiment, people, automobiles, and roads contained in an image captured with a digital camera are used as examples of the processing targets contained in the image to be processed.
(Processing flow)
In step S1-201, the reception unit 1-41 receives the inference conditions input by the user via the operation unit 1-140. The inference conditions in this embodiment include at least one of information on the number of inferences, a threshold, and a patch size. The information on the number of inferences is, for example, the pitch, described later.
In Embodiment 1-1, all the regions of interest were determined using different random numbers; in Embodiment 1-4, all the regions of interest are determined without using random numbers. The processing of step S1-203 will be described with reference to FIG. 10. In step S1-203, the extraction unit 1-43 extracts a plurality of regions of interest from the inference image. FIG. 10 shows an example in which regions of interest 1-570 to 1-572 are extracted with the reference coordinate 1-580 (x_1, y_1) as the base. A plurality of regions of interest are extracted by shifting the region of interest vertically or horizontally by the pitch. Let p_x be the number of pixels of a region of interest in the x-axis direction, p_y the number of pixels in the y-axis direction, pitch_x the pitch in the x-axis direction, and pitch_y the pitch in the y-axis direction, with 0 < pitch_x < p_x and 0 < pitch_y < p_y. In FIG. 10, the upper left coordinates of the region of interest 1-571 and the region of interest 1-572 are (x_1 + pitch_x, y_1) and (x_1 + 2·pitch_x, y_1), respectively. Although only three regions of interest are shown in FIG. 10, other regions of interest whose upper left coordinates are (x_1 + pitch_x × m, y_1 + pitch_y × n) (where m is an integer from 1 to x_size/pitch_x − 1 and n is an integer from 1 to y_size/pitch_y − 1) may also be extracted.
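A minimal sketch of this pitch-based extraction, assuming the base coordinate (x_1, y_1) and the pitch are given, is shown below; the names are illustrative, and patches near the right and bottom edges would still need the padding described in Embodiment 1-1.

```python
def pitched_patch_origins(x_size, y_size, patch_size, pitch_x, pitch_y, x1=0, y1=0):
    """Upper-left coordinates for Embodiment 1-4 style extraction (no random numbers).

    Starting from the base coordinate (x1, y1), the region of interest is shifted
    by pitch_x horizontally and pitch_y vertically (0 < pitch < patch_size), so
    that neighbouring regions of interest overlap each other.
    """
    n_x = max(1, (x_size - x1) // pitch_x)
    n_y = max(1, (y_size - y1) // pitch_y)
    origins = []
    for n in range(n_y):
        for m in range(n_x):
            origins.append((x1 + pitch_x * m, y1 + pitch_y * n))
    return origins

# Example: a 1280 x 960 image, 128 x 128 patches, and a pitch of 80 in both directions
origins = pitched_patch_origins(1280, 960, 128, 80, 80)
```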
The effect of the image processing apparatus according to Embodiment 1-4 will be described with reference to FIG. 11. As in Embodiment 1-3, mIoU (with c = 3) was used for the evaluation. Images with x_size = 1280 and y_size = 960 were used. The patch size was 128 × 128, the threshold was 0.1, the pitch was set to pitch_x = pitch_y, and its value was varied from 16 to 112. As shown in the graph of FIG. 11, the conventional method gave mIoU = 0.50, whereas the mIoU was greater than 0.50 at every pitch and reached its maximum of mIoU = 0.64 at a pitch of 80.
<Combination of the first embodiment with the second and third embodiments>
The first embodiment of the present invention can be combined with at least one of the second embodiment and the third embodiment of the present invention described later.
That is, when the first embodiment and the second embodiment are combined, a trained model (classifier) trained by the following first learning step and second learning step can be used as the trained model described above. The first learning step performs learning using a first learning data set out of an initial data set containing a plurality of pieces of learning data created from data containing identification target information. The second learning step updates the information contained in the trained model generated by the first learning step, by performing learning using that information and a second learning data set out of the initial data set. The amount of identification target information contained in the first learning data set is larger than the amount of identification target information contained in the second learning data set. The details of the second embodiment are described later and are therefore omitted here.
When the first embodiment and the third embodiment are combined, a trained model (classifier) generated by the following padding step and generation step can be used as the trained model described above. The padding step takes a learning data set group that includes a first learning data set containing learning data composed of input data and teacher data for the input data, and a second learning data set containing a larger number of pieces of learning data than the first learning data set, and performs data augmentation so that the number of pieces of learning data in the first learning data set becomes equal to or larger than the number in the second learning data set. The generation step generates the trained model using the learning data set group including the augmented learning data. The amount of identification target information contained in the input data of the first learning data set is larger than the amount of identification target information contained in the input data of the second learning data set. The details of the third embodiment are described later and are therefore omitted here.
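An illustrative sketch of this padding (data augmentation) step, assuming each learning data set is simply a list of input/teacher pairs and that some augmentation function is available, might look as follows; the names are hypothetical.

```python
import random

def balance_by_augmentation(first_set, second_set, augment):
    """Illustrative sketch of the padding step described above.

    first_set / second_set: lists of (input_data, teacher_data) pairs, where the first
    set has more identification-target information per item but fewer items.
    augment: a function returning a perturbed copy of a learning-data pair
             (e.g. a flipped or rotated image pair).
    """
    augmented = list(first_set)
    while len(augmented) < len(second_set):
        augmented.append(augment(random.choice(first_set)))  # inflate the smaller set
    return augmented, list(second_set)
```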
Furthermore, when the first embodiment, the second embodiment, and the third embodiment are combined, the above-mentioned first learning step, second learning step, padding step, and generation step are performed when generating the trained model of the first embodiment.
<Other Embodiments>
The image processing apparatus and image processing system in each of the embodiments described above may be realized as a single device, or the above processing may be executed by combining devices including a plurality of information acquisition devices so that they can communicate with each other; both forms are included in the embodiments of the present invention. The above processing may also be executed by a common server device or a server group. In that case, the common server device corresponds to the image processing apparatus according to the embodiment, and the server group corresponds to the image processing system according to the embodiment. The plurality of devices constituting the image processing apparatus and the image processing system only need to be able to communicate at a predetermined communication rate; they need not be located in the same facility or in the same country.
Although the embodiments have been described in detail above, the present invention can take the form of, for example, a system, a device, a method, a program, or a recording medium (storage medium). Specifically, it may be applied to a system composed of a plurality of devices (for example, a host computer, an interface device, an imaging device, a Web application, and the like) or to an apparatus consisting of a single device.
Needless to say, the object of the present invention can also be achieved as follows. That is, a recording medium (or storage medium) on which the program code (computer program) of software realizing the functions of the above-described embodiments is recorded is supplied to a system or an apparatus. Such a storage medium is, of course, a computer-readable storage medium. A computer (or a CPU or GPU) of the system or apparatus then reads and executes the program code stored in the recording medium. In this case, the program code read from the recording medium itself realizes the functions of the above-described embodiments, and the recording medium on which the program code is recorded constitutes the present invention.
Although the preferred embodiments of the present invention have been described in detail above, the present invention is not limited to these specific embodiments, and various modifications and changes are possible within the scope of the gist of the present invention described in the claims. For example, preprocessing and postprocessing may be added as appropriate.
Forms obtained by appropriately combining the above-described embodiments are also included in the embodiments of the present invention.
<< Second Embodiment >>
(Background of the second embodiment)
In recent years, many attempts have been made to process various kinds of data using deep learning and obtain useful information; image processing, speech processing, and text processing are known examples. Although identification accuracy has improved through the use of deep learning, various efforts are being made to improve it further.
Japanese Patent Application Laid-Open No. 2019-118670 (Reference 2-1) describes a diagnosis support apparatus that supports the diagnosis of a diseased area using deep learning. This technique makes highly accurate diagnosis possible by normalizing the color and luminance of an image in advance to separate diseased parts from non-diseased parts.
In addition, Sakamoto, M., Nakano, H., Zhao, K. and Sekiyama, T.: Multi-stage neural networks with single-sided classifiers for false positive reduction and its evaluation using Lung X-ray CT Images, Image Analysis and Processing - ICIAP 2017, pp. 370-379 (2017) (Reference 2-2) describes a technique for accurately identifying nodules in nodule candidate images by learning with a cascade-type classifier, in which a plurality of classifiers are connected, while removing samples that are clearly normal.
(Problem to be solved by the second embodiment)
As a result of the present inventors' study, it was found that identification is difficult with the methods described in Reference 2-1 and Reference 2-2 when a plurality of pieces of information to be identified (hereinafter referred to as "identification target information") exist in one piece of data, or when the identification target information is hard to distinguish from other information. In addition, when the amount of identification target information differs greatly from one piece of data to another, it has been difficult with conventional methods to construct a classifier that can accurately identify the identification target information regardless of how much of it there is.
Therefore, an object of the second embodiment is to provide a classifier generation method that can accurately identify identification target information even when a plurality of pieces of identification target information exist in one piece of data or when the identification target information is hard to distinguish from other information. Another object of the present invention is to provide an identification method and an identification apparatus using a classifier generated by this classifier generation method.
(Outline of the second embodiment)
The classifier generation method according to this embodiment has a first learning step of performing learning using a first learning data set out of an initial data set containing a plurality of pieces of learning data created from data. It further has a second learning step of updating the information contained in the classifier generated by the first learning step, by performing learning using that information and a second learning data set out of the initial data set. Here, the amount of identification target information contained in the first learning data set is larger than the amount of identification target information contained in the second learning data set. The classifier is thus trained in two stages, learning first from the data set with the larger amount of identification target information. This makes it possible to first learn parameters for image conversion with a large degree of conversion and then change those parameters gradually, so that the identification target information can be identified with high accuracy.
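The order of the two learning steps can be pictured with the following illustrative sketch, in which `train_fn` stands for any ordinary supervised training routine; the function names and interface are assumptions made only to show the sequence of the steps.

```python
def train_in_two_steps(model, first_dataset, second_dataset, train_fn):
    """Illustrative sketch of the two learning steps described above.

    first_dataset:  learning data whose identification-target information is larger
    second_dataset: learning data whose identification-target information is smaller
    train_fn(model, dataset) -> model : any ordinary supervised training routine
    """
    # First learning step: learn from the data set with more identification-target information.
    model = train_fn(model, first_dataset)
    # Second learning step: update the information held by the classifier using the
    # data set with less identification-target information.
    model = train_fn(model, second_dataset)
    return model
```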
[Data]
Data is a representation of information that is formalized so as to be suitable for transmission, interpretation, or processing, and that can be interpreted as information again. Examples of data include image data, audio data, and text data.
[Identification target information]
Identification target information is the information to be identified in the data. When the data is image data, the identification target information is, for example, at least one of the position, area, and distribution of the identification target region in the image data. The classifier generated by the generation method according to this embodiment can estimate and extract identification target regions in image data that would be difficult for a user to extract visually.
When the data is audio data, the identification target information is, for example, at least one of the frequency and the intensity of the identification target sound in the audio data. The classifier generated by the generation method according to this embodiment can estimate and extract the identification target sound from noisy sound data from which it would be difficult for a user to extract it. When the sound data is speech data of a plurality of speakers, the speech data of at least one speaker can be used as the identification target information.
 データがテキストデータである場合、例えば、テキストデータ中の識別対象文字の文字、及び文字列の少なくともいずれか1つの情報が識別対象情報である。本実施形態に係る生成方法で生成された識別器は、ユーザにとって抽出困難な、テキストデータ中の識別対象の文字列を推定し、抽出することができる。 When the data is text data, for example, at least one of the characters of the identification target character and the character string in the text data is the identification target information. The classifier generated by the generation method according to the present embodiment can estimate and extract a character string to be identified in text data, which is difficult for the user to extract.
 [Amount of identification target information]
 The amount of identification target information contained in a training data set is the total amount of identification target information contained in the training data set divided by the number of pieces of training data contained in the training data set (that is, the average value). Here, a piece of training data is a pair of input data and teacher data, and a training data set contains a plurality of pieces of training data. When the data is image data, the amount of identification target information contained in a training data set is, for example, the area of the identification target region in each image, which can be calculated from the number of pixels belonging to that region.
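 For image data, this average can be computed directly from the teacher images. The following is a minimal sketch (not taken from the patent itself); it assumes NumPy, RGB teacher images, and that the identification target region is painted with a single known color, as in the embodiments described below. All function names are illustrative.

```python
import numpy as np

def target_pixel_count(teacher_img, target_color=(0, 255, 0)):
    """Count pixels in an RGB teacher image painted with the target color."""
    mask = np.all(teacher_img == np.asarray(target_color), axis=-1)
    return int(mask.sum())

def dataset_target_amount(teacher_images, target_color=(0, 255, 0)):
    """Average amount of identification target information over a data set."""
    counts = [target_pixel_count(img, target_color) for img in teacher_images]
    return sum(counts) / len(counts)
```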
 When the data is audio data, the amount of identification target information is, for example, the length of the identification target information in data segments delimited at breaks in the audio. When selecting data sets, an initial data set may be formed as a collection of such segments, with the input data and the teacher data each delimited at breaks in the audio, and the segments may then be sorted, for example, in descending order of the difference between the input data signal and the teacher data signal.
 When the data is text data, the amount of identification target information is, for example, the number of characters in the character string to be identified in the text data. When selecting data sets, an initial data set may be formed as a collection of segments, with the input data and the teacher data each delimited at breaks in the sentences, and the segments may then be sorted, for example, in descending order of the difference between the input text and the teacher text.
 (Embodiment 2-1)
 A case in which the amount of identification target information is the amount of resin in a transmission electron microscope (TEM) image is described below; however, the scope of application of this embodiment is not limited to this type of data acquisition method. The device configuration and functional configuration of the learning system and the processing procedure of the learning device are described concretely below.
 (Device configuration of the learning system)
 FIG. 12 is a diagram showing an example of the device configuration of the learning system (classifier generation system) according to Embodiment 2-1. The learning system 2-190, which is composed of a learning device (classifier generation device) 2-100 and the devices connected to the learning device 2-100, is described in detail below. The learning system 2-190 has a learning device 2-100 that performs learning, a data acquisition device 2-110 that acquires data, and a data server 2-120 that stores the acquired data. The learning system 2-190 further has a data processing device 2-130 that processes the data to create teacher data, a display unit 2-140 that displays the acquired input data and the learning results, and an operation unit 2-150 for inputting instructions from the user.
 The learning device 2-100 acquires pairs of input data and teacher data (training data), the teacher data being created by processing the input data with the data processing device 2-130. A training data set containing a plurality of pieces of training data created in this way is the initial data set. Training data sets are obtained from the initial data set and learning is performed. The data acquisition device 2-110 in the present embodiment is a transmission electron microscope (TEM), and the input data are TEM images. The processing performed by the data processing device 2-130 is described later. Each part constituting the learning system is described below.
 The learning device 2-100 is, for example, a computer, and performs the learning according to the present embodiment. The learning device 2-100 has at least a CPU 2-31, a communication IF 2-32, a ROM 2-33, a RAM 2-34, a storage unit 2-35, and a common bus 2-36. The CPU 2-31 controls the operation of each component of the learning device 2-100 in an integrated manner. Under the control of the CPU 2-31, the learning device 2-100 may also control the operation of the data acquisition device 2-110 and the data processing device 2-130.
 The data server 2-120 holds the data acquired by the data acquisition device 2-110. The data processing device 2-130 processes the input data held in the database so that they can be used for learning. The communication IF (Interface) 2-32 is realized by, for example, a LAN card, and handles communication between an external device (for example, the data server 2-120) and the learning device 2-100. The ROM 2-33 is realized by a non-volatile memory or the like, stores the control programs executed by the CPU 2-31, and provides a work area when the CPU 2-31 executes a program. The RAM (Random Access Memory) 2-34 is realized by a volatile memory or the like and temporarily stores various kinds of information.
 The storage unit 2-35 is realized by, for example, an HDD (Hard Disk Drive) and stores the operating system (OS), device drivers for peripheral devices, and various application software including a program for performing the learning according to the present embodiment described later.
 The operation unit 2-150 is realized by, for example, a keyboard or a mouse, and inputs instructions from the user into the device. The display unit 2-140 is realized by, for example, a display, and presents various kinds of information to the user. The operation unit 2-150 and the display unit 2-140 provide a GUI (Graphical User Interface) under the control of the CPU 2-31. The display unit 2-140 may be a touch panel monitor that accepts operation input, and the operation unit 2-150 may be a stylus pen. The components of the learning device 2-100 are communicably connected to one another via the common bus 2-36.
 The data acquisition device 2-110 is, for example, a scanning electron microscope (SEM), a transmission electron microscope (TEM), an optical microscope, a digital camera, or a smartphone. The data acquisition device 2-110 transmits the acquired data to the data server 2-120. A data acquisition control unit (not shown) that controls the data acquisition device 2-110 may be included in the learning device 2-100.
 (Functional configuration of the learning system)
 FIG. 13 is a diagram showing an example of the functional configuration of the learning system according to Embodiment 2-1. The functional configuration of the learning system is described below with reference to FIG. 13. The functions of the units shown in FIG. 13 are realized by the CPU 2-31 executing a program stored in the ROM 2-33. The program may be executed by one or more CPUs, and the program may be stored in one or more memories. Another processor such as a GPU (Graphics Processing Unit) may be used instead of, or together with, the CPU. That is, the functions of the units shown in FIG. 13 are realized by at least one processor (hardware) executing a program stored in at least one memory communicably connected to that processor.
 The learning device 2-100 has at least a reception unit 2-41, an acquisition unit 2-42, a selection unit 2-43, a learning unit 2-44, a classifier 2-45, a display control unit 2-48, and a display unit 2-140. The learning device 2-100 is communicably connected to the data server 2-120 and the display unit 2-140.
 The reception unit 2-41 receives data set selection conditions (described later) via the operation unit 2-150.
 The acquisition unit 2-42 acquires the initial data set from the data server 2-120.
 The selection unit 2-43 processes the initial data set acquired by the acquisition unit 2-42 and selects a first training data set and a second training data set.
 The learning unit 2-44 sequentially performs learning using the first training data set and the second training data set obtained by the selection unit 2-43. That is, it performs a first learning step of training with at least the first training data set, and a second learning step of updating the information contained in the classifier by training with the information contained in the classifier generated in the first learning step and with the second training data set. The information contained in the classifier generated in the first learning step is stored in an information storage unit within the classifier.
 At least some of the units of the learning device 2-100 may be realized as independent devices. The learning device 2-100 may be a workstation. The functions of the units may be realized as software running on a computer, and the software realizing the functions of the units may run on a server via a network such as a cloud. In the present embodiment described below, it is assumed that each unit is realized by software running on a computer installed in a local environment.
 (Processing procedure of the learning device)
 FIG. 14 is a flow chart showing an example of the classifier generation method according to Embodiment 2-1. The processing procedure of the learning device is described below. The present embodiment is realized by the CPU 2-31 executing programs, stored in the ROM 2-33, that realize the functions of the units. In the present embodiment, the image to be processed is described as a TEM image, which is acquired as a two-dimensional gray-scale image. As an example, carbon black in a coating film of a melamine-alkyd resin paint is described as the identification target information. In the present embodiment, an initial data set of 2000 images of size 128 × 128, forming 1000 pairs, was used, split 8:2 into a training portion and an evaluation portion.
 First, an example of how to construct the training data sets is described. Here, segmentation is used as an example of the type of identification, but the scope of application of the present embodiment is not limited to this type of processing. A training data set contains pieces of training data. A piece of training data is composed of input data and teacher data corresponding to the input data. When the input data is image data (input image data), the teacher data is the image data with the identification target information attached, for example image data in which the identification target region is indicated.
 First, a plurality of TEM images, which are the identification target images, are prepared. Next, a correct-answer image is created for each image. Here, a correct-answer image is obtained by processing the identification target information in the identification target image with an appropriate image processing method, for example an image in which the identification target information and the other information are binarized, or an image in which the identification target information is painted over. In the present embodiment, an image in which the carbon black in the TEM image is painted with the luminance value (0, 255, 0) is used in the description.
 In step S2-201, the reception unit 2-41 receives the data set selection conditions via the operation unit 2-150. The data set selection conditions are input by the user. In the present embodiment, the data set selection conditions include at least a method of partitioning the initial data set, information on which of the partitioned data sets are used for learning, and the learning order. Here, the method of partitioning the data set is partitioning by a threshold on the amount of identification target information. The amount of identification target information is defined as the number of pixels painted with the luminance value (0, 255, 0). The threshold value is set to 5000 pixels.
 In step S2-202, the acquisition unit 2-42 acquires the initial data set from the data server 2-120.
 In step S2-203, the selection unit 2-43 processes the initial data set acquired by the acquisition unit 2-42 and selects the first training data set and the second training data set.
 The processing of step S2-203 is as follows. Here too, the carbon black in the melamine-alkyd resin is the identification target information. First, the data are sorted in descending order of the amount of identification target information, that is, in descending order of the number of pixels painted with the luminance value (0, 255, 0).
 Next, the data are partitioned according to the threshold received by the reception unit 2-41, and the learning process is determined according to the information on the data sets to be used for learning and the learning order received by the reception unit 2-41. Here, the data set containing the images whose amount of identification target information is 5000 pixels or more is used as the first training data set, and the data set containing the images whose amount of identification target information is 0 pixels or more is used as the second training data set. Because the second training data set thus includes the first training data set, a classifier with higher identification accuracy can be generated.
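 As an illustrative sketch of the selection in step S2-203 (not part of the original text; NumPy is assumed and all function and variable names are hypothetical), the training pairs can be sorted by target pixel count and partitioned by the threshold as follows:

```python
import numpy as np

def select_datasets(pairs, threshold=5000, target_color=(0, 255, 0)):
    """pairs: list of (input_image, teacher_image) arrays.
    Returns (first_set, second_set) partitioned by target pixel count."""
    def count(teacher):
        return int(np.all(teacher == np.asarray(target_color), axis=-1).sum())

    # Sort in descending order of the amount of identification target information.
    ordered = sorted(pairs, key=lambda p: count(p[1]), reverse=True)

    # First set: images with at least `threshold` target pixels.
    first_set = [p for p in ordered if count(p[1]) >= threshold]
    # Second set: images with 0 or more target pixels, i.e. the whole sorted set,
    # so the second set includes the first set.
    second_set = ordered
    return first_set, second_set
```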
 In step S2-204, the learning unit 2-44 performs learning using the first training data set selected by the selection unit. Here, learning means generating a classifier by performing machine learning according to a predetermined algorithm using a training data set. In the present embodiment, U-Net is used as the predetermined algorithm. Since the training method of U-Net is a well-known technique, a detailed description is omitted in the present embodiment.
 As the predetermined algorithm, for example, SVM (Support Vector Machine), DNN (Deep Neural Network), or CNN (Convolutional Neural Network) may also be used. As algorithms used for semantic segmentation, which classifies the image pixel by pixel, FCN (Fully Convolutional Network), SegNet, and the like can also be used in addition to U-Net.
 Furthermore, an algorithm that combines the above algorithms with a so-called generative model such as GAN (Generative Adversarial Networks) may be used. When there are several kinds of processing to be performed, a separate learning model is constructed so that each kind of processing can be performed. In addition, data augmentation may be performed to increase the amount of data used for learning.
 Data augmentation in the present embodiment means generating new data for learning and increasing the amount of data by performing, for example, at least one of rotation, flipping, luminance conversion, distortion, enlargement, and reduction. When the input data is audio data, new data for learning can be generated and the amount of data increased by adding to the input data a sound that combines tones of one or more frequencies.
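 A minimal sketch of such image augmentation (illustrative only, not the patent's implementation; NumPy is assumed, and the amount of brightness jitter is an arbitrary choice):

```python
import numpy as np

def augment(image, teacher, rng=np.random.default_rng()):
    """Return one augmented (image, teacher) pair using rotation, flipping,
    and luminance conversion; geometric operations are applied to both images."""
    k = rng.integers(0, 4)              # rotate by 0/90/180/270 degrees
    image, teacher = np.rot90(image, k), np.rot90(teacher, k)
    if rng.random() < 0.5:              # horizontal flip
        image, teacher = image[:, ::-1], teacher[:, ::-1]
    gain = rng.uniform(0.8, 1.2)        # luminance conversion on the input only
    image = np.clip(image.astype(np.float32) * gain, 0, 255).astype(np.uint8)
    return image, teacher
```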
 To suppress overfitting, it is preferable to split the initial data set in advance into a training data set and an evaluation data set.
 In step S2-205, the information generated in step S2-204 is stored in the information storage unit 2-46 of the classifier.
 In step S2-206, learning is performed using the information contained in the classifier stored in step S2-205 and the second training data set. Here, the information contained in the classifier refers to the structure, weights, biases, and the like of the model. The weights and biases are the parameters used when computing the output from the input; for example, in the case of a neural network, when x in equation (2-1) is the input, w is the weight and b is the bias. In this example, the model structure is not changed, and training is performed so as to optimize the weights and biases for the second training data set.
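 The two-stage training of steps S2-204 to S2-206 could be sketched as follows (illustrative only, not the patent's implementation; PyTorch is assumed, the small convolutional model stands in for the U-Net actually used, and first_dataset and second_dataset are assumed to be existing torch Dataset objects built from the selected training data sets):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def build_model():
    # Stand-in for a U-Net style segmentation network (2 output classes).
    return nn.Sequential(
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 2, 1),
    )

def train(model, dataset, epochs=10, lr=1e-3):
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:                 # x: images, y: integer per-pixel class labels
            optim.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optim.step()
    return model

model = build_model()
model = train(model, first_dataset)                      # step S2-204
torch.save(model.state_dict(), "stage1.pt")              # step S2-205: store weights/biases
model.load_state_dict(torch.load("stage1.pt"))           # step S2-206: reuse the stored
model = train(model, second_dataset)                     # information and update it
```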
 y = wx + b    (2-1)
 After the learning is completed, the display control unit 2-48 causes the display unit 2-140 to display the learning result. In this case, the display control unit 2-48 transmits the learning result to the display unit 2-140 connected to the learning device 2-100 and controls the display unit 2-140 to display it. In the present embodiment, the progress of learning can be checked by displaying the input image, the correct-answer image, and an image obtained by inference with the generated classifier side by side. To check the progress of learning in more detail, the value of IoU (described later) may also be displayed.
 The effect of the learning device according to Embodiment 2-1 is described next. In the present embodiment, IoU (Intersection over Union) was used as the evaluation index for measuring the effect. IoU is defined by equation (2-2).
 IoU = TP / (TP + FP + FN)    (2-2)
 Here, TP (True Positive) is the number of pixels that are carbon black and are determined to be carbon black, and FP (False Positive) is the number of pixels that are not carbon black but are determined to be carbon black (the number of false detections). FN (False Negative) is the number of pixels that are carbon black but are determined not to be carbon black (the number of missed detections). Here, IoUavg, obtained by computing the IoU value for each of the 400 evaluation images and averaging them, was used. With the conventional method, IoUavg = 0.11, whereas with the present method, IoUavg = 0.36.
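 An illustrative computation of IoU and IoUavg from binary masks (not part of the original text; NumPy is assumed):

```python
import numpy as np

def iou(pred, truth):
    """pred, truth: boolean masks of the identification target region."""
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    denom = tp + fp + fn
    return tp / denom if denom > 0 else 1.0   # empty masks count as a perfect match

def iou_avg(preds, truths):
    """Average IoU over the evaluation images, i.e. IoUavg."""
    return float(np.mean([iou(p, t) for p, t in zip(preds, truths)]))
```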
 As described above, the learning device 2-100 in the present embodiment learns sequentially, starting with the data set that has the larger amount of identification target information. Therefore, parameters of an image conversion with a large degree of conversion are learned first and can then be changed gradually, so that the identification target information can be identified accurately.
 (Embodiment 2-2)
 (Overview)
 An example of Embodiment 2-2 is described in detail with reference to the drawings. Descriptions of configurations, functions, and operations that are the same as in the above embodiment are omitted, and mainly the differences from the above embodiment are described.
 (Functional configuration of the learning system)
 FIG. 15 is a diagram showing an example of the functional configuration of the learning system (classifier generation system) according to Embodiment 2-2. The learning device (classifier generation device) 2-200 has at least a reception unit 2-41, an acquisition unit 2-42, a selection unit 2-43, a learning unit 2-44, a classifier 2-45, a display control unit 2-48, a data expansion unit 2-49, and a display unit 2-140.
 The data expansion unit 2-49 expands the initial data set acquired by the acquisition unit 2-42. That is, the data expansion unit 2-49 can increase the number of input data images.
 (Processing procedure of the learning device)
 FIG. 16 is a flow chart showing an example of the classifier generation method according to Embodiment 2-2. In step S2-301, the reception unit 2-41 receives the data set selection conditions via the operation unit 2-150. The data set selection conditions are input by the user. In the present embodiment, the data set selection conditions include at least the number of expansions per image in the initial data set, the patch size, the method of partitioning the data set, information on which of the partitioned data sets are used for learning, and the learning order. The patch size is the number of vertical and horizontal pixels of the selected image when a part of an image is selected. Here, the method of partitioning the data set is partitioning by thresholds on the amount of identification target information. The amount of identification target information is defined as the number of pixels painted with the luminance value (0, 255, 0). Two threshold values, 5000 pixels and 1000 pixels, are used.
 In step S2-302, the acquisition unit 2-42 acquires the initial data set from the data server 2-120.
 In step S2-303, the data expansion unit 2-49 expands the initial data set acquired by the acquisition unit 2-42. In the present embodiment, 2000 input data are generated by cutting out 100 patches of patch size 128 × 128 from each of the 20 pairs (40 images) of size 1280 × 960 in the initial data set. The data were split 8:2 into a training portion and an evaluation portion.
 FIG. 17 is a diagram showing an example of the data expansion processing procedure according to Embodiment 2-2. The processing of step S2-303 is described with reference to FIG. 17. Here too, the carbon black in the melamine-alkyd resin is the identification target information. In step S2-303, the data expansion unit 2-49 expands the data by extracting a plurality of regions of interest from the initial data set.
 FIG. 17 shows an example in which a region of interest 2-540, a region of interest 2-541, and a region of interest 2-542 are extracted for position coordinates 2-530, 2-531, and 2-532, respectively. The input image in the present embodiment is composed of a plurality of pixels whose positions are specified by two-dimensional orthogonal coordinates (x, y). If the numbers of pixels in the horizontal and vertical directions of the image are x_size and y_size, respectively, then 0 ≤ x ≤ x_size and 0 ≤ y ≤ y_size hold. Taking the upper left of the image as the origin, with the x axis to the right and the y axis downward, a plurality of mutually different position coordinates (x_i, y_i) (i = 1, 2, ..., N) are set so as to satisfy 0 ≤ x_i ≤ x_size and 0 ≤ y_i ≤ y_size. In the present embodiment, pairs of random numbers (x_i, y_i) satisfying these conditions are generated. Next, a region of interest is set with (x_i, y_i) as its upper-left coordinate, and the size of the region of interest is made equal to the patch size. When (x_i, y_i) lies near an edge of the image and the region of interest would become smaller than the patch size, the size of the region of interest is adjusted to match the patch size by, for example, so-called padding processing that fills the area outside the image with pixel value 0.
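 A minimal sketch of this random patch extraction with zero padding (illustrative only; NumPy is assumed and gray-scale 2-D images are used for simplicity):

```python
import numpy as np

def extract_patches(image, teacher, patch=128, n=100, rng=np.random.default_rng()):
    """Cut n patches of size patch x patch at random upper-left coordinates,
    zero-padding when a patch extends past the image border."""
    h, w = image.shape[:2]
    pairs = []
    for _ in range(n):
        x, y = int(rng.integers(0, w + 1)), int(rng.integers(0, h + 1))
        img_patch = np.zeros((patch, patch), dtype=image.dtype)
        tch_patch = np.zeros((patch, patch), dtype=teacher.dtype)
        ys, xs = min(patch, h - y), min(patch, w - x)
        img_patch[:ys, :xs] = image[y:y + ys, x:x + xs]
        tch_patch[:ys, :xs] = teacher[y:y + ys, x:x + xs]
        pairs.append((img_patch, tch_patch))
    return pairs
```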
 In step S2-304, the selection unit 2-43 processes the initial data set acquired by the acquisition unit 2-42 and expanded in step S2-303, and selects the first training data set and the second training data set.
 The processing of step S2-304 is as follows. First, the data are sorted in descending order of the amount of identification target information, that is, in descending order of the number of pixels painted with the luminance value (0, 255, 0). Next, the data are partitioned according to the thresholds received by the reception unit 2-41, and the learning process is determined according to the information on the data sets to be used for learning and the learning order received by the reception unit 2-41. Here, the data set containing the images whose amount of identification target information is 5000 pixels or more is used as the first training data set, and the data set containing the images whose amount of identification target information is 1000 pixels or more is used as the second training data set. A data set containing the images whose amount of identification target information is 0 pixels or more may further be used for learning as a third training data set. In this way, the second training data set includes the first training data set, and the third training data set includes the first and second training data sets. This makes it possible to generate a classifier with higher identification accuracy.
 In step S2-305, the learning unit 2-44 performs learning using the first training data set selected by the selection unit.
 In step S2-306, the information generated in step S2-305 is stored in the information storage unit 2-46 of the classifier.
 In step S2-307, the information contained in the classifier is updated by performing learning using the information contained in the classifier stored in the information storage unit in step S2-306 and the second training data set. Here, the information contained in the classifier refers to the structure, weights, biases, and the like of the model. Further learning may then be performed using the information contained in the classifier generated by the learning with the second training data set and the third training data set. The number of data sets may be larger, but when the number of learning steps is n (n is an integer of 2 or more), the amount of identification target information preferably decreases monotonically as n increases; that is, the slope obtained by plotting the amount of identification target information against the learning step number is preferably negative. After the learning is completed, the display control unit 2-48 causes the display unit 2-140 to display the learning result.
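 The n-stage variant with monotonically decreasing thresholds could be sketched as follows (illustrative only; train_fn and count_fn are placeholders for a training routine and a target-amount measurement such as those sketched earlier, and the threshold list is just an example that matches this embodiment):

```python
def staged_training(model, pairs, train_fn, count_fn, thresholds=(5000, 1000, 0)):
    """Train in stages with monotonically decreasing thresholds on the amount of
    identification target information; later stages include the earlier data."""
    assert list(thresholds) == sorted(thresholds, reverse=True)
    for thr in thresholds:
        stage_set = [p for p in pairs if count_fn(p[1]) >= thr]
        model = train_fn(model, stage_set)   # weights and biases carry over between stages
    return model
```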
 The effect of the learning device according to Embodiment 2-2 is described next. In the present embodiment, 400 of the 2000 images generated by expanding the initial data set were used as evaluation images, and IoUavg was used as in Embodiment 2-1. With the conventional method, IoUavg = 0.13, whereas with the present method, IoUavg = 0.56.
 As described above, the learning device 2-200 in the present embodiment can identify the identification target information accurately by learning sequentially, starting with the data set that has the larger amount of identification target information.
 In addition, by expanding the data, the method can also be applied when a large number of input data cannot be prepared.
 (Embodiment 2-3)
 (Overview)
 Next, an example of Embodiment 2-3 is described in detail with reference to the drawings. Descriptions of configurations, functions, and operations that are the same as in the above embodiments are omitted, and mainly the differences from the above embodiments are described. In the present embodiment, red blood cells, white blood cells, and platelets in blood in an optical microscope image are described as an example of the objects to be processed contained in the image to be processed.
 In the present embodiment, the data sets are selected automatically, and the learning process is repeated until the evaluation value reaches a target value. Images in which the red blood cell portions are painted with the luminance value (255, 0, 0), the white blood cell portions with the luminance value (0, 255, 0), and the platelet portions with the luminance value (0, 0, 255) were used as the correct-answer data. An initial data set of 2000 images of size 128 × 128, forming 1000 pairs, was used, split 8:2 into a training portion and an evaluation portion. FIG. 18 is a diagram showing an example of the input data according to Embodiment 2-3.
 (Functional configuration of the learning system)
 FIG. 19 is a diagram showing an example of the functional configuration of the learning system (classifier generation system) according to Embodiment 2-3. The learning device (classifier generation device) 2-300 has, as its functional configuration, at least a reception unit 2-41, an acquisition unit 2-42, a selection unit 2-43, a learning unit 2-44, a classifier 2-45, an evaluation unit 2-50, a display control unit 2-48, and a display unit 2-140.
 The evaluation unit 2-50 performs inference using the classifier stored in the classifier 2-45; when the IoUavg value is higher than the target value, learning ends, and when the IoUavg value is lower than the target value, the learning process is repeated.
 (Processing procedure of the learning device)
 FIG. 20 is a flow chart showing an example of the classifier generation method according to Embodiment 2-3. In step S2-401, the reception unit 2-41 receives the data set selection conditions via the operation unit 2-150. The data set selection conditions are input by the user. In the present embodiment, the selection conditions include at least the target value of IoU, the upper limit of the learning time, and the initial value of the class width. Here, the method of partitioning the initial data set is to divide the initial data set into classes (bins) according to the amount of identification target information. The initial value of the class width is 1000.
 In step S2-402, the acquisition unit 2-42 acquires the initial data set from the data server 2-120.
 In step S2-403, the selection unit 2-43 processes the initial data set acquired by the acquisition unit 2-42 and selects the first training data set and the second training data set.
 The processing of step S2-403 is described with reference to FIG. 19. The selection unit 2-43 divides the initial data set into classes according to the initial value of the class width received by the reception unit 2-41. Here, the data belonging to the class with the largest amount of identification target information is used as the first training data set, and the data belonging to the class with the largest amount of identification target information combined with the data belonging to the class with the second largest amount is used as the second training data set.
 In step S2-404, the learning unit 2-44 performs learning using the first training data set selected by the selection unit.
 In step S2-405, the information generated in step S2-404 is stored in the information storage unit 2-46.
 In step S2-406, the information contained in the classifier is updated by performing learning using the information contained in the classifier stored in the information storage unit in step S2-405 and the second training data set. Here, the information contained in the classifier refers to the structure, weights, biases, and the like of the model.
 In step S2-407, the evaluation unit 2-50 performs inference using the classifier 2-45; when the IoUavg value is higher than the target value, learning ends, and when the IoUavg value is lower than the target value, the learning process is repeated. After the learning is completed, the display control unit 2-48 causes the display unit 2-140 to display the learning result.
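 The stop condition of step S2-407 could be expressed as a simple loop (illustrative only; train_fn and evaluate_fn are placeholders for one pass of steps S2-403 to S2-406 and for the IoUavg evaluation, and the target value and time limit shown are arbitrary examples):

```python
import time

def learn_until_target(train_fn, evaluate_fn, target_iou=0.4, time_limit_s=3600):
    """Repeat the learning process until IoUavg exceeds the target value or the
    upper-limit learning time is reached (step S2-407)."""
    start = time.monotonic()
    model, score = None, 0.0
    while score < target_iou and time.monotonic() - start < time_limit_s:
        model = train_fn()            # one pass of steps S2-403 to S2-406
        score = evaluate_fn(model)    # IoUavg on the evaluation images
    return model, score
```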
 The effect of the image processing device according to Embodiment 2-3 is described next. In the present embodiment, evaluation was performed using mIoU, which is defined by equation (2-3).
 mIoU = (1/c) Σ_{i=1}^{c} IoU_i    (2-3)
 Here, c is the number of classification classes; in this example, c = 3. With the conventional method, IoUavg = 0.08, whereas with the present method, IoUavg = 0.45.
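 An illustrative per-class computation of mIoU (not part of the original text; NumPy is assumed and label images are assumed to hold integer class indices 0 to c-1):

```python
import numpy as np

def mean_iou(pred, truth, num_classes=3):
    """mIoU over num_classes classes for integer label maps pred and truth."""
    ious = []
    for cls in range(num_classes):
        p, t = (pred == cls), (truth == cls)
        union = np.logical_or(p, t).sum()
        inter = np.logical_and(p, t).sum()
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))
```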
 As described above, according to the second embodiment of the present invention, it is possible to provide a classifier generation method that can accurately identify identification target information even when a plurality of pieces of identification target information exist in a single piece of data, or when the identification target information is difficult to distinguish from other information. According to the present invention, it is also possible to provide an identification method and an identification device that use a classifier generated by such a classifier generation method.
 <Other Embodiments>
 The learning device and the learning system in each of the above embodiments may be realized as a single device, or may be realized as a combination of a plurality of information processing devices communicably connected to one another that execute the above-described processing; both forms are included in the embodiments of the present invention. The above-described processing may also be executed by a common server device or a server group. In that case, the common server device corresponds to the learning device according to the embodiment, and the server group corresponds to the learning system according to the embodiment. The plurality of devices constituting the learning device and the learning system only need to be able to communicate at a predetermined communication rate, and do not need to be located in the same facility or in the same country.
 Although the embodiments have been described in detail above, the present invention can also take the form of, for example, a system, a device, a method, a program, or a recording medium (storage medium). Specifically, it may be applied to a system composed of a plurality of devices (for example, a host computer, an interface device, an imaging device, a Web application, and the like), or to an apparatus composed of a single device.
 It goes without saying that the object of the present invention can also be achieved as follows. That is, a recording medium (or storage medium) on which the program code (computer program) of software that realizes the functions of the above embodiments is recorded is supplied to a system or an apparatus. Needless to say, such a storage medium is a computer-readable storage medium. The computer (or CPU or GPU) of the system or apparatus then reads out and executes the program code stored in the recording medium. In this case, the program code read out from the recording medium itself realizes the functions of the above embodiments, and the recording medium on which the program code is recorded constitutes the present invention.
 Although preferred embodiments of the present invention have been described in detail above, the present invention is not limited to these specific embodiments, and various modifications and changes are possible within the scope of the gist of the present invention described in the claims. For example, preprocessing and postprocessing may be added as appropriate.
 Embodiments obtained by appropriately combining the above embodiments are also included in the embodiments of the present invention.
 <<Third Embodiment>>
 (Background of the third embodiment)
 In recent years, many attempts have been made to process various kinds of data using deep learning and to obtain useful information, for example in image processing, audio processing, and text processing. Deep learning improves identification accuracy compared with conventional methods, and various efforts are being made to improve the identification accuracy further.
 Document 2-1 above describes a diagnosis support device that uses deep learning to support the diagnosis of a disease region. This technique performs highly accurate diagnosis by normalizing the color and luminance of an image in advance to separate diseased parts from non-diseased parts.
 Document 2-2 above discloses a technique for accurately identifying nodules from nodule candidate images by connecting a plurality of classifiers and learning while removing samples that are clearly normal. A set of classifiers connected in this way is called a cascade classifier and is a technique often used to improve identification accuracy.
 (Problem to be solved by the third embodiment)
 However, when a single piece of data contains a plurality of regions to be identified (hereinafter, identification target regions), or when it is difficult to distinguish the identification target regions from the other regions, the following problem arises: it is difficult to perform separation processing in advance as in Document 2-1 above, or to learn while removing samples that are clearly normal as in Document 2-2 above.
 (Method of generating a classifier)
 The classifier generation method according to the present embodiment is a method of generating a classifier for estimating identification target information in data. Specifically, it has at least a padding step (S3-102) of padding the training data in a training data set group, and a generation step (S3-103) of generating a classifier by performing learning using the padded training data set group (FIG. 21).
 Here, the training data set group includes at least a first training data set and a second training data set. The first and second training data sets contain training data. A piece of training data is composed of input data and teacher data corresponding to the input data. The second training data set contains a larger number of pieces of training data than the first training data set.
 In addition, the amount of identification target information in the input data contained in the first training data set is larger than the amount of identification target information in the input data contained in the second training data set.
 The present inventors found that when the first training data set and the second training data set are used for learning without the padding step, the identification target information cannot be identified accurately. This is considered to be because the amount of identification target information in the input data contained in the second training data set is small. That is, they found that when learning is performed with input data containing a small amount of identification target information, the inference tends to produce a result containing no identification target information even when the inference data does contain identification target information. Therefore, the training data of the first training data set, whose input data contain a large amount of identification target information, are padded so that the number of pieces of training data contained in the first training data set becomes equal to or larger than the number contained in the second training data set. In this way, the amount of input data containing a large amount of identification target information increases, and the identification target information can be identified accurately.
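 A minimal sketch of this padding step (illustrative only; NumPy is assumed, and augment stands for any of the transformations named later, such as rotation or flipping):

```python
import numpy as np

def pad_first_dataset(first_set, second_set, augment, rng=np.random.default_rng()):
    """Pad the first training data set with augmented copies until it contains at
    least as many (input, teacher) pairs as the second training data set."""
    padded = list(first_set)
    while len(padded) < len(second_set):
        base = padded[int(rng.integers(0, len(first_set)))]
        padded.append(augment(*base))   # new pair derived from a first-set pair
    return padded
```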
 A reception step (S3-101) of receiving the training data set group may also be provided.
 The terms used in the above description are explained below.
 (Data)
 In the present embodiment, data is a representation of information that is formalized so as to be suitable for transmission, interpretation, or processing, and that can be interpreted again as information. Examples of data include image data, sound data (such as voice data), and text data. When the data is image data, sound data, or text data, the input data is input image data, sound input data, or input text data, respectively.
 (Identification target information)
 In the present embodiment, identification target information is the information to be identified in the data. When the data is image data, the identification target information is, for example, at least one of the position, area, and distribution of an identification target region in the image data. A classifier generated by the generation method according to the present embodiment can estimate and extract an identification target region in image data that is difficult for a user to extract visually. When the data is image data, the amount of identification target information can be the number of pixels contained in the identification target region.
 When the data is sound data, the identification target information is, for example, at least one of the frequency and the intensity of the sound to be identified (identification target sound) in the sound data. A classifier generated by the generation method according to the present embodiment can estimate and extract the identification target sound from noisy sound data that is difficult for a user to extract.
 When the sound data contains the voices of a plurality of speakers, the voice data of at least one speaker can be used as the identification target information.
 When the data is text data, the identification target information is, for example, information on the characters or character strings to be identified in the text data, or on their number. A classifier generated by the generation method according to the present embodiment can estimate and extract an identification target character string in text data that is difficult for a user to extract.
 (学習データ)
 本実施形態における学習データは、識別器を生成するための学習用データであり、入力データと、入力データに対する教師データで構成される。入力データが画像データ(入力画像データ)である場合、教師データは、画像データに識別対象情報を付帯させたものとなる。例えば、画像データにおいて識別対象領域が示されたものである。
(Learning data)
The learning data in the present embodiment is learning data for generating a discriminator, and is composed of input data and teacher data for the input data. When the input data is image data (input image data), the teacher data is the image data with the identification target information attached. For example, the identification target area is shown in the image data.
 (Amount of identification target information)
 In the present embodiment, the amount of identification target information contained in input data is, for example, the ratio of the identification target region to the image data when the input data is image data. That is, a large amount of identification target information means, for example, that the identification target region occupies a large proportion of the image data. When the input data is sound data, a large amount of identification target information means that the intensity of the identification target sound in the sound data is high, or, when the sound data is voice data of a plurality of speakers, that the number of speakers to be extracted is large.
 When the input data is text data, a large amount of identification target information means, for example, that the number of characters or character strings to be identified in the text data is large.
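 As a non-limiting illustration, the amount of identification target information of image data could be computed as the ratio of identification-target pixels to all pixels in the teacher image. The following Python sketch assumes the teacher image is a binary mask in which target pixels are non-zero; that encoding is an assumption and is not specified in the present embodiment.

```python
import numpy as np

def target_info_amount(label_image: np.ndarray) -> float:
    """Ratio of identification-target pixels to all pixels in a teacher image.

    Assumes the teacher (correct-answer) image is a mask in which target
    pixels are non-zero; this encoding is an assumption of the sketch.
    """
    total = label_image.size
    target = int(np.count_nonzero(label_image))
    return target / total
```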
 (Learning data set)
 The learning data set in the present embodiment includes the above-described learning data. The number of pieces of learning data contained in the second learning data set is larger than the number of pieces of learning data contained in the first learning data set.
 The learning data set group in the present embodiment includes at least a first learning data set and a second learning data set, and may include three or more learning data sets.
 (Data padding)
 In the present embodiment, data padding means generating new input data and increasing the number of pieces of input image data by performing, for example, at least one of rotation, inversion, luminance conversion, distortion addition, enlargement, and reduction. Data padding can also be called data augmentation. When the input data is sound, new sound input data can be generated for padding by adding to the input data a sound obtained by combining sounds of one or more frequencies.
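 As a hedged illustration of such padding for image data, the following Python sketch produces one new input/teacher pair by rotation, inversion, and a simple luminance conversion. The specific operations chosen and the use of numpy are assumptions of the sketch, not requirements of the embodiment.

```python
import numpy as np

def augment_pair(image: np.ndarray, label: np.ndarray, rng: np.random.Generator):
    """Produce one padded (augmented) input/teacher pair.

    The same geometric transform is applied to the input image and to the
    teacher image so that the identification target region stays aligned;
    luminance conversion is applied to the input image only.
    """
    k = int(rng.integers(0, 4))            # rotation by 0/90/180/270 degrees
    image, label = np.rot90(image, k), np.rot90(label, k)
    if rng.random() < 0.5:                 # horizontal inversion
        image, label = np.fliplr(image), np.fliplr(label)
    gamma = rng.choice([1.2, 1 / 1.2])     # simple luminance conversion
    image = np.clip((image / 255.0) ** gamma * 255.0, 0, 255).astype(image.dtype)
    return image, label
```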
 (Classifier generation device)
 The classifier generation device according to the present embodiment is a device for generating a classifier for estimating identification target information in data. Specifically, it has at least a padding unit 3-22 that pads the learning data of the learning data set group, and a generation unit 3-23 that generates a classifier by performing learning using the padded learning data set group (Fig. 22).
 Here, the learning data set group includes at least a first learning data set and a second learning data set. The first and second learning data sets contain learning data, and each piece of learning data is composed of input data and teacher data corresponding to that input data. The second learning data set contains a larger number of pieces of learning data than the first learning data set.
 Further, the amount of identification target information contained in the input data of the first learning data set is larger than the amount of identification target information contained in the input data of the second learning data set; one possible way of forming such data sets is sketched below.
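 The following Python sketch is one assumed way of forming a first and a second learning data set from an initial collection of input/teacher pairs so that the above relation holds; the 30% cut-off and the use of the target pixel count as the amount of identification target information are illustrative assumptions.

```python
from typing import List, Tuple
import numpy as np

Pair = Tuple[np.ndarray, np.ndarray]  # (input image, teacher image)

def split_data_sets(pairs: List[Pair], ratio: float = 0.3) -> Tuple[List[Pair], List[Pair]]:
    """Form a first and a second learning data set from an initial data set.

    Pairs are ordered by the amount of identification target information
    (here: number of target pixels in the teacher image, descending).  The
    first set takes the top fraction, so it is smaller but richer in target
    information; the second set is the whole initial data set, so it is
    larger and may include the first set.
    """
    ordered = sorted(pairs, key=lambda p: int(np.count_nonzero(p[1])), reverse=True)
    first = ordered[: max(1, int(len(ordered) * ratio))]
    second = list(ordered)
    return first, second
```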
 The generation device according to the present embodiment can be configured such that the acquisition unit 3-21 acquires the learning data set group in response to an operation of the operation unit 3-150, and can be configured to transmit and receive data to and from the data server 3-120.
 Since the components and terminology of the classifier generation device according to the present embodiment are the same as those of the classifier generation method, their description is omitted.
 (Classifier)
 The classifier according to the present embodiment is generated by the generation method or the generation device according to the present embodiment described above, and can accurately infer the identification target information contained in input inference data.
 (Information processing device, information processing method)
 The information processing device according to the present embodiment includes the above classifier and has an inference unit that infers, using the classifier, the identification target information contained in inference data. Similarly, the information processing method according to the present embodiment uses the above classifier and has an inference step of inferring the identification target information contained in inference data.
 Hereinafter, the classifier generation method and generation device according to the embodiments of the present invention will be described in detail with specific examples.
 (Embodiment 3-1)
 (Overview)
 The classifier generation method or generation device according to the present embodiment will be described using an example in which the data is image data.
 First, learning data is prepared that consists of a learning input image (input data or input image data) and a learning correct-answer image (teacher data) in which the identification target region (identification target information) is colored with a fixed color, and learning is performed using this data to create a trained model.
 In the following description, a case where the image region of a resin in a transmission electron microscope (TEM) image is the target of image processing will be described, but the scope of application of the present embodiment is not limited to this detection target or to this type of image acquisition method. A specific device configuration, functional configuration, and processing flow are described below.
 (Device configuration)
 Based on FIG. 23, a region identification system (classifier generation system) 3-190 composed of the classifier learning device (classifier generation device) 3-100 according to Embodiment 3-1 of the present invention and the devices connected to the classifier learning device 3-100 will be described in detail.
 The region identification system 3-190 has a data input device 3-110 that captures images for learning, and a data server 3-120 that stores the captured images. It also has a data processing device 3-130 for coloring image regions identified by the user, and a classifier learning device 3-100 that trains the classifier. It further has a display unit 3-140 that displays learning results and frequency distributions, and an operation unit 3-150 with which the user inputs operation instructions to the classifier learning device. At the time of learning, the classifier learning device 3-100 acquires learning input images and learning correct-answer images, performs learning, and outputs a trained model.
 Inference can then be performed using the classifier generated by the classifier learning device 3-100. At the time of inference, an inference input image is acquired, the generated trained model is used to extract the identification region in the input image, and the entire region or its boundary is colored with a fixed color and output as an inferred image.
 Each part will be described below. The classifier learning device 3-100 has at least a CPU 3-31, a communication IF 3-32, a ROM 3-33, a RAM 3-34, a storage unit 3-35, and a common bus 3-36. The CPU 3-31 integrally controls the operation of each component of the classifier learning device 3-100.
 Under the control of the CPU 3-31, the classifier learning device 3-100 may also control the operation of the data input device 3-110. The data server 3-120 holds images captured by the data input device 3-110. The communication IF (Interface) 3-32 is realized by, for example, a LAN card, and handles communication between an external device (for example, the data server 3-120) and the classifier learning device 3-100. The ROM 3-33 is realized by a non-volatile memory or the like, stores the control program executed by the CPU 3-31, and provides a work area when the CPU 3-31 executes the program. The RAM (Random Access Memory) 3-34 is realized by a volatile memory or the like and temporarily stores various information. The storage unit 3-35 is realized by, for example, an HDD (Hard Disk Drive) or the like, and stores various application software including an operating system (OS), device drivers for peripheral devices, and a program for performing the region identification according to the present embodiment described later. The operation unit 3-150 is realized by, for example, a keyboard and a mouse, and inputs instructions from the user into the device. The display unit 3-140 is realized by, for example, a display, and presents various information to the user. The operation unit 3-150 and the display unit 3-140 provide a GUI (Graphical User Interface) function under the control of the CPU 3-31. The display unit 3-140 may be a touch panel monitor that accepts operation input, and the operation unit 3-150 may be a stylus pen. The above components are communicably connected to one another by the common bus 3-36.
 The data input device 3-110 is, for example, a scanning electron microscope (SEM), a transmission electron microscope (TEM), an optical microscope, a digital camera, or a smartphone. The data input device 3-110 transmits acquired images to the data server 3-120. An imaging control unit (not shown) that controls the data input device 3-110 may be included in the classifier learning device 3-100.
 (Functional configuration)
 Next, the functional configuration of the region identification system including the classifier learning device 3-100 according to the present embodiment will be described with reference to FIG. 24. The functions of the units shown in FIG. 24 are realized by the CPU 3-31 executing the program stored in the ROM 3-33. The program may be executed by one or more CPUs, and the ROM storing the program may likewise be one or more memories. Another processor such as a GPU (Graphics Processing Unit) may be used instead of, or together with, the CPU. That is, the functions of the units shown in FIG. 24 are realized by at least one processor (hardware) executing a program stored in at least one memory communicably connected to that processor.
 As its functional configuration, the classifier learning device 3-100 has a reception unit 3-41, an acquisition unit 3-42, a frequency distribution calculation unit 3-44, a data expansion unit 3-45, a learning unit 3-46, a storage unit 3-47, and a display control unit 3-48, and may further have an extraction unit 3-43. The classifier learning device 3-100 is communicably connected to the data server 3-120 and the display unit 3-140.
 The reception unit 3-41 receives data expansion conditions input by the user via the operation unit 3-150. In other words, the operation unit 3-150 corresponds to an example of a reception means that receives the setting of the expansion conditions and the patch size (described later). The expansion conditions include at least one of the number of bins of the frequency distribution (described later), the bin width, and the augmentation method (described later). A bin is one of the mutually disjoint intervals or classes of a frequency distribution (histogram).
 The acquisition unit 3-42 acquires, from the data server 3-120, a plurality of pieces of learning data (which can also be called learning data pairs), each composed of a learning input image and a learning correct-answer image.
 When the extraction unit 3-43 is provided, it extracts a plurality of small-region (data block) pairs from each of the learning input image and the learning correct-answer image based on the patch size received by the reception unit 3-41.
 The frequency distribution calculation unit 3-44 calculates the area or the number of pixels of the extraction region for each of the plurality of learning correct-answer images or, when extracted data block groups are available, for each data block extracted from the learning correct-answer images. It then creates a frequency distribution whose characteristic value is the calculated area or pixel count, using the number of bins and the bin width received by the reception unit 3-41; a sketch of this computation follows.
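 The following Python sketch shows one assumed way of building such a frequency distribution; the binary-mask encoding of the teacher patches and the bin edges starting at zero are assumptions of the sketch.

```python
import numpy as np

def build_frequency_distribution(label_blocks, n_bins, bin_width):
    """Histogram of target-region pixel counts over teacher data blocks.

    label_blocks: list of teacher (correct-answer) patches as arrays in which
    target pixels are non-zero (this encoding is an assumption of the sketch).
    n_bins and bin_width correspond to the expansion conditions received from
    the user; counts beyond the last bin edge are not tallied.
    """
    counts = np.array([np.count_nonzero(b) for b in label_blocks])
    edges = np.arange(n_bins + 1) * bin_width
    freq, _ = np.histogram(counts, bins=edges)
    return counts, edges, freq
```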
 The data expansion unit 3-45 expands the data of the learning input images and the learning correct-answer images based on the created frequency distribution and the instruction to execute augmentation received by the reception unit 3-41.
 The learning unit 3-46 performs learning based on the above teacher data and creates a trained model. The storage unit 3-47 stores the trained model. The display control unit 3-48 then uses the display unit 3-140 to output information on the frequency distribution and the learning result.
 On the other hand, for inference from inference input data, a start command for the inference operation input by the user via the operation unit 3-150 is received.
 The acquisition unit 3-42 acquires an inference image from the data server 3-120.
 An inference unit (not shown) performs inference based on the trained model 3-49. The display control unit 3-48 then outputs the inference result using the display unit 3-140.
 At least some of the units of the classifier learning device 3-100 may be realized as independent devices. The classifier learning device 3-100 may be a workstation. The function of each unit may be realized as software running on a computer, and the software realizing the function of each unit may run on a server via a network such as the cloud. In the present embodiment described below, each unit is assumed to be realized by software running on a computer installed in a local environment.
 (Processing flow)
 Next, a classifier generation method according to Embodiment 3-1 of the present invention will be described. FIG. 25 shows the processing procedure executed by the classifier learning device 3-100 of the present embodiment. The present embodiment is realized by the CPU 3-31 executing the programs, stored in the ROM 3-33, that realize the functions of the respective units. In the present embodiment, the processing target image is described as a TEM image, which is acquired as a two-dimensional grayscale image. The identification target contained in the image is described as an example of a processing target object contained in the processing target image.
 First, the learning-related processing of steps S3-201 to S3-207 will be described.
 In step S3-201, the reception unit 3-41 receives the data expansion conditions input by the user via the operation unit 3-150. The data expansion conditions in the present embodiment include at least one of the number of bins of the frequency distribution to be created, the bin width, and the augmentation method.
 In step S3-202, the acquisition unit 3-42 acquires a learning data pair consisting of a learning input image and a learning correct-answer image from the data server 3-120. The learning input image and the learning correct-answer image used here can be exactly the same image pair, except that the whole of the extraction region or its boundary is colored in the correct-answer image.
 When the extraction unit 3-43 is provided, in step S3-202b, small-region (data block) pairs are extracted from the learning input image and the learning correct-answer image according to the patch size. Here, the patch size is the number of vertical and horizontal pixels of the cropped image when a part is cropped from the target image. Each extracted data block pair is extracted from the same coordinates on the two images, as in the sketch below.
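 A minimal Python sketch of step S3-202b follows; drawing crop positions at random is an assumption, since the embodiment does not specify how positions are chosen.

```python
import numpy as np

def extract_block_pairs(input_img, label_img, patch, n_blocks, rng):
    """Crop paired data blocks at identical coordinates.

    input_img and label_img are assumed to be arrays of the same height and
    width; patch is the patch size (e.g. 128).
    """
    h, w = input_img.shape[:2]
    pairs = []
    for _ in range(n_blocks):
        y = int(rng.integers(0, h - patch + 1))
        x = int(rng.integers(0, w - patch + 1))
        pairs.append((input_img[y:y + patch, x:x + patch],
                      label_img[y:y + patch, x:x + patch]))
    return pairs
```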
 In step S3-203, the frequency distribution calculation unit 3-44 calculates the area of the extraction region for each of the plurality of learning correct-answer images or, when the extraction unit 3-43 is provided, for each data block extracted from the learning correct-answer images, and creates a frequency distribution using this area value as the characteristic value.
 In step S3-204, the data expansion unit 3-45 expands the data of the learning input images and the learning correct-answer images based on the frequency distribution and the instruction to execute augmentation received by the reception unit 3-41. Specifically, in addition to image rotation, methods called augmentation, such as inversion, enlargement, reduction, distortion addition, and luminance change, are used to increase the learning input images and learning correct-answer images so that the generated images are contained in the same frequency distribution. By this method, teacher data is generated in which the frequency of bins containing a larger identification target region is higher than the frequency of bins containing a smaller identification target region.
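 One assumed way of deciding how many augmented copies to create per data block so that bins with a larger identification target region end up with higher frequencies is sketched below in Python; the monotone target profile used here is only an illustration, and the concrete augmentation operations are described next.

```python
import numpy as np

def augmentation_multipliers(counts, edges, freq):
    """Per-block number of extra augmented copies (sketch for step S3-204).

    counts, edges, freq come from the frequency-distribution step.  The
    target profile chosen here (bin frequency growing with the bin index)
    is only one way to satisfy the condition that bins with more
    identification target region become more frequent than bins with less.
    """
    n_bins = len(freq)
    peak = max(int(freq.max()), 1)
    target = np.ceil(peak * (np.arange(1, n_bins + 1) / n_bins)).astype(int)
    bin_of = np.clip(np.digitize(counts, edges) - 1, 0, n_bins - 1)
    multipliers = []
    for b in bin_of:
        members = max(int((bin_of == b).sum()), 1)
        shortfall = max(int(target[b]) - int(freq[b]), 0)
        multipliers.append(shortfall // members)
    return multipliers
```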
 Augmentation performs, for example, rotation, inversion, enlargement, and reduction, and each operation can be carried out as follows. A blank (white) image ten times the patch size in height and width is prepared in advance, and the image to be rotated is placed at its center. An affine transformation is then applied to each coordinate according to Equation (3-1) and Table 1, where x and y denote the coordinates before transformation and x' and y' denote the coordinates after transformation. The rotation angle θ is usually set between 30° and 330°, and a and d, the enlargement/reduction ratios in the vertical and horizontal directions, are usually set between 0.1 and 10. Finally, the center is cropped at the patch size to obtain the augmented image.
 [Equation (3-1): affine transformation of the coordinates (x, y) to (x', y'), shown as an image in the original]
 [Table 1, shown as an image in the original]
 As an example of distortion, an arbitrary value is added to the x coordinate to translate it, and this value is varied according to the y coordinate. The maximum of this value is usually best set between 20% and 60% of the patch size in the X direction.
 As an example of luminance change, gamma correction can be used; the gamma value in this case is usually preferably 1.2 or more, or 1/1.2 or less.
 Furthermore, linear interpolation may be applied to the augmented image, which smooths mosaic-like, jagged-looking images; a sketch combining these operations follows.
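 The following Python sketch follows the canvas-and-crop procedure described above for a square grayscale patch, combining the affine transformation with gamma correction and linear interpolation; the use of scipy.ndimage.affine_transform and the particular parameter values are assumptions of the sketch, not requirements of the embodiment.

```python
import numpy as np
from scipy import ndimage

def affine_augment(patch_img, theta_deg=45.0, a=1.2, d=0.8, gamma=1.2):
    """One augmented copy following the canvas-and-crop procedure in the text.

    The patch is placed at the centre of a white canvas ten times the patch
    size, rotated by theta and scaled by (a, d) about the canvas centre,
    then the centre is cropped back to the patch size.  Gamma correction is
    applied as the luminance change; order=1 gives linear interpolation.
    """
    p = patch_img.shape[0]
    canvas = np.full((10 * p, 10 * p), 255, dtype=np.float64)
    off = (10 * p - p) // 2
    canvas[off:off + p, off:off + p] = patch_img

    t = np.deg2rad(theta_deg)
    rot = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
    scale = np.diag([a, d])
    # affine_transform maps output coordinates to input coordinates,
    # so the inverse of (scale @ rot) is used, anchored at the canvas centre.
    m = np.linalg.inv(scale @ rot)
    centre = np.array(canvas.shape) / 2.0
    offset = centre - m @ centre
    warped = ndimage.affine_transform(canvas, m, offset=offset, order=1, cval=255)

    out = warped[off:off + p, off:off + p]                  # crop the centre
    out = np.clip((out / 255.0) ** gamma * 255.0, 0, 255)   # gamma correction
    return out.astype(patch_img.dtype)
```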
 In step S3-205, the learning unit 3-46 generates a trained model 3-49 by performing machine learning according to a predetermined algorithm using the learning teacher data. In the present embodiment, U-Net is preferably used as the predetermined algorithm; since the U-Net learning method is a well-known technique, its detailed description is omitted here. As the predetermined algorithm, for example, an SVM (Support Vector Machine), a DNN (Deep Neural Network), a CNN (Convolutional Neural Network), or the like may also be used. As algorithms used for semantic segmentation, which classifies each pixel into a class, FCN (Fully Convolutional Network), SegNet, and the like can also be used in addition to U-Net. Furthermore, an algorithm combining the above with a so-called generative model such as GAN (Generative Adversarial Networks) may be used.
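 As a hedged sketch of step S3-205, the following Python/PyTorch training loop assumes a segmentation network (such as a U-Net) constructed elsewhere and binary teacher masks; the loss, optimizer, batch size, and epoch count are illustrative assumptions, not values given in the embodiment.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

def train_segmentation(model: nn.Module, inputs, labels, epochs=50, lr=1e-3):
    """Minimal training sketch.

    model is assumed to be a segmentation network built elsewhere; inputs and
    labels are float tensors of shape (N, 1, H, W) with labels in {0, 1}.
    """
    loader = DataLoader(TensorDataset(inputs, labels), batch_size=8, shuffle=True)
    criterion = nn.BCEWithLogitsLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
    return model  # the trained model corresponding to 3-49
```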
 In step S3-206, the storage unit 3-47 stores the trained model.
 In step S3-207, the display control unit 3-48 uses the display unit 3-140 to output the frequency distribution, information on the learning, and the like.
 On the other hand, since the inference processing is largely the same as the learning processing, it is described only briefly below. First, processing (not shown) similar to step S3-201 is performed, except that the information received by the reception unit is an inference start command instead of the data expansion conditions.
 Next, processing (not shown) similar to step S3-202 is performed, except that inference input data is acquired by the acquisition unit instead of a learning data pair.
 Further, in a step (not shown), the region is inferred from the inference input data using the result of the above learning, with the same algorithm as in the learning processing.
 Finally, processing (not shown) similar to step S3-207 is performed, except that the inference result is output instead of the frequency distribution and the information on the learning.
 The above processing can improve the inference accuracy.
 <Other Embodiments>
 The data handled in Embodiment 3-1 can be audio data instead of images, and the input device can be a microphone. By adapting the method to audio data, for example by using the amount of difference between the learning input data and the learning correct-answer data instead of the area, it can also be used for audio processing such as speaker identification and noise cancellation in audio data.
 For example, in one example of speaker recognition, the whole audio is separated into frequency bands, and teacher correct-answer data is created by rewriting, to a constant volume, the sound of the frequency components characteristic of the speaker to be identified. Learning based on this makes it possible to identify which audio components in the whole audio are the voice of the speaker to be identified. Using this classifier, clarification that extracts only a specific speaker's voice from the whole audio becomes possible, as in the sketch below.
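 The following Python sketch is one assumed way of building such teacher correct-answer data with an STFT; the frequency band characteristic of the target speaker and the constant level chosen are assumptions, and scipy.signal is used only for illustration.

```python
import numpy as np
from scipy import signal

def make_speaker_teacher(audio, fs, band_hz, nperseg=1024):
    """Teacher data sketch for the speaker-recognition example.

    The signal is split into frequency bands with an STFT, and the magnitude
    of the bands characteristic of the target speaker (band_hz, a (low, high)
    tuple assumed to be known) is rewritten to a constant level; everything
    else is left unchanged.
    """
    f, t, Z = signal.stft(audio, fs=fs, nperseg=nperseg)
    mask = (f >= band_hz[0]) & (f <= band_hz[1])
    const_level = np.abs(Z).max()                 # "constant volume" level
    Z[mask, :] = const_level * np.exp(1j * np.angle(Z[mask, :]))
    _, teacher = signal.istft(Z, fs=fs, nperseg=nperseg)
    return teacher
```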
 On the other hand, in an example of noise cancellation, the same method can identify which audio components in the whole audio are unnecessary sound, that is, noise. Using this classifier, clarification that removes noise from the whole audio becomes possible.
 The processing is the same as in Embodiment 3-1, except that the data expansion method is an increase or decrease of the volume, frequency, or speed.
 The classifier learning device and the region identification system in each of the above embodiments may be realized as a single device, or may be realized by combining devices including a plurality of information processing devices so that they can communicate with one another and execute the above processing; both forms are included in the embodiments of the present invention. The above processing may also be executed by a common server device or server group. In that case, the common server device corresponds to the classifier learning device according to the embodiment, and the server group corresponds to the region identification system according to the embodiment. The plurality of devices constituting the classifier learning device and the region identification system only need to be able to communicate at a predetermined communication rate, and need not be located in the same facility or in the same country.
 Although the embodiments have been described in detail above, the present invention can take the form of, for example, a system, a device, a method, a program, or a recording medium (storage medium). Specifically, it may be applied to a system composed of a plurality of devices (for example, a host computer, an interface device, an imaging device, a Web application, and the like), or to an apparatus consisting of a single device.
 Needless to say, the object of the present invention is also achieved as follows. That is, a recording medium (or storage medium) on which the program code (computer program) of software realizing the functions of the above embodiments is recorded is supplied to a system or an apparatus. The storage medium is, of course, a computer-readable storage medium. The computer (or CPU or GPU) of the system or apparatus then reads and executes the program code stored in the recording medium. In this case, the program code read from the recording medium itself realizes the functions of the above embodiments, and the recording medium on which the program code is recorded constitutes the present invention.
 Although preferred embodiments of the present invention have been described in detail above, the present invention is not limited to these specific embodiments, and various modifications and changes are possible within the scope of the gist of the present invention described in the claims. For example, preprocessing and postprocessing may be added as appropriate.
 Embodiments obtained by appropriately combining the above embodiments are also included in the embodiments of the present invention.
 Hereinafter, the third embodiment of the present invention will be described in more detail with reference to Examples and Comparative Examples. The present invention is not limited to the following Examples.
 (Example 1)
 In this Example, the classifier learning device of the embodiment of the present invention was used to grasp the amount of magenta pigment in cross-sectional TEM images of a color toner.
 (Toner preparation)
 A pulverized toner containing a magenta pigment was obtained according to a conventional method. As methods for obtaining a pulverized toner, those described in JP 2010-140062 A and JP 2003-233215 A can be used.
 (TEM observation of toner)
 Cross-sectional observation of the toner with a transmission electron microscope (TEM) can be performed as follows.
 Using an osmium plasma coater (filgen, OPC80T), an Os film (5 nm) and a naphthalene film (20 nm) were applied to the toner as protective films, and the toner was embedded in a photocurable resin D800 (JEOL). Then, toner cross sections with a thickness of 60 nm (or 70 nm) were prepared at a cutting speed of 1 mm/s with an ultrasonic ultramicrotome (Leica, UC7).
 The prepared toner cross sections were observed using a TEM (JEOL, JEM2800). FIG. 26 shows an example of a learning input image cropped from a TEM image of the toner.
 (Learning correct-answer images)
 Learning correct-answer images were created from the 18 TEM images. The image processing software Photoshop 5.5 from Adobe Systems was used for this, and the magenta pigment portions were colored black (luminance 0/256). FIG. 26 shows an example of a colored learning correct-answer image.
 With a patch size of 128 × 128, 100 small regions (data blocks) were cropped at identical positions from each learning input image and each learning correct-answer image, giving a total of 1800 pairs.
 Next, for the 1800 learning correct-answer images, a frequency distribution was created using the number of pixels of the colored region as the characteristic value, with the number and width of the bins set to the values shown in Table 1.
 In this Example, data expansion such as rotation, inversion, enlargement, reduction, distortion addition, and luminance change was performed under the conditions shown in Table 2, and learning was performed to create a classifier.
 [Table 2, shown as an image in the original]
 (Example 2)
 This Example was the same as Example 1 except for the data expansion conditions, and a classifier was created by learning with the same TEM images for measuring the amount of magenta pigment in the toner. As shown in Table 1, the data expansion conditions were such that the larger the number of pixels in the target region, the higher the frequency.
 (Comparative Example 1)
 In this Comparative Example, a classifier was created by learning with the same TEM images for measuring the amount of magenta pigment in the toner, in the same manner as in Example 1 except that no data expansion was performed.
 (Example 3)
 In this Example, the classifier learning device of the embodiment of the present invention was used to identify automobile regions in aerial photographs in order to count the number of cars in a city. The images used were four aerial photographs of the city of Potsdam obtained from https://gdo152.llnl.gov/cowc/ (as of October 2019).
 As shown in FIG. 27, learning correct-answer images were created by coloring in the same manner as in Example 1.
 In this Example, data expansion such as rotation, inversion, enlargement, reduction, distortion addition, and luminance change was performed under the conditions shown in Table 3, and learning was performed to create a classifier.
 [Table 3, shown as an image in the original]
 (Comparative Example 2)
 In this Comparative Example, a classifier was created by learning to count the number of cars in the city from the same aerial photographs, in the same manner as in Example 3 except that no data expansion was performed.
 (Evaluation criteria)
 The effect (identification accuracy) of the classifiers generated by the classifier learning devices according to the Examples will now be described. In this embodiment, IoU (Intersection over Union), defined by Equation (3-2), was used as the evaluation index for measuring the effect.
 IoU = TP / (TP + FP + FN)   (3-2)
 Here, TP (True Positive) is the number of pixels that are magenta pigment and were determined to be magenta pigment. FP (False Positive) is the number of pixels that are not magenta pigment but were determined to be magenta pigment (false detections), and FN (False Negative) is the number of pixels that are magenta pigment but were determined not to be magenta pigment (missed detections). The results of Examples 1 and 2 and Comparative Example 1 are compared in Table 4, and the results of Example 3 and Comparative Example 2 are compared in Table 5.
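 A minimal Python sketch of Equation (3-2) for binary masks follows; the boolean-mask encoding and the convention of returning 1.0 when the denominator is zero are assumptions of the sketch.

```python
import numpy as np

def iou(pred_mask: np.ndarray, true_mask: np.ndarray) -> float:
    """IoU of Equation (3-2) computed from binary masks.

    True marks the identification target (e.g. magenta pigment) in both masks.
    """
    tp = np.logical_and(pred_mask, true_mask).sum()
    fp = np.logical_and(pred_mask, ~true_mask).sum()
    fn = np.logical_and(~pred_mask, true_mask).sum()
    denom = tp + fp + fn
    return float(tp) / denom if denom else 1.0
```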
 [Table 4, shown as an image in the original]
 [Table 5, shown as an image in the original]
 In all cases, the IoU values of the examples according to the embodiments of the present invention were larger than the IoU values of the comparative examples, confirming that the identification accuracy was improved. That is, the classifiers generated by the generation method according to the above Examples can identify the identification target information (the magenta pigment regions and the automobile regions) with high accuracy, because the number of input data having many pixels in those regions is increased.
 As described above, the classifier generation method according to the third embodiment of the present invention can generate a classifier with high inference accuracy.
 <<Fourth Embodiment>>
 (Overview)
 The fourth embodiment of the present invention is a combination of the first, second, and third embodiments of the present invention described above, and combining them provides a further improvement in identification accuracy. An example of the fourth embodiment is described in detail with reference to the drawings. Descriptions of configurations, functions, and operations that are the same as those of the first to third embodiments are omitted, and mainly the differences from those embodiments are described.
 In the present embodiment, an example in which the processing target image is a TEM image is described. The TEM image is acquired as a two-dimensional grayscale image. Carbon black in a coating film of a melamine-alkyd resin paint is described as an example of the identification target. In the present embodiment, of an initial data set including 25 images (50 pairs) of size 1280 × 960, 20 images (40 pairs) were used for learning and 5 images (10 pairs) were used for evaluation. At the time of learning, 2000 pieces of input data were generated by cropping 100 images with a patch size of 128 × 128 from each learning image. The ratio of the maximum area to the minimum area of the identification target in one cropped image was 30 to 120, and the amount of the identification target in one cropped image was 0 to 16384 pixels.
 For the evaluation, as in Embodiment 2-1, IoUavg was used, which was obtained by calculating the IoU value for each evaluation image and averaging the values.
 (Embodiment 4-1)
 In the processing flow, the learning processing part was the same as in Embodiment 2-2, and the inference processing part was the same as in Embodiment 1-1.
 With the conventional method, IoUavg = 0.0, whereas with an average inference count of 30, IoUavg = 0.85.
 (Embodiment 4-2)
 In the processing flow, the learning processing part was the same as in Embodiment 3-1, and the inference processing part was the same as in Embodiment 1-1.
 With the conventional method, IoUavg = 0.0, whereas with an average inference count of 30, IoUavg = 0.81.
 (Embodiment 4-3)
 In the learning processing part of the processing flow, as described in Example 1 of the third embodiment, images of low frequency were data-expanded up to the same frequency as images of high frequency, and then, as described in Embodiment 2-1, learning was performed in two stages. The inference processing part was the same as in Embodiment 1-1.
 With the conventional method, IoUavg = 0.0, whereas with an average inference count of 30, IoUavg = 0.87.
 Table 6 shows a list of the processing contents, identification targets, and evaluation values of the above embodiments.
 [Table 6, shown as an image in the original]
 The present invention is not limited to the above embodiments, and various changes and modifications are possible without departing from the spirit and scope of the present invention. Therefore, the following claims are attached in order to make the scope of the present invention public.
 This application claims priority based on Japanese Patent Application No. 2019-199099 filed on October 31, 2019, and Japanese Patent Applications No. 2019-217334 and No. 2019-217335 filed on November 29, 2019, the entire contents of which are incorporated herein by reference.

Claims (46)

  1.  An image processing device that acquires information on a specific region in an image based on inference, comprising
     an information acquisition means for acquiring the information on the specific region inferred by inputting, into a trained model, each of pieces of information on a plurality of regions of interest extracted from the image based on a predetermined inference condition,
     wherein the plurality of regions of interest include a first region of interest and a second region of interest, and the first region of interest and the second region of interest each have a region overlapping each other and a region not overlapping each other.
  2.  The image processing device according to claim 1, wherein the size of the first region of interest and the size of the second region of interest are the same as each other.
  3.  The image processing device according to claim 1 or 2, further comprising a reception unit that accepts setting of the inference condition.
  4.  The image processing device according to any one of claims 1 to 3, wherein the information acquisition means has an extraction unit that extracts the plurality of regions of interest from the image based on the inference condition received by the reception unit.
  5.  The image processing device according to claim 4, wherein the extraction unit extracts the plurality of regions of interest using random numbers.
  6.  The image processing device according to claim 4 or 5, wherein the information acquisition means has an information acquisition unit that acquires a plurality of inference results by inputting each of the plurality of regions of interest extracted by the extraction unit into the trained model, and acquires the information on the specific region based on the plurality of inference results.
  7.  The image processing device according to any one of claims 1 to 6, wherein the inference condition includes at least one of the average number of inferences performed per pixel of the image, a threshold of the ratio of the number of times a region of interest is inferred to be the specific region to the number of times the region of interest is inferred, and the size of the regions of interest.
  8.  The image processing device according to any one of claims 1 to 7, wherein the image includes a plurality of the specific regions, and the areas of the plurality of specific regions have a distribution.
  9.  The image processing device according to any one of claims 1 to 8, wherein the image includes a plurality of the specific regions, and the ratio of the maximum value of the areas of the plurality of specific regions to the minimum value of the areas of the plurality of specific regions is 50 or more.
  10.  The image processing device according to any one of claims 1 to 9, wherein the image includes a plurality of the specific regions, and the ratio of the maximum value to the minimum value of the areas of the plurality of specific regions is 100 or more.
  11.  The image processing device according to any one of claims 1 to 10, further comprising a display control unit that causes a display unit to display the information on the specific region, based on the information on the specific region, such that the display mode of the specific region in the image differs from the display mode of regions other than the specific region.
  12.  The image processing device according to any one of claims 1 to 11, wherein the trained model is obtained by learning, as teacher data, images for which the information on the specific region is known.
  13.  The image processing device according to any one of claims 1 to 12, wherein the image is an image captured by any one of a scanning electron microscope, a transmission electron microscope, and an optical microscope.
  14.  The image processing device according to any one of claims 1 to 13, wherein the image is an image including an image of a first material and an image of a second material different from the first material.
  15.  The image processing device according to claim 14, wherein the information on the specific region includes at least one of the position of the image of the second material in the image and the size of the image of the second material.
  16.  The image processing device according to any one of claims 1 to 15, wherein the information on a region of interest includes at least one of the position and the size of the region extracted from the image.
  17.  A method of controlling an image processing device that acquires information on a specific region in an image based on inference, comprising an information acquisition step of acquiring the information on the specific region inferred by inputting, into a trained model, each of pieces of information on a plurality of regions of interest extracted from the image based on a predetermined inference condition,
     wherein the plurality of regions of interest include a first region of interest and a second region of interest, and the first region of interest and the second region of interest each have a region overlapping each other and a region not overlapping each other.
  18.  A method of generating a classifier for identifying identification target information in data, comprising:
     a first learning step of performing learning using a first learning data set of an initial data set containing a plurality of pieces of learning data created from the data; and
     a second learning step of updating the information contained in the classifier generated by the learning in the first learning step, by performing learning using a second learning data set of the initial data set,
     wherein the amount of the identification target information contained in the first learning data set is larger than the amount of the identification target information contained in the second learning data set.
  19.  The method of generating a classifier according to claim 18, wherein the second learning data set includes the first learning data set.
  20.  The method of generating a classifier according to claim 18 or 19, wherein the amount of the identification target information contained in each of the first learning data set and the second learning data set is a value obtained by dividing the total amount of identification target information contained in that data set by the number of pieces of learning data contained in that data set.
  21.  The method of generating a classifier according to any one of claims 18 to 20, wherein the number of learning steps is n (n is an integer of 2 or more), and the amount of the identification target information decreases monotonically as n increases.
  22.  前記データが、画像データであり、前記識別対象情報が識別対象領域である請求項18乃至21のいずれか1項に記載の識別器の生成方法。 The method for generating a classifier according to any one of claims 18 to 21, wherein the data is image data and the identification target information is an identification target area.
  23.  前記識別対象情報の量が、前記画像データ中の識別対象領域の面積である請求項22に記載の識別器の生成方法。 The method for generating a classifier according to claim 22, wherein the amount of the identification target information is the area of the identification target area in the image data.
  24.  前記初期データセットが、前記画像データの一部が選択されて生成した画像を含む請求項18乃至23のいずれか1項に記載の識別器の生成方法。 The method for generating a classifier according to any one of claims 18 to 23, wherein the initial data set includes an image generated by selecting a part of the image data.
  25.  前記データが、音声データである請求項18乃至21のいずれか1項に記載の識別器の生成方法。 The method for generating a classifier according to any one of claims 18 to 21, wherein the data is voice data.
  26.  前記データが、テキストデータである請求項18乃至21のいずれか1項に記載の識別器の生成方法。 The method for generating a classifier according to any one of claims 18 to 21, wherein the data is text data.
  27.  前記初期データセットから、前記第1の学習用データセットと前記第2の学習用データセットを自動で決定する請求項18乃至26のいずれか1項に記載の識別器の生成方法。 The method for generating a classifier according to any one of claims 18 to 26, wherein the first learning data set and the second learning data set are automatically determined from the initial data set.
  28.  An identification method of identifying the identification target information in the data by using a classifier generated by the method of generating a classifier according to any one of claims 18 to 27.
  29.  An identification device comprising a classifier generated by the method of generating a classifier according to any one of claims 18 to 27.
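The staged learning recited in claims 18 to 21 can be pictured with a short, self-contained Python sketch. Everything below (the synthetic images, the per-pixel logistic model, and the names of the helper functions) is an illustrative assumption of this edit, not the implementation disclosed in the specification; it only shows one classifier being trained first on the learning data set with the larger claim-20 amount of identification target information and then updated on the set with the smaller amount.

```python
# Minimal sketch (assumptions only) of the staged training of claims 18-21:
# an initial data set is split into learning data sets ordered by the
# per-data-set amount of identification target information (claim 20:
# total target amount / number of learning data), and one classifier is
# trained stage by stage, each later stage updating the parameters
# produced by the earlier one.
import numpy as np

rng = np.random.default_rng(0)


def make_sample(target_fraction, size=16):
    """Synthetic image and mask whose target region covers roughly
    `target_fraction` of the pixels (toy stand-in for the learning data)."""
    mask = (rng.random((size, size)) < target_fraction).astype(np.float64)
    image = 0.6 * mask + 0.2 * rng.random((size, size))  # targets are brighter
    return image, mask


def amount_of_target_info(dataset):
    """Claim 20 metric: total target pixels divided by the number of samples."""
    return sum(mask.sum() for _, mask in dataset) / len(dataset)


def train_stage(weights, dataset, lr=0.5, epochs=20):
    """One learning step: update (w, b) of a per-pixel logistic classifier."""
    w, b = weights
    for _ in range(epochs):
        for image, mask in dataset:
            z = w * image + b
            p = 1.0 / (1.0 + np.exp(-z))   # predicted target probability
            grad = p - mask                # gradient of mean BCE w.r.t. z
            w -= lr * np.mean(grad * image)
            b -= lr * np.mean(grad)
    return w, b


# Learning data sets whose target amount decreases from stage to stage.
dataset_1 = [make_sample(0.6) for _ in range(8)]   # first set: more target info
dataset_2 = [make_sample(0.2) for _ in range(8)]   # second set: less target info
assert amount_of_target_info(dataset_1) > amount_of_target_info(dataset_2)

weights = (0.0, 0.0)
for stage, dataset in enumerate([dataset_1, dataset_2], start=1):
    weights = train_stage(weights, dataset)        # later stages refine earlier ones
    print(f"stage {stage}: amount={amount_of_target_info(dataset):.1f}, w={weights[0]:.2f}")
```

The same loop extends to n learning steps by appending further learning data sets whose amounts of identification target information decrease monotonically, as recited in claim 21.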
  30.  A method of generating a classifier for estimating identification target information in data, the method comprising:
     an augmentation step of, for a learning data set group having a first learning data set that includes pieces of learning data each composed of input data and teacher data for the input data, and a second learning data set that includes a larger number of the pieces of learning data than the first learning data set, augmenting the learning data such that the number of the pieces of learning data included in the first learning data set becomes equal to or greater than the number of the pieces of learning data included in the second learning data set; and
     a generation step of generating the classifier by using the learning data set group having the augmented learning data,
     wherein the amount of the identification target information in the input data included in the first learning data set is larger than the amount of the identification target information in the input data included in the second learning data set.
  31.  The generation method according to claim 30, wherein the data is image data, the input data is input image data, and the identification target information is an identification target region.
  32.  The generation method according to claim 31, wherein the identification target information is information on at least one of the position, area, and distribution of the identification target region in the image data.
  33.  The generation method according to claim 32, wherein the amount of the identification target information is the number of pixels included in the identification target region.
  34.  The generation method according to claim 32 or 33, wherein the augmentation step includes a step of generating new input data by performing at least one of rotation, flipping, luminance conversion, distortion, enlargement, and reduction on the input data.
  35.  The generation method according to claim 30, wherein the data is sound data and the input data is sound input data.
  36.  The generation method according to claim 35, wherein the identification target information is a sound to be identified.
  37.  The generation method according to claim 36, wherein the amount of the identification target information is the intensity of a specific sound included in the sound to be identified.
  38.  The generation method according to claim 36 or 37, wherein the augmentation step includes a step of generating new input data by adding, to the input data, a sound obtained by combining sounds of one or a plurality of frequencies.
  39.  The generation method according to claim 30, wherein the data is text data and the input data is input text data.
  40.  The generation method according to claim 39, wherein the identification target information is a character or a character string to be identified.
  41.  The generation method according to claim 39 or 40, wherein the amount of the identification target information is the number of the characters or character strings to be identified.
  42.  The generation method according to any one of claims 30 to 41, wherein the learning data set group has three or more learning data sets, and the augmentation step is performed such that the learning data set whose learning data have the largest amount of the identification target information in the learning data set group comes to include the largest number of pieces of learning data.
  43.  The generation method according to any one of claims 30 to 42, wherein the generation step is performed by using a U-Net.
  44.  A classifier generated by the generation method according to any one of claims 30 to 43.
  45.  An information processing device comprising an inference means for inferring, with respect to inference data input to the classifier according to claim 44, the identification target information included in the inference data.
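As a rough illustration of the augmentation step of claims 30 to 34, the sketch below inflates a small but target-rich first learning data set by rotation, flipping, and luminance conversion until it is at least as large as the second learning data set, using the pixel count of claim 33 as the amount of identification target information. The toy data, the particular transform choices, and the stubbed generation step are assumptions made here for illustration only; the specification's generation step (claim 43) would train a U-Net on the augmented learning data set group, which is omitted to keep the example dependency-free.

```python
# Minimal sketch (assumptions, not the published implementation) of the
# augmentation step of claims 30-34: the first learning data set has fewer
# samples but more identification target information per input image, so
# new learning data are generated until it is at least as large as the
# second learning data set.
import numpy as np

rng = np.random.default_rng(1)


def target_pixels(mask):
    """Claim 33: amount of identification target information = pixel count."""
    return int(mask.sum())


def augment_once(image, mask):
    """Claim 34: rotation, flipping, or luminance conversion of the input data.
    Geometric transforms are applied to the image and teacher mask together."""
    op = rng.integers(3)
    if op == 0:                                   # rotation by 90 degrees
        return np.rot90(image), np.rot90(mask)
    if op == 1:                                   # horizontal flip
        return np.flip(image, axis=1), np.flip(mask, axis=1)
    gain = rng.uniform(0.8, 1.2)                  # luminance conversion (image only)
    return np.clip(image * gain, 0.0, 1.0), mask


def augment_until(dataset, minimum_size):
    """Augmentation step of claim 30: inflate `dataset` until it has at
    least `minimum_size` pairs of (input data, teacher data)."""
    augmented = list(dataset)
    while len(augmented) < minimum_size:
        image, mask = augmented[rng.integers(len(dataset))]
        augmented.append(augment_once(image, mask))
    return augmented


def toy_pair(fraction):
    """Synthetic (input image, teacher mask) pair; stand-in for real data."""
    mask = (rng.random((32, 32)) < fraction).astype(np.float64)
    return 0.5 * mask + 0.3 * rng.random((32, 32)), mask


# Toy learning data set group: set 1 is small but target-rich, set 2 is larger.
set_1 = [toy_pair(0.5) for _ in range(5)]
set_2 = [toy_pair(0.1) for _ in range(20)]
assert np.mean([target_pixels(m) for _, m in set_1]) > np.mean(
    [target_pixels(m) for _, m in set_2])

set_1 = augment_until(set_1, minimum_size=len(set_2))
print(len(set_1), len(set_2))   # set 1 is now no smaller than set 2
# The generation step (claim 43) would follow here, e.g. fitting a U-Net
# on the augmented learning data set group.
```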
  46.  A generation device for generating a classifier for estimating identification target information in data, the device comprising:
     an augmentation means for, for a learning data set group having a first learning data set that includes pieces of learning data each composed of input data and teacher data for the input data, and a second learning data set that includes a larger number of the pieces of learning data than the first learning data set, augmenting the learning data such that the number of the pieces of learning data included in the first learning data set becomes equal to or greater than the number of the pieces of learning data included in the second learning data set; and
     a generation means for generating the classifier by using the learning data set group having the augmented learning data,
     wherein the amount of the identification target information in the input data included in the first learning data set is larger than the amount of the identification target information in the input data included in the second learning data set.
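Claims 35 to 42 carry the same quantities over to sound and text data. The following sketch is again an illustration built on assumptions of this edit rather than the disclosed implementation; it shows one plausible reading of each quantity: the number of target strings for text (claim 41), the spectral intensity of a specific frequency for sound (claim 37), and the claim-42 rule that, among three or more learning data sets, the set with the most identification target information ends up with the most learning data after augmentation.

```python
# Minimal sketch (illustrative assumptions only) of the amount-of-
# identification-target-information measures for sound and text data and
# of the claim-42 ordering rule.
import numpy as np


def text_amount(text, target="defect"):
    """Claim 41 (one reading): number of occurrences of the target string."""
    return text.count(target)


def sound_amount(waveform, sample_rate, target_hz):
    """Claim 37 (one reading): spectral magnitude at the target frequency."""
    spectrum = np.abs(np.fft.rfft(waveform))
    freqs = np.fft.rfftfreq(len(waveform), d=1.0 / sample_rate)
    return float(spectrum[np.argmin(np.abs(freqs - target_hz))])


def sizes_after_augmentation(amounts, base_size, step=5):
    """Claim 42 rule: order the data sets by amount of target information and
    make the richest one the largest after augmentation."""
    order = np.argsort(amounts)                     # ascending amount
    sizes = np.empty(len(amounts), dtype=int)
    sizes[order] = base_size + step * np.arange(len(amounts))
    return sizes.tolist()


if __name__ == "__main__":
    print(text_amount("defect near weld; second defect near edge"))   # -> 2
    t = np.linspace(0.0, 1.0, 8000, endpoint=False)
    wave = 0.7 * np.sin(2 * np.pi * 440 * t) + 0.1 * np.sin(2 * np.pi * 1000 * t)
    print(round(sound_amount(wave, sample_rate=8000, target_hz=440.0), 1))
    print(sizes_after_augmentation([30.0, 5.0, 12.0], base_size=10))   # richest set -> 20
```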
PCT/JP2020/039496 2019-10-31 2020-10-21 Image processing device, image processing device control method, identifier generation method, identification method, identification device, identifier generation device, and identifier WO2021085258A1 (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
JP2019199099 2019-10-31
JP2019-199099 2019-10-31
JP2019-217334 2019-11-29
JP2019217335 2019-11-29
JP2019217334 2019-11-29
JP2019-217335 2019-11-29

Publications (1)

Publication Number Publication Date
WO2021085258A1

Family

ID=75715063

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/039496 WO2021085258A1 (en) 2019-10-31 2020-10-21 Image processing device, image processing device control method, identifier generation method, identification method, identification device, identifier generation device, and identifier

Country Status (2)

Country Link
JP (1) JP2021093142A (en)
WO (1) WO2021085258A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024116309A1 (en) * 2022-11-30 2024-06-06 日本電気株式会社 Image generation device, learning device, image generation method, and program

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013114596A (en) * 2011-11-30 2013-06-10 Kddi Corp Image recognition device and method
JP2015103144A (en) * 2013-11-27 2015-06-04 富士ゼロックス株式会社 Image processing device and program


Also Published As

Publication number Publication date
JP2021093142A (en) 2021-06-17

Similar Documents

Publication Publication Date Title
Baur et al. Generating highly realistic images of skin lesions with GANs
CN110543837B (en) Visible light airport airplane detection method based on potential target point
CN109791693B (en) Digital pathology system and related workflow for providing visualized whole-slice image analysis
CN111524106B (en) Skull fracture detection and model training method, device, equipment and storage medium
JP6710135B2 (en) Cell image automatic analysis method and system
CN105144239B (en) Image processing apparatus, image processing method
JP5174040B2 (en) Computer-implemented method for distinguishing between image components and background and system for distinguishing between image components and background
CN112598643B (en) Depth fake image detection and model training method, device, equipment and medium
JP6235921B2 (en) Endoscopic image diagnosis support system
JP2016534709A (en) Method and system for classifying and identifying individual cells in a microscopic image
JPWO2007029467A1 (en) Image processing method and image processing apparatus
Zheng et al. Improvement of grayscale image 2D maximum entropy threshold segmentation method
CN115775226B (en) Medical image classification method based on transducer
JP2020160543A (en) Information processing system and information processing method
WO2021085258A1 (en) Image processing device, image processing device control method, identifier generation method, identification method, identification device, identifier generation device, and identifier
JP2021170284A (en) Information processing device and program
Burget et al. Trainable segmentation based on local-level and segment-level feature extraction
JP2018206260A (en) Image processing system, evaluation model construction method, image processing method, and program
CN104268845A (en) Self-adaptive double local reinforcement method of extreme-value temperature difference short wave infrared image
JPH1091782A (en) Method for extracting specific site for gradation picture
Zhang et al. Simultaneous lung field detection and segmentation for pediatric chest radiographs
Gugulothu et al. A novel deep learning approach for the detection and classification of lung nodules from ct images
JP6546385B2 (en) IMAGE PROCESSING APPARATUS, CONTROL METHOD THEREOF, AND PROGRAM
JP6425468B2 (en) Teacher data creation support method, image classification method, teacher data creation support device and image classification device
Khalid et al. DeepMuCS: a framework for co-culture microscopic image analysis: from generation to segmentation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20883053

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20883053

Country of ref document: EP

Kind code of ref document: A1