EP4318395A1 - A training method and an image instance segmentation method for an image mask generator - Google Patents
- Publication number: EP4318395A1 (application EP23186126.1A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- image
- mask
- generator
- target object
- sample
- Prior art date
- Legal status
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/094—Adversarial learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/194—Segmentation; Edge detection involving foreground-background segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Definitions
- the present disclosure relates to the field of image recognition, in particular to a training method and an image instance segmentation method for an image mask generator, a computer program product, and a computer device.
- Image segmentation serves as a basis for computer vision, and has become a hotspot in the field of image understanding.
- Image segmentation generally involves different tasks such as target detection, semantic segmentation, instance segmentation, etc.
- the deep learning-based instance segmentation method is being increasingly applied in the field of image understanding due to its high performance.
- Current instance segmentation methods based on conventional deep learning can obtain accurate instance segmentation results for unblocked image regions, but the instance segmentation results for blocked image regions are poor.
- the present disclosure provides a training method and an image instance segmentation method for an image mask generator, a computer program product, and a computer device to at least address some technical issues in the prior art.
- a training method for an image mask generator comprising: selecting two sample images, one from each of two sets of sample images, and inputting them to a generative adversarial network comprising a generator and a discriminator, each sample image comprising a target object, the target objects of the first set of sample images among the two sets being unblocked, and the target objects of the second set of sample images being partially blocked; using the generator to generate a mask for each of the two sample images, the mask of each sample image being used to predict a target object in that sample image; inputting the generated masks of the two sample images to the discriminator, and constructing an adversarial loss function based on the discrimination results of the discriminator for the generated masks of the two sample images.
- the training samples used to train the generator comprise sample images with unblocked target objects and sample images with partially blocked target objects.
- the generator can generate image masks for the two different categories of sample images, i.e., masks for images with unblocked objects and masks for images with blocked objects.
- the discriminator determines the categories of the generated masks of two sample images and constructs adversarial loss functions based on the determination results.
- when the trained generator predicts the mask of a partially blocked image region, the predicted mask closely resembles the mask predicted for an unblocked image region, and the blocked portion of the target object's mask can, to a certain extent, be intelligently completed, thereby successfully fooling the discriminator or having a very low probability of being recognized by the discriminator, and thus improving the intelligence, accuracy, and reliability of the generator for instance segmentation of blocked image regions.
- the adversarial loss function comprises a stack of a first loss item and a second loss item, wherein the first loss item is constructed based on a first discrimination result of the discriminator for a mask of the first sample image generated by the generator, the first sample image being taken from the first set of sample images; the second loss item is constructed based on a second discrimination result of the discriminator for a mask of the second sample image generated by the generator, the second sample image being taken from the second set of sample images.
- the first discrimination result comprises: a probability, estimated by the discriminator, that the mask of the first sample image generated by the generator is the mask of an image with unblocked target objects;
- the second discrimination result comprises: a probability, estimated by the discriminator, that the mask of the second sample image generated by the generator is the mask of an image with at least partially blocked target objects.
- the adversarial loss function embodies both the probability of the discriminator discriminating the mask of the first set of images generated by the generator as that of an unblocked image and the probability of the discriminator discriminating the mask of the second set of images generated by the generator as that of an at least partially blocked image, thereby revealing the total loss of the discriminator, wherein the second loss item forms the adversarial item between the discriminator and the generator.
- during the training process, the discriminator and the generator oppose each other, aiming respectively to reduce and to increase this loss item.
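The stacked loss described above can be sketched as a binary cross-entropy style sum of the two loss items. This is a minimal illustration assuming the discriminator outputs exactly the two probabilities described; it is not necessarily the patent's precise formulation:

```python
import math

def adversarial_loss(d_first_unblocked, d_second_blocked, eps=1e-7):
    """Sum ("stack") of the two loss items described in the text.

    d_first_unblocked:  discriminator's estimated probability that the
                        generated mask of the first sample (unblocked set)
                        is the mask of an unblocked image.
    d_second_blocked:   discriminator's estimated probability that the
                        generated mask of the second sample (blocked set)
                        is the mask of a partially blocked image.
    The loss is zero when the discriminator is perfectly confident in
    both discrimination results, and grows as confidence drops.
    """
    first_item = -math.log(max(d_first_unblocked, eps))
    second_item = -math.log(max(d_second_blocked, eps))
    return first_item + second_item
```

The second item is the adversarial one: the generator is trained to drive `d_second_blocked` down (making blocked-image masks look unblocked), while the discriminator is trained to keep it high.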
- a mask of the two sample images is generated respectively, comprising: generating, with the generator, a pixel-level mask probability of a target object in the two sample images, respectively.
- the training method further comprises: implementing object detection on the two sample images to acquire annotated images of the two sample images, each annotated image comprising an annotation result of a bounding box of a target object in the sample image; inputting the two sample images into the generative adversarial network comprises: inputting the annotated images of the two sample images into the generative adversarial network; using the generator to generate the masks of the two sample images respectively comprises: generating, with the generator, the masks of the target objects having the annotation results in the two sample images, respectively.
- the annotated images carrying the annotation results of the bounding boxes of the target objects are used as the training samples, facilitating the generator in generating a mask for the image region containing the target object, and providing effective training samples.
- the training method comprises a plurality of iterative training processes that repeat the training steps of the generator, for the training purposes of reducing the difference in the mask probability distributions of the target objects in the two sample images generated by the generator and/or enhancing the capability of the discriminator to differentiate between the mask categories of the two sample images generated by the generator.
- generating, with the generator, a mask for the two sample images comprising: generating, with the generator, a mask of a plurality of target objects in at least one of the two sample images, inputting the generated masks of the two sample images into the discriminator during each training process, comprising: filtering the generated masks of the plurality of target objects in the at least one sample image to obtain the mask of one target object in each sample image, inputting the mask of one target object in each sample image into the discriminator, and constructing the adversarial loss function based on the discrimination result of the discriminator for the generated mask of one target object in each sample image.
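The filtering step above, keeping the mask of a single target object per sample image before it is passed to the discriminator, might look like the following sketch. The selection rule used here (keep the mask with the largest foreground area) is an illustrative assumption; the patent does not fix the criterion:

```python
import numpy as np

def filter_to_single_mask(masks, threshold=0.5):
    """Keep one target-object mask per sample image.

    masks: list of (H, W) float arrays of per-pixel mask probabilities,
           one per detected target object in the sample image.
    Returns the mask whose thresholded foreground covers the most
    pixels (an assumed, illustrative selection rule).
    """
    areas = [int((m >= threshold).sum()) for m in masks]
    return masks[int(np.argmax(areas))]
```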
- the training termination condition comprises: terminating the iterative training processes when a loss function value determined by the adversarial loss function is within the first predetermined threshold range; and/or, obtaining a pixel count distribution map of the mask probability of the target object in the two sample images, calculating the standard deviation of a pixel count distribution of the mask probability according to the pixel count distribution map of the mask probability, and terminating the iterative training processes when a difference of standard deviation of the pixel count distribution of the mask probability of the target object in the two sample images is within the second predetermined threshold range.
- a robust generator is trained so as to achieve both training purposes, namely reducing the difference in the mask probability distributions of the target objects in the two sample images generated by the generator and enhancing the capability of the discriminator to differentiate between the mask categories of the two sample images generated by the generator, thus achieving a Nash equilibrium.
- each set of sample images in the two sets comprises a plurality of sample images, each sample image comprising at least one target object region, each target object region comprising at least one target object; the plurality of iterative training processes comprise: during each iterative training process, selecting a sample image from each of the two sets of sample images and inputting it as a training sample into the generative adversarial network, so that the plurality of sample images in each set are traversed over the plurality of iterative training processes; and/or, each sample image comprises a plurality of target object regions, and different target object regions of the same sample image are used as training samples input into the generative adversarial network during different iterative training processes, so that the different target object regions of the same sample image are traversed.
- the second set of sample images comprises virtual images in which a partially blocked target object is formed by constructing a relative positional relationship between a blocking object and an unblocked initial target object.
- the training method further comprises: obtaining a mask ground truth of the unblocked initial target object corresponding to a partially blocked target object among the plurality of partially blocked target objects; generating, with the generator, the masks of the two sample images respectively comprises: generating, with the generator, the masks of the plurality of partially blocked target objects in the virtual image, and filtering the generated masks of the plurality of partially blocked target objects using the acquired mask ground truth of the corresponding unblocked initial target object, to acquire the mask of one partially blocked target object generated by the generator.
- for each set of images, the mask of one target object is retained for training, facilitating learning of the distribution pattern of the mask probability of a single target object, and yielding an image mask generator capable of predicting different instances.
- implementing object detection of the two sample images comprises: generating a bounding box of the partially blocked target object in the virtual image according to a bounding box of the unblocked initial target object, thereby obtaining the annotated images of the set of virtual images; or, generating a binary mask of the partially blocked target object in the virtual image, and generating a bounding box of the partially blocked object according to the generated binary mask.
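Deriving a bounding box from a generated binary mask, as in the second alternative above, reduces to computing the extent of the mask's nonzero pixels. `bbox_from_binary_mask` is a hypothetical helper name:

```python
import numpy as np

def bbox_from_binary_mask(mask):
    """Derive a 2-D bounding box (x_min, y_min, x_max, y_max) from a
    binary mask of a partially blocked target object."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None  # empty mask: no object to box
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```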
- using virtual images as training samples facilitates accurate object detection of the partially blocked target object before its mask is generated: a reliable bounding box can be obtained from the bounding box of the unblocked initial target object, which applies even when the blocking portion covers at least part of that bounding box and the partially blocked target object cannot be accurately detected directly; alternatively, when the unblocked area of the partially blocked target object can still support detection, the bounding box can be determined from the binary mask of the partially blocked object.
- the two sample images comprise a real image from the second set of sample images; implementing object detection of the two sample images to acquire the annotated images of the two sample images respectively comprises: implementing object detection of the real image by automatic annotation and/or manual annotation to obtain the annotated image of the real image.
- the training samples of the image mask generator are not limited to virtual images, but real images can also be used, for which the object detection is implemented by manual annotation or a combination of manual annotation and automatic annotation, thereby improving the accuracy of the annotation results and enhancing the training efficiency of the generative adversarial network.
- the annotated image of each sample image further comprises an annotated result of a category of a target object in the sample image
- the training method further comprising: generating, with the generator, a category of a target object in the two sample images, respectively.
- the trained image mask generator is capable of outputting not only a pixel-level target object mask, but also a target object category for use in image instance segmentation.
- the examples of the present disclosure also provide an image instance segmentation method, comprising: implementing the object detection of a received image to identify a bounding box of a target object in the received image; using the image mask generator to generate a mask of the target object based on the bounding box, wherein the image mask generator is acquired using a training method according to the examples of the present disclosure.
- the image instance segmentation method further comprises: implementing object detection of the received image to identify the category of a target object in the received image; outputting the mask and category of the target object with the help of the image mask generator.
- the image instance segmentation method can obtain accurate instance segmentation results not only for images with unblocked objects, but also for images with blocked objects.
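The two-stage inference described above, detecting a bounding box and category first and then generating a mask per box with the trained image mask generator, can be sketched as follows. `detector` and `mask_generator` are hypothetical callables standing in for the trained models, and the score threshold is an assumed detail:

```python
def segment_instances(image, detector, mask_generator, score_thresh=0.5):
    """Instance segmentation as described in the text: object detection
    yields bounding boxes and categories; the image mask generator then
    produces a mask for each retained box."""
    results = []
    for box, category, score in detector(image):
        if score < score_thresh:
            continue  # discard low-confidence detections
        mask = mask_generator(image, box)
        results.append({"box": box, "category": category, "mask": mask})
    return results
```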
- the pre-trained image mask generator is adopted to obtain accurate and reliable instance segmentation results, enhancing the understanding of image contents by the instance segmentation method, especially its accuracy and reliability, and expanding the application of instance segmentation technology to real-world scenarios presenting complex image contents.
- the examples of the present disclosure also provide a computer program product comprising a computer program that, when executed by a processor, implements a training method of an image mask generator according to the examples of the present disclosure or an image instance segmentation method according to the examples of the present disclosure.
- the examples of the present disclosure also provide a computer-readable storage medium storing executable code which, when executed, implements a training method of an image mask generator according to the examples of the present disclosure or an image instance segmentation method according to the examples of the present disclosure.
- the examples of the present disclosure also provide a computer device comprising a processor, a memory, and a computer program stored on the memory that when executed by a processor implements a training method for an image mask generator according to the examples of the present disclosure or an image instance segmentation method according to the examples of the present disclosure.
- Fig. 1 shows a flow diagram of a training method for an image mask generator, according to an example of the present disclosure.
- the training method for the image mask generator comprises:
- the masks of the two sample images are generated with the generator, respectively, comprising: generating, with the generator, a pixel-level mask probability of a target object in the two sample images, respectively.
- the adversarial loss function comprises a stack of a first loss item and a second loss item, wherein the first loss item is constructed based on a first discrimination result of the discriminator for a mask of a first sample image generated by the generator, the first sample image being taken from the first set of sample images; the second loss item is constructed based on a second discrimination result of the discriminator for a mask of a second sample image generated by the generator, the second sample image being taken from the second set of sample images.
- the first discrimination result comprises: the discriminator estimates a probability that a mask of a first sample image generated by the generator is a mask of an image with unblocked target objects; the second discrimination result comprises: the discriminator estimates a probability that a mask of a second sample image generated by the generator is a mask of an image with at least partially blocked target objects.
- the training method further comprises a plurality of iterative training processes that achieve a Nash equilibrium, for the training purposes of reducing the difference in the mask probability distributions of the target objects in the two sample images generated by the generator and enhancing the capability of the discriminator to differentiate between the mask categories of the two sample images generated by the generator, by repeating the training steps 11, 13, 15, and 17 on the generator; at the beginning of each training round, different sample images are selected from the two sets of sample images and inputted into the generative adversarial network.
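The alternating iterative training described above (fresh sample images from the two sets each round, with the discriminator and generator updated in opposition) can be sketched schematically. `sample_batches`, `generator_step`, and `discriminator_step` are hypothetical callables; the patent does not prescribe an optimizer or update rule:

```python
def train_gan(sample_batches, generator_step, discriminator_step, n_iters):
    """Schematic alternating training loop for the mask-generation GAN.

    Each iteration draws one sample image from each of the two sets
    (unblocked / partially blocked), updates the discriminator to better
    separate the two mask categories, then updates the generator to
    reduce the difference between the two mask probability distributions.
    The step callables are assumed to update the models and return
    their current loss values.
    """
    history = []
    for i in range(n_iters):
        x_unblocked, x_blocked = sample_batches(i)  # one image per set
        d_loss = discriminator_step(x_unblocked, x_blocked)
        g_loss = generator_step(x_unblocked, x_blocked)
        history.append((d_loss, g_loss))
    return history
```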
- the mask category comprises a mask category for an image with unblocked target objects or a mask category for an image with at least partially blocked target objects.
- the training method may further comprise constructing the generative adversarial network.
- the training method may further comprise Step 19 to determine whether the training termination condition is satisfied; if yes, proceed to Step 191 to terminate the training; if no, proceed to Step 11 to repeat the training steps 11, 13, 15 and 17 on the generator.
- the training termination condition comprises: terminating the iterative training processes when the loss function value determined according to the adversarial loss function is within the first predetermined threshold range; and/or, obtaining a pixel count distribution map of the mask probability of the target object in the two sample images, calculating the standard deviation of a pixel count distribution of the mask probability according to the pixel count distribution map of the mask probability, and terminating the iterative training processes when the difference in the standard deviation of the pixel count distribution of the mask probability of the target object in the two sample images is within the second predetermined threshold range.
- either of the above two training termination conditions may be used as the discrimination criterion in Step 19, or both may be used. In the latter case, the two termination conditions may be required to be met simultaneously, or meeting either one of them may suffice to terminate the training.
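The second termination criterion, comparing the standard deviations of the pixel-count distributions of mask probabilities for the two sample images, can be sketched as follows. The bin count of the histogram is an assumed detail:

```python
import numpy as np

def std_of_mask_probability_histogram(mask_probs, n_bins=10):
    """Standard deviation of the pixel-count distribution of mask
    probabilities (the histogram described in the text): each bin
    counts how many pixels fall into a probability interval."""
    counts, _ = np.histogram(mask_probs, bins=n_bins, range=(0.0, 1.0))
    return float(counts.std())

def should_terminate(mask_probs_unblocked, mask_probs_blocked, threshold):
    """Terminate when the two distributions have nearly equal spread,
    i.e. the blocked-image masks look statistically like unblocked ones."""
    diff = abs(std_of_mask_probability_histogram(mask_probs_unblocked)
               - std_of_mask_probability_histogram(mask_probs_blocked))
    return diff <= threshold
```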
- each set of sample images may comprise a plurality of sample images
- each sample image may comprise a plurality of objects
- the target object may be the object whose mask is to be predicted in each sample image.
- the target objects in the first set of sample images being unblocked comprises: one or more target objects in each sample image of the first set of sample images are in an unblocked state; the target objects in the second set of sample images being partially blocked comprises: there may be at least one partially blocked target object in each sample image of the second set of sample images.
- each of the two sets of sample images provided should contain as many categories of target objects as possible, the target objects in each set of sample images having different features such as different shapes, sizes, categories, colors, numbers, and/or locations.
- the two sample images may be randomly selected and inputted into the generative adversarial network, one from each of the two sets of sample images, such that the target objects in the two sample images have different blocked states.
- the target objects included in the two selected sample images need not be restricted; for example, the randomly selected target objects in the two sample images may have different features such as size, shape, category, color, number, location, etc.
- the generator and the discriminator may learn the difference between the masks of blocked and unblocked objects mainly from the distribution of the generated mask probabilities of the target objects in the two sample images, rather than merely or primarily from feature information such as shape, size, and category.
- the training samples for training the generator comprise the sample images with unblocked target objects and the sample images with partially blocked target objects.
- the generator can generate masks for images with unblocked objects and masks for images with blocked objects; the discriminator discriminates the generated masks of the two sample images and constructs the adversarial loss functions based on the discrimination results.
- GAN generative adversarial network
- Fig. 2 shows a flow diagram of a training method for an image mask generator according to another example of the present disclosure
- Fig. 3 shows a flow diagram of processing a first sample image and a second sample image utilizing an instance segmentation model and a generative adversarial network according to an example of the present disclosure.
- the training method comprises:
- the two sample images are taken from two sets of sample images, respectively; the two sets of sample images comprise a first set of sample images and a second set of sample images; the first set of sample images comprises a plurality of first sample images, one or more target objects of each of the first sample images being in an unblocked state; the second set of sample images comprises a plurality of second sample images, at least one target object of each of the second sample images being in a partially blocked state.
- Figs. 4 and 5 show schematic views of a first sample image 100 and a second sample image 200, respectively, according to an example of the present disclosure.
- object detection of the two sample images in Step 21 comprises: object detection of the first sample image 100 and the second sample image 200 is performed using the same object detector 300, a pre-trained model with fixed parameters (e.g., weights). To show more clearly that the first sample image 100 and the second sample image 200 are each processed by the object detector 300 to output respective annotated images, Fig. 3 illustrates two object detectors 300 drawn in dashed lines; essentially, however, the two are one and the same object detector 300.
- the bounding box may be a two-dimensional bounding box, as shown in Figs. 6 and 7, which illustrate the annotated image 110 of the first sample image and the annotated image 210 of the second sample image, respectively; Fig. 6 also shows a bounding box 111 of the unblocked target object 101, and Fig. 7 further shows a bounding box 211 of the partially blocked target object 201, wherein the second sample image 200 comprises a plurality of target objects, with only the annotation result of the bounding box 211 of one target object 201 being shown.
- the annotated image of each sample image obtained by object detection further comprises an annotated result for a category of a target object in the sample image.
- the training method further comprises: using the generator 303 to generate the categories of the target objects having the annotation results in the two sample images; the category information of these target objects may be outputted in Step 25 together with their masks, so that the generator 303 trained for predicting image masks can output not only pixel-level target object masks but also target object categories for image instance segmentation.
- the target object is identified based on the object detection, and the annotated images with annotation results of the bounding box of the target objects are used as the training samples to facilitate the generator 303 to generate a mask for the image region containing the target objects in the sample image, providing valid training samples.
- the instance segmentation model shown in Fig. 3 comprises an object detector 300 and a generator 303, the generator 303 being specifically a mask generator 303; here, a pre-trained object detector 300 is adopted, and the mask generator 303 is trained with the generative adversarial network (GAN).
- Step 23 further comprises constructing the generative adversarial network (GAN).
- the training scenario and training purpose of the GAN in the examples of the present disclosure differ from those of existing GANs; the purposes and uses of existing GANs will not be repeated here.
- the generator of the GAN according to the examples of the present disclosure is used to make predictions of image masks, and the two inputs of the discriminator of the GAN come from two outputs of the generator 303; the two input ends of the discriminator 305 receive the masks of two sample images generated by the generator 303, the two sample images having different blocked states of their objects, thereby accounting for object blocking in the training process of the GAN.
- the GAN shown in Fig. 3 comprises a generator 303 and a discriminator 305, i.e., the generator 303 in the instance segmentation model is trained concurrently with the training of the GAN, because the GAN and the instance segmentation model share the generator 303; it may also be considered as adding a discriminator 305 for adversarial training on the generator 303 in the instance segmentation model.
- the masks of the target objects having annotation results in the two sample images are generated with the generator 303 in Step 25, comprising: generating, with the generator 303, a pixel-level mask probability for the target objects with annotation results in the two sample images, for example, generating, with the generator 303, a pixel-level mask probability for each sample image within a target region of the target objects having annotation results, the mask probability being greater than or equal to 0 and less than or equal to 1.
- the target region may be an image region occupied solely by the target object or may be an image region defined by the bounding box of the target object, as shown in Figs. 8 and 9.
- the bounding box 113 of the mask of the target object 101 in the first sample image 100 shown in Fig. 8 is clear and the mask probability of the target object 101 is evenly distributed, i.e., the mask probability values of all pixels occupied by the target object 101 within the mask's bounding box 113 are approximately 1, while Fig. 9 shows the mask of the partially blocked target object 201 in the second sample image 200.
- the pixel-level mask probability of the target object in Figs. 8 and 9 may be reflected by the gray scale or color level of the pixels of the target object region, e.g., the pixel-level mask probability may be directly proportional to the gray scale value or color level of the pixel; Figs. 8 and 9 are shown in gray scale, and the mask probability generated in actual applications may be represented in different colors.
- binarization or thresholding may be applied to the pixel-level image mask probability generated by the generator 303 to acquire a binary mask of an object instance for instance segmentation, so the accuracy of the instance segmentation results depends on the accuracy of the pixel-level image mask probability generated by the generator 303.
- the mask predictions shown in Fig. 9 have a higher mask probability in the two end areas of the target object 201 (the relatively bright areas within the bounding box 213 of the mask in Fig. 9).
- if binarization is performed directly, the mask located in the central area may be filtered out because its probability value is below the threshold, and only the mask portions located in the two end areas are retained, resulting in an inability to fully and correctly segment the instance. Further, because the bounding box of the masks of the target objects generated by the generator is unclear at this time, the mask probabilities of a plurality of target objects may be coupled at the boundary into a continuous binary mask after binarization, resulting in the misidentification of the plurality of objects as a single instance.
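The direct-binarization failure described here can be illustrated with a short sketch (numpy; the threshold value and the toy one-dimensional probability profile are assumptions for illustration, not values from the disclosure):

```python
import numpy as np

# Toy 1-D slice of a pixel-level mask probability map for one target object.
# The two end areas have a high probability; the blocked central area is low,
# mimicking the situation of Fig. 9.
mask_prob = np.array([0.9, 0.85, 0.8, 0.2, 0.1, 0.15, 0.8, 0.9, 0.95])

threshold = 0.5  # assumed binarization threshold
binary_mask = (mask_prob >= threshold).astype(np.uint8)

# Count the connected runs of 1s: direct binarization splits the single
# partially blocked object into two disjoint mask fragments.
runs = np.sum(np.diff(np.concatenate(([0], binary_mask))) == 1)
print(binary_mask.tolist())  # [1, 1, 1, 0, 0, 0, 1, 1, 1]
print(runs)                  # 2 fragments for what is really one instance
```

This is exactly the failure mode the adversarial training is meant to remove: a well-trained generator raises the probability in the blocked central area so that binarization yields one continuous mask.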
- the mask category comprises a mask category for images with unblocked target objects or a mask category for images with at least partially blocked target objects.
- the concept of a "mask" here draws upon the concept of a "mask" (photomask) in semiconductor manufacturing.
- the image to be processed may be partially or fully blocked (or understood to be covered) with the selected graphic or the like to control the area of the image processing.
- the graphic used for coverage or masking, etc. may be referred to as a mask.
- the mask may generally be used to extract areas of interest in the image or shield certain areas in the image or the like.
- the mask of the image may be a mask corresponding to a foreground object in an image frame to predict an area corresponding to the foreground object in the image frame, the mask probability comprising an instance mask probability.
- the adversarial loss function 307 constructed in Step 27 comprises a stack of a first loss item and a second loss item, wherein the first loss item is constructed based on a first discrimination result of the discriminator 305 for a mask of the first sample image 100 generated by the generator 303, the first sample image 100 being taken from the first set of sample images, i.e., the target objects of the first sample image 100 are unblocked; the second loss item is constructed based on a second discrimination result of the discriminator 305 for a mask of the second sample image 200 generated by the generator 303, the second sample image 200 being taken from the second set of sample images, i.e., the target objects of the second sample image 200 are partially blocked.
- the first discrimination result comprises: The discriminator 305 estimates a probability that the mask of the first sample image 100 generated by the generator 303 is a mask of an image with unblocked target objects; the second discrimination result comprises: The discriminator 305 estimates a probability that the mask of the second sample image 200 generated by the generator 303 is a mask of an image with at least partially blocked target objects.
- when the discriminator 305 determines that the mask of the first sample image 100 generated by the generator 303 is the mask of an image with unblocked target objects, the first discrimination result is 1; likewise, when the discriminator 305 determines that the mask of the second sample image 200 generated by the generator 303 is the mask of an image with at least partially blocked target objects, the second discrimination result is 1. For the discriminator 305, the training purpose is that the larger the sum of the first discrimination result and the second discrimination result, the better, while for the generator 303, the training purpose is that the smaller the second discrimination result, the better.
- the adversarial loss function 307 reflects both the probability of the discriminator 305 determining the masks of the first set of images generated by the generator 303 to be masks of images with unblocked target objects and the probability of the discriminator 305 determining the masks of the second set of images generated by the generator 303 to be masks of images with at least partially blocked target objects, thereby embodying the total loss of the discriminator 305, wherein the second loss item forms the adversarial item between the discriminator 305 and the generator 303; during the training process, the discriminator 305 and the generator 303 oppose each other, respectively seeking to increase and reduce the loss of this item.
- x is the mask of the (unblocked) target object with an annotation result for the first sample image 100 generated with the generator 303 in Step 25; specifically, x can be the pixel-level mask probability of the target object 101 from the first sample image 100 generated by the generator 303; D(x) is the probability, according to the estimation of the discriminator 305, that the mask x of the target object from the first sample image 100 generated by the generator 303 is a mask of an image with unblocked target objects; E_x is the expected value of the mask discrimination loss function log(D(x)) over all unblocked target objects.
- G(z) is the mask of the (partially blocked) target object with an annotation result from the second sample image 200 generated by the generator 303 in Step 25; specifically, G(z) can be the pixel-level mask probability of the target object 201 from the second sample image 200 generated by the generator 303; D(G(z)) is the probability, according to the estimation of the discriminator 305, that the mask G(z) of the target object from the second sample image 200 generated by the generator 303 is a mask of an image with unblocked target objects; E_z is the expected value of the mask discrimination loss function log(1-D(G(z))) over all partially blocked target objects.
- the generator 303, i.e., the G(·) item, attempts to minimize the value of the adversarial loss function Ladv, while the discriminator 305, i.e., the D(·) item, attempts to maximize the value of the adversarial loss function Ladv, thereby forming adversarial training.
- the adversarial loss function Ladv takes into account blocking factors between objects and is a loss function related to blocking.
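The terms above combine into the standard GAN objective Ladv = E_x[log(D(x))] + E_z[log(1 - D(G(z)))]. A minimal numerical sketch of this loss follows; the discriminator output values in the example calls are assumptions for illustration, not values from the disclosure:

```python
import numpy as np

def adversarial_loss(d_x, d_gz):
    """Ladv = E_x[log(D(x))] + E_z[log(1 - D(G(z)))].

    d_x:  discriminator's estimated probabilities that generated masks of
          first-set (unblocked) sample images are masks of unblocked images.
    d_gz: discriminator's estimated probabilities that generated masks of
          second-set (partially blocked) sample images are masks of
          unblocked images.
    """
    d_x, d_gz = np.asarray(d_x), np.asarray(d_gz)
    return np.mean(np.log(d_x)) + np.mean(np.log(1.0 - d_gz))

# Assumed discriminator outputs for a small batch:
print(adversarial_loss([0.9, 0.8], [0.2, 0.1]))  # ~ -0.33: D separates the categories (larger Ladv)
print(adversarial_loss([0.9, 0.8], [0.8, 0.9]))  # ~ -2.12: G fools D (smaller Ladv)
```

Consistent with the text, the discriminator tries to drive this value up while the generator tries to drive it down by making D(G(z)) large.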
- after training, G(z) will have more similarity to x; because x is the mask probability predicted by the generator 303 for an image whose target objects are not blocked, the mask x has a high quality.
- the GAN-based training method achieves the training purpose of reducing the mask probability distribution difference of the target object in the two sample images generated by the generator 303 and enhancing the capability of the discriminator 305 to differentiate the mask categories of two sample images generated by the generator 303 for Nash equilibrium.
- G(z) will have a higher quality close to the mask x after the training; also, an image mask generator 303 with higher performance is obtained, which can generate accurate and reliable instance segmentation results even for images with partially blocked target objects.
- the mask category comprises a mask category that belongs to an image with unblocked target objects or a mask category that belongs to an image with at least partially blocked target objects.
- the training method further comprises a plurality of iterative training processes, i.e., repeating Steps 23, 25, 27 and 29 to train the generator 303, selecting different sample images from the two sets of sample images as inputs to the generative adversarial network at the start of each training, Steps 23, 25, 27 and 29 forming a circulation.
- Step 21 may be one step within the circulation of the iterative training process, located before Step 23. In some other examples, Step 21 may be a step outside the circulation of the iterative training processes, i.e., after object detection is performed on each of the two sets of sample images in Step 21, repeated execution of the loop comprising Steps 23, 25, 27 and 29 begins, with one annotated image from the annotated images of each of the two sets of sample images inputted into the generative adversarial network in each circulation.
- the training method further comprises: updating the parameters of the discriminator 305 according to the adversarial loss function 307.
- the parameters of the generator 303 and the discriminator 305 may be updated simultaneously upon completion of a single training or at different training stages.
- the parameters of the discriminator 305 may be fixed first in the first training stage while the parameters of the generator 303 are updated according to the adversarial loss function 307; then, in the second training stage, the parameters of the generator 303 are fixed and the parameters of the discriminator 305 are updated according to the adversarial loss function 307.
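The two training stages may be sketched structurally as follows; this is a toy illustration in which the parameter dictionaries, learning rate, and `grad` placeholder (standing in for the true gradient of the adversarial loss 307 with respect to one parameter) are all assumptions:

```python
def train_round(gen_params, disc_params, grad, lr=0.1):
    """One iteration of two-stage adversarial training.

    Stage 1: discriminator parameters fixed, generator updated to
             minimize the adversarial loss (gradient descent).
    Stage 2: generator parameters fixed, discriminator updated to
             maximize the adversarial loss (gradient ascent).
    """
    # Stage 1: fix D, update G (descent: G minimizes Ladv).
    gen_params = {k: v - lr * grad("G", k) for k, v in gen_params.items()}
    # Stage 2: fix G, update D (ascent: D maximizes Ladv).
    disc_params = {k: v + lr * grad("D", k) for k, v in disc_params.items()}
    return gen_params, disc_params

g, d = train_round({"w": 1.0}, {"w": 1.0}, lambda role, name: 0.5)
print(round(g["w"], 2), round(d["w"], 2))  # 0.95 1.05
```

The opposite signs of the two updates encode the adversarial relationship: the generator descends on the loss while the discriminator ascends on it.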
- the training method further comprises: Step 31: determine whether the training termination condition is satisfied; if yes, terminate the training; if no, return to perform Step 21 or Step 23 (depending on whether Step 21 is located within the circulation).
- the termination conditions of the plurality of iterative training processes comprise: terminating the iterative training processes when the loss function value determined according to the adversarial loss function 307 is within a first predetermined threshold range; and/or, acquiring a pixel count distribution map of the mask probability for the two sets of images using the masks of the two sets of images generated by the generator 303, calculating the standard deviation of the pixel count distribution of the mask probability according to the pixel count distribution map, and terminating the iterative training processes when the difference between the standard deviations of the pixel count distributions of the mask probabilities of the two sets of images is within a second predetermined threshold range.
- either of the above two training termination conditions may be used as the discrimination criterion in Step 31, or the two termination conditions may be used simultaneously as the discrimination criteria in Step 31. In the latter case, the two termination conditions may be required to be met simultaneously, or the training may be terminated as soon as either of the two termination conditions is met.
- the iterative training processes may be terminated when the loss function value determined from the adversarial loss function 307 is less than 0.1; and/or the iterative training processes may be terminated when the standard deviation of the pixel count distribution of the mask probability of the two sets of images is less than the preset value (e.g., 0.1).
- the first predetermined threshold range and the second predetermined threshold range may both be adjusted according to actual needs, application scenarios, or prediction effects.
- based on the mask probabilities of the target objects in the two sample images generated by the generator 303 (as shown in Figs. 8 and 9), a relational graph between the mask probability of the target object in each sample image and the pixel count can be plotted.
- the horizontal axis in the two-dimensional coordinate system is the mask probability within 0-1, and the vertical axis may be the pixel count, or the horizontal axis refers to the pixel count, and the vertical axis refers to the mask probability.
- for the mask probability of the target object as shown in Fig. 8, a large number of pixels are distributed at a mask probability close to 1, while for the mask probability of the target object as shown in Fig. 9, the mask probabilities of the pixels of the target object may be diffused between 0 and 1.
- quantified indicators can be used to characterize this difference; for example, in some examples, the difference is characterized by the standard deviation of the pixel count distributions of the mask probabilities of the two sample images. In other examples, other metrics for measuring the difference in mask probability distribution may also be employed.
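The standard-deviation criterion over the pixel-count distribution might be computed as follows (a numpy sketch; the bin count, the two synthetic probability maps mimicking Figs. 8 and 9, and the threshold are assumptions):

```python
import numpy as np

def mask_prob_std(mask_prob, bins=10):
    """Standard deviation of the pixel-count distribution over mask
    probabilities in [0, 1] (the histogram described above)."""
    counts, _ = np.histogram(mask_prob, bins=bins, range=(0.0, 1.0))
    return counts.std()

# Unblocked-style map (Fig. 8): most pixels concentrated near probability 1.
concentrated = np.concatenate([np.full(90, 0.95), np.full(10, 0.05)])
# Blocked-style map (Fig. 9): probabilities diffused between 0 and 1.
diffused = np.linspace(0.0, 1.0, 100)

gap = abs(mask_prob_std(concentrated) - mask_prob_std(diffused))
second_threshold = 1.0  # assumed second predetermined threshold range
print(gap < second_threshold)  # False: distributions still differ -> keep training
```

As training progresses, the generator's blocked-image distribution concentrates near 1 like the unblocked one, the gap between the two standard deviations shrinks below the threshold, and the iteration terminates.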
- using the generator 303 to generate the masks of the two sample images, respectively, comprises: using the generator 303 to generate the masks of a plurality of target objects in at least one sample image (e.g., the second sample image) of the two sample images; each training process may further perform a step of filtering the generated masks of the plurality of target objects in the at least one sample image to obtain a mask of one target object in each sample image, and inputting the mask of the one target object of each sample image into the discriminator 305.
- the two sets of sample images comprise a set of virtual images having the partially blocked target objects, the virtual image forming the partially blocked target objects by constructing a relative location relationship of the blocked and unblocked initial target objects.
- when a plurality of the partially blocked target objects are present in the virtual image, the training method further comprises: obtaining a mask truth value, in an unblocked state, of one of the plurality of partially blocked target objects; the mask truth value may be automatically generated by the system.
- the generator 303 is used to generate the masks of the two sample images, respectively, comprising: generating the masks of the plurality of partially blocked target objects in the virtual image with the generator 303; and using the mask truth value of the acquired one of the partially blocked target objects in an unblocked state to filter the generated masks of the plurality of partially blocked target objects, obtaining the mask of the one partially blocked target object generated by the generator 303 and inputting the mask of the one partially blocked target object into the discriminator 305.
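One plausible implementation of this filtering step is to keep the generated mask that best overlaps the acquired mask truth value; the intersection-over-union criterion below is an assumption for illustration, not mandated by the disclosure:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def filter_masks(generated_masks, truth_mask):
    """Keep the one generated mask that best overlaps the mask truth
    value of the selected target object (in its unblocked state)."""
    return max(generated_masks, key=lambda m: iou(m, truth_mask))

truth = np.array([[1, 1, 0, 0],
                  [1, 1, 0, 0]])          # unblocked-state mask truth value
masks = [np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]]),        # mask of a different object
         np.array([[1, 1, 0, 0],
                   [1, 0, 0, 0]])]        # partially blocked version of `truth`
kept = filter_masks(masks, truth)
print(iou(kept, truth))  # 0.75: the matching object's mask is kept
```

Only the kept mask is then passed on to the discriminator 305, so each training round compares one object per sample image.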
- object detection of the two sample images in Step 21 comprises: generating a bounding box of the partially blocked target object in the virtual image according to the bounding box of the unblocked initial target object to acquire the annotated images of one set of virtual images.
- a mask may cover at least a portion of the unblocked initial target object to form the situation where the partially blocked target object is blocked, or the mask may cover a portion of the initial target object such that the remaining unblocked portions of the initial target object, truncated by the mask, are not in communication with one another.
- the training method further comprises: obtaining a bounding box of the unblocked initial target object, the bounding box of the unblocked initial target object being determined according to a mask truth value of the unblocked initial target object automatically generated by the system.
- object detection of the two sample images in Step 21 comprises: generating a binary mask of the partially blocked target object in the virtual image, and generating a bounding box of the partially blocked target object in the virtual image according to the generated binary mask, for example, where the unblocked area of the partially blocked target object in the virtual image is continuous, without affecting the detection of its bounding box.
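Deriving the bounding box from the binary mask, as described, can be done by taking the extremes of the mask's nonzero coordinates (a minimal sketch; the function name and coordinate convention are assumptions):

```python
import numpy as np

def bbox_from_binary_mask(mask):
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) of a binary
    mask; assumes the unblocked area forms a single continuous region."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

mask = np.array([[0, 0, 0, 0, 0],
                 [0, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0],
                 [0, 0, 0, 0, 0]])
print(bbox_from_binary_mask(mask))  # (1, 1, 3, 2)
```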
- each set of images retains a mask of the target object for training, which is conducive to improving the training efficiency of the generative adversarial network.
- using the virtual image as a training sample also facilitates accurate object detection of the partially blocked target object before generating the mask of the virtual image: a reliable bounding box can be obtained from the bounding box of the unblocked initial target object even when the mask covers at least a portion of that bounding box and the partially blocked target object could not otherwise be detected accurately; and, where the unblocked area of the partially blocked target object is continuous, its bounding box can be determined directly by detection.
- the other set of sample images of the two sets of sample images may also contain a plurality of unblocked target objects;
- generating the masks of the two sample images with the generator 303 in Step 25 further comprises: using the generator 303 to generate the masks of the plurality of unblocked target objects in the other set of sample images, using the mask truth value of one of the unblocked target objects to filter the generated masks of the plurality of unblocked target objects, obtaining the mask of the one unblocked target object generated by the generator 303, and inputting the mask of the one unblocked target object into the discriminator 305.
- selecting the mask of one of the target objects in each sample image generated by the generator 303 for training facilitates learning the distribution pattern of the mask probability of a single target object, so as to obtain an image mask generator for predicting different instances.
- the two sample images may comprise a real image having a partially blocked target object; performing object detection of the two sample images to obtain the annotated images of the two sample images, respectively, comprises: implementing object detection of the real image by automatic annotation and/or manual annotation to obtain an annotated image of the real image.
- the bounding boxes of the plurality of blocked target objects in the real image may not all be successfully recognized by automatic annotation; at this time, the detection success rate and reliability of the target objects may be improved by manual annotation.
- each set of the two sets of sample images comprises a plurality of sample images, each sample image comprising at least one target object area, each target object area comprising at least one target object; the plurality of iterative training processes comprise: in each iterative training process, selecting a sample image from each of the two sets of sample images and inputting it as a training sample into the generative adversarial network, traversing the plurality of sample images in each set of sample images through the plurality of iterative training processes; and/or, where each sample image comprises a plurality of target object regions, using different target object regions of the same sample image as training samples for the generative adversarial network during different iterative training processes, so as to traverse the different target object regions of the same sample image.
- the utilization of each sample image as a training sample is improved to provide more extensive training data.
- the target object of the first set of sample images is not blocked, comprising: One or more target objects present in each first sample image are in an unblocked state.
- the target object of the second set of sample images is in a partially blocked state, comprising: There is at least one partially blocked target object in each of the second sample images.
- each set of sample images provided should contain as many types of target objects as possible, including target objects having features of different shapes, sizes, categories, colors, numbers, and/or locations.
- the annotated images of the two sample images may be randomly selected from the two sets of sample images, respectively, and inputted into the generative adversarial network, such that the target objects in the two sample images have different blocked states.
- the features of the target objects contained in the selected two sample images may not be restricted; for example, the target objects contained in the randomly selected two sample images may have different features such as size, shape, category, color, number, location, etc.
- the generator 303 and the discriminator 305 may thus learn the difference between the masks of blocked and unblocked objects more from the distributions of the generated mask probabilities of the target objects in the two sample images, rather than simply or primarily learning the difference from feature information of the target objects such as shape, size, category, etc.
- an example of the present disclosure further provides an image instance segmentation method, comprising: receiving an image; and generating a mask of a target object in the received image with an image mask generator obtained by the training method according to the previous examples.
- the image instance segmentation method further comprises: implementing object detection of the received image to identify a category of a target object in the received image; outputting the mask and category of the target object with the image mask generator.
- the image instance segmentation method can obtain accurate instance segmentation results not only for images with unblocked objects, but also for images with blocked objects.
- accurate and reliable instance segmentation results can also be obtained via the image segmentation method, improving the performance of the instance segmentation method for image content understanding, such as accuracy and reliability, and expanding the application of the instance segmentation technique in the real world of presenting complex image contents.
- the examples of the present disclosure also provide for a computer program product comprising a computer program that, when executed by a processor, implements a training method according to the previous examples of the present disclosure or an image instance segmentation method according to the previous examples of the present disclosure.
- the examples of the present disclosure also provide for a computer device comprising a processor, a memory, and a computer program stored on the memory that when executed by the processor implements a training method according to the previous examples of the present disclosure or an image instance segmentation method according to the previous examples of the present disclosure.
- Embodiments of the present disclosure also provide for a computer-readable storage medium.
- the computer-readable storage medium may be stored with executable code that, when executed by a computer, causes the computer to implement a training method according to the previous examples of the present disclosure or to implement an image instance segmentation method according to the previous examples of the present disclosure.
- the computer-readable storage medium may include, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Static Random Access Memory (SRAM), hard disk, flash memory, and the like.
- the device structure described in the above examples may be a physical structure or a logical structure, i.e., some units may be implemented by the same physical entity, some units may be implemented by a plurality of physical entities, respectively, or units may be implemented collectively by certain components of a plurality of independent devices.
Abstract
The present disclosure provides a training method and an image instance segmentation method for an image mask generator, the training method comprising: selecting a sample image from each of two sets of sample images and inputting it into a generative adversarial network comprising a generator and a discriminator, each sample image comprising a target object, a target object of the first set of sample images among the two sets of sample images being unblocked, and a target object of the second set of sample images being partially blocked; using the generator to generate the masks of the two sample images, the mask of each sample image being used for predicting the target object of the sample image; inputting the generated masks of the two sample images into the discriminator, and constructing an adversarial loss function from the discrimination results of the discriminator for the generated masks of the two sample images; and updating the parameters of the generator based on the adversarial loss function to train the generator.
Description
- The present disclosure relates to the field of image recognition, in particular to a training method and an image instance segmentation method for an image mask generator, a computer program product, and a computer device.
- Image segmentation serves as a basis for computer vision, and has become a hotspot in the field of image understanding. Image segmentation generally involves different tasks such as target detection, semantic segmentation, instance segmentation, etc. Specifically, the deep learning-based instance segmentation method is being increasingly applied in the field of image understanding due to its high performance. Current instance segmentation methods based on conventional deep learning can obtain accurate instance segmentation results for unblocked image regions, but the instance segmentation results for blocked image regions are poor.
- However, blocking between objects is prevalent in the real world and is a major obstacle to improving the accuracy and effectiveness of current instance segmentation methods. Therefore, there is an urgent need for an improved image instance segmentation method generally suitable for blocked and unblocked image regions, improving the accuracy and reliability of the instance segmentation results of blocked image regions.
- The present disclosure provides a training method and an image instance segmentation method for an image mask generator, a computer program product, and a computer device to at least address some technical issues in the prior art.
- According to one aspect of the present disclosure, a training method for an image mask generator is provided, comprising: selecting and inputting a sample image from two sets of sample images to a generative adversarial network comprising a generator and a discriminator, each sample image comprising a target object, a target object of the first set of sample images among the two sets of sample images being unblocked, and a target object of the second set of sample images being partially blocked; using the generator to respectively generate a mask of the two sample images, the mask of each sample image being used to predict a target object in the sample image; inputting the generated masks of the two sample images to the discriminator, and constructing an adversarial loss function for the discrimination results of the generated masks of the two sample images based on the discriminator.
- Thus, the training samples used to train the generator comprise sample images with unblocked target objects and sample images with partially blocked target objects, and the generator can generate image masks for these two different categories of sample images. The discriminator determines the categories of the generated masks of the two sample images, and the adversarial loss function is constructed based on the determination results. By leveraging the dynamic game, or adversarial training, of the generator and the discriminator in the generative adversarial network (GAN), a robust generator can be obtained. Even when the trained generator is used for predicting the masks of partially blocked image regions, the predicted mask is very similar to a mask predicted for an unblocked image region, and the mask of the target object in the blocked region can be intelligently completed to a certain extent, thereby successfully fooling the discriminator or having a very low probability of being recognized by the discriminator, and thus improving the intelligence, accuracy, and reliability of the generator for instance segmentation of blocked image regions.
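The overall training flow just described can be summarized in skeleton form; every callable in this sketch (detector, mask generator, discriminator, loss, parameter update) is a hypothetical placeholder standing in for the corresponding component of the disclosure (object detector 300, generator 303, discriminator 305, loss 307):

```python
import random

def train_mask_generator(unblocked_set, blocked_set, rounds,
                         detect, generate_mask, discriminate,
                         adversarial_loss, update_generator):
    """Skeleton of the training method; all callables are placeholders."""
    for _ in range(rounds):
        # Select one sample image from each of the two sets.
        img1 = random.choice(unblocked_set)  # unblocked target objects
        img2 = random.choice(blocked_set)    # partially blocked objects
        # Object detection -> annotated images with bounding boxes.
        ann1, ann2 = detect(img1), detect(img2)
        # Generator: pixel-level mask probabilities for annotated objects.
        mask1, mask2 = generate_mask(ann1), generate_mask(ann2)
        # Discriminator judges the category of each generated mask.
        d1, d2 = discriminate(mask1), discriminate(mask2)
        # Adversarial loss built from the two discrimination results;
        # the generator's parameters are updated from it.
        update_generator(adversarial_loss(d1, d2))
```

The loop body mirrors the claimed steps in order: sample selection, detection, mask generation, discrimination, loss construction, and generator update.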
- Optionally, the adversarial loss function comprises a stack of a first loss item and a second loss item, wherein the first loss item is constructed based on a first discrimination result of the discriminator for a mask of the first sample image generated by the generator, the first sample image being taken from the first set of sample images; the second loss item is constructed based on a second discrimination result of the discriminator for a mask of the second sample image generated by the generator, the second sample image being taken from the second set of sample images.
- Optionally, the first discrimination result comprises: a probability, according to the estimation of the discriminator, that a mask of the first sample image generated by the generator is a mask of an image with unblocked target objects; the second discrimination result comprises: a probability, according to the estimation of the discriminator, that a mask of the second sample image generated by the generator is a mask of an image with at least partially blocked target objects.
- Thus, the adversarial loss function embodies both the probability of the discriminator discriminating the mask of the first set of images generated by the generator as that of an unblocked image and the probability of the discriminator discriminating the mask of the second set of images generated by the generator as that of an at least partially blocked image, thereby revealing the total loss of the discriminator, wherein the second loss item forms the confrontation item between the discriminator and the generator: during the training process, the discriminator and the generator oppose each other, one seeking to improve and the other to worsen the loss of this item.
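The two-item stack described above can be sketched in the standard GAN log-loss form. This is only one plausible reading under assumptions: the patent does not give an explicit formula, and the function name `adversarial_loss` and the negative-log form are illustrative.

```python
import math

def adversarial_loss(p_unblocked, p_blocked, eps=1e-7):
    """Hypothetical stack of the two loss items.

    p_unblocked: first discrimination result -- the discriminator's estimated
        probability that the generated mask of the first sample image is the
        mask of an image with unblocked target objects.
    p_blocked: second discrimination result -- the estimated probability that
        the generated mask of the second sample image is the mask of an image
        with at least partially blocked target objects.
    """
    # First loss item, built from the first discrimination result.
    first_item = -math.log(max(p_unblocked, eps))
    # Second loss item: the confrontation item. The discriminator trains to
    # reduce it (correctly recognize blocked-image masks), while the generator
    # trains to increase it (make its blocked-image masks pass as
    # unblocked-image masks).
    second_item = -math.log(max(p_blocked, eps))
    return first_item + second_item
```

When both discrimination results are confident and correct, the total loss approaches zero; as the generator fools the discriminator, the confrontation item grows.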
- Optionally, generating, with the generator, the masks of the two sample images respectively comprises: generating, with the generator, a pixel-level mask probability of a target object in the two sample images, respectively.
- As such, the training yields an image mask generator capable of pixel-level instance segmentation.
- Optionally, the training method further comprises: implementing object detection on the two sample images to acquire annotated images of the two sample images, each annotated image comprising an annotation result of a bounding box of a target object in the sample image. Inputting the two sample images into the generative adversarial network comprises: inputting the annotated images of the two sample images into the generative adversarial network; and using the generator to generate the masks of the two sample images respectively comprises: generating, with the generator, the masks of the target objects having the annotation results in the two sample images, respectively.
- As such, based on the target objects identified by object detection, the annotated images carrying the annotation results of the bounding boxes of the target objects are used as the training samples, facilitating the generator in generating a mask for the image region containing the target object and providing effective training samples.
- Optionally, the training method comprises a plurality of iterative training processes that repeat the training steps of the generator for the training purposes of reducing the mask probability distribution difference of a target object in the two sample images generated by the generator and/or enhancing the capability of the discriminator to differentiate between the mask categories of the two sample images generated by the generator.
- Optionally, generating, with the generator, a mask for the two sample images, respectively, comprising: generating, with the generator, a mask of a plurality of target objects in at least one of the two sample images, inputting the generated masks of the two sample images into the discriminator during each training process, comprising: filtering the generated masks of the plurality of target objects in the at least one sample image to obtain the mask of one target object in each sample image, inputting the mask of one target object in each sample image into the discriminator, and constructing the adversarial loss function based on the discrimination result of the discriminator for the generated mask of one target object in each sample image.
- As such, during a single training process, selecting a mask of one target object in each sample image generated by the generator for training, facilitating the learning of the distribution pattern of the mask of a single target object, and generating an image mask generator for predicting different instances.
- Optionally, determining whether the training termination condition is satisfied; if yes, terminating the training; if no, repeating the training steps of the generator. The training termination condition comprises: terminating the iterative training processes when a loss function value determined by the adversarial loss function is within the first predetermined threshold range; and/or, obtaining a pixel count distribution map of the mask probability of the target object in the two sample images, calculating the standard deviation of a pixel count distribution of the mask probability according to the pixel count distribution map of the mask probability, and terminating the iterative training processes when a difference of standard deviation of the pixel count distribution of the mask probability of the target object in the two sample images is within the second predetermined threshold range.
- As such, a robust generator is trained, achieving the dual training purposes of reducing the mask probability distribution difference of a target object in the two sample images generated by the generator and enhancing the capability of the discriminator to differentiate between the mask categories of the two sample images generated by the generator, thus reaching the Nash equilibrium.
- Optionally, each set of sample images in the two sets of sample images comprises a plurality of sample images, each sample image comprising at least one target object region, each target object region comprising at least one target object. The plurality of iterative training processes comprises: during each iterative training process, selecting one sample image from each of the two sets of sample images and inputting it into the generative adversarial network as a training sample, so that the plurality of sample images in each set is traversed over the plurality of iterative training processes; and/or, where each sample image comprises a plurality of target object regions, using different target object regions of the same sample image as training samples during different iterative training processes, so that the different target object regions of the same sample image are traversed.
- As such, the utilization of each image as a training sample is improved, providing more extensive training data.
- Optionally, the second set of sample images comprises a virtual image in which a partially blocked target object is formed by constructing a relative location relationship between a blocking object and an unblocked initial target object.
- Optionally, there are a plurality of the partially blocked target objects in the virtual image, the training method further comprising: obtaining a mask truth value of the unblocked initial target object corresponding to one partially blocked target object among the plurality of partially blocked target objects; generating, with the generator, the masks of the two sample images respectively comprises: generating, with the generator, the masks of the plurality of partially blocked target objects in the virtual image; and filtering the generated masks of the plurality of partially blocked target objects using the acquired mask truth value of the corresponding unblocked initial target object, to acquire the mask of one partially blocked target object generated by the generator.
- Thus, using the virtual image as a training sample to obtain the mask truth value of the partially blocked target object facilitates the filtering of the generated masks of the plurality of target objects with the mask truth value of one of the target objects when the generator generates the masks of the plurality of partially blocked target objects. During a single training process, each set of images retains a mask of the target object for training, facilitating the learning of the distribution pattern of mask probability of a single target object, and generating an image mask generator for predicting different instances.
- Optionally, implementing object detection on the two sample images comprises: generating a bounding box of the partially blocked target object in the virtual image according to a bounding box of the unblocked initial target object, to obtain the annotated image of the virtual image; or, generating a binary mask of the partially blocked target object in the virtual image, and generating a bounding box of the partially blocked target object in the virtual image according to the generated binary mask.
- Thus, using the virtual image as a training sample facilitates accurate object detection of the partially blocked target object before the mask of the virtual image is generated: a reliable bounding box can be derived from the bounding box of the unblocked initial target object, which is applicable when the blocking covers at least a portion of that bounding box and the partially blocked target object cannot itself be accurately detected; alternatively, detection may be based on the unblocked area of the partially blocked target object, with the bounding box determined from the binary mask of the partially blocked object.
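Deriving a bounding box from the binary mask, as described above, reduces to taking the extremes of the mask's nonzero coordinates. A minimal sketch follows; the `(x_min, y_min, x_max, y_max)` corner convention is an assumption, as the patent does not fix a box format.

```python
import numpy as np

def bbox_from_binary_mask(mask):
    """Return the axis-aligned bounding box (x_min, y_min, x_max, y_max)
    of a binary mask, or None if the mask is empty."""
    ys, xs = np.nonzero(mask)  # row/column indices of mask pixels
    if xs.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```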
- Optionally, the two sample images comprise one real image from the second set of sample images, and implementing object detection on the two sample images to acquire the annotated images of the two sample images respectively comprises: implementing object detection on the one real image by automatic annotation and/or manual annotation to obtain the annotated image of the one real image.
- As such, the training samples of the image mask generator are not limited to virtual images, but real images can also be used, for which the object detection is implemented by manual annotation or a combination of manual annotation and automatic annotation, thereby improving the accuracy of the annotation results and enhancing the training efficiency of the generative adversarial network.
- Optionally, the annotated image of each sample image further comprises an annotated result of a category of a target object in the sample image, the training method further comprising: generating, with the generator, a category of a target object in the two sample images, respectively.
- As such, the trained image mask generator is capable of outputting not only a pixel-level target object mask, but also a target object category for use in image instance segmentation.
- In another aspect, the examples of the present disclosure also provide an image instance segmentation method, comprising: implementing the object detection of a received image to identify a bounding box of a target object in the received image; using the image mask generator to generate a mask of the target object based on the bounding box, wherein the image mask generator is acquired using a training method according to the examples of the present disclosure.
- Optionally, the image instance segmentation method further comprises: implementing object detection of the received image to identify the category of a target object in the received image; outputting the mask and category of the target object with the help of the image mask generator.
- Thus, the image instance segmentation method according to the examples of the present disclosure can obtain accurate instance segmentation results not only for images with unblocked objects, but also for images with blocked objects. The pre-trained image mask generator is adopted to obtain accurate and reliable instance segmentation results, enhancing the instance segmentation method's understanding of image contents, particularly its accuracy and reliability, and expanding the application of instance segmentation technology in the real world, which presents complex image contents.
- In another aspect, the examples of the present disclosure also provide a computer program product comprising a computer program that, when executed by a processor, implements a training method of an image mask generator according to the examples of the present disclosure or an image instance segmentation method according to the examples of the present disclosure.
- In another aspect, the examples of the present disclosure also provide for a computer-readable storage medium having executable code stored, which when executed implements a training method of an image mask generator according to the examples of the present disclosure or an image instance segmentation method according to the examples of the present disclosure.
- In another aspect, the examples of the present disclosure also provide a computer device comprising a processor, a memory, and a computer program stored on the memory that when executed by a processor implements a training method for an image mask generator according to the examples of the present disclosure or an image instance segmentation method according to the examples of the present disclosure.
- The principles, features, and advantages of the present disclosure may be better understood below by describing the present disclosure in more detail with reference to the appended drawings. The drawings include:
-
Fig. 1 shows a flow diagram of a training method for an image mask generator according to an example of the present disclosure; -
Fig. 2 shows a flow diagram of a training method for an image mask generator according to another example of the present disclosure; -
Fig. 3 shows a flow diagram of an instance segmentation model and a processing process of a generative adversarial network for a first sample image and a second sample image, according to an example of the present disclosure; -
Fig. 4 shows a schematic view of a first sample image according to an example of the present disclosure; -
Fig. 5 shows a schematic view of a second sample image according to an example of the present disclosure; -
Fig. 6 shows a schematic view of an annotated image of a first sample image according to the example shown in Fig. 4; -
Fig. 7 shows a schematic view of an annotated image of a second sample image according to the example shown in Fig. 5; -
Fig. 8 shows a schematic view of a mask of a first sample image generated by a generator during the training process; -
Fig. 9 shows a schematic view of a mask of a second sample image generated by a generator during the training process; -
Fig. 10 shows a flow diagram of an image instance segmentation method, according to an example of the present disclosure. - In order to make the above purposes, features and beneficial effects of the present disclosure more apparent and easier to understand, the specific examples of the present disclosure are described in detail below in conjunction with the appended drawings. The various examples in this description are described in a progressive manner, each example focusing on aspects different from the other examples; for the same or similar parts of the examples, reference may be made to one another.
- It should be understood that the expression "first", "second", etc., is for descriptive purposes only and is not to be understood as indicating or implying relative importance, nor is it to be understood as implying an indication of the number of indicated technical features. A feature defined as "first" or "second" may expressly or implicitly represent including at least one of the features.
- Referring to
Fig. 1, Fig. 1 shows a flow diagram of a training method for an image mask generator, according to an example of the present disclosure. - In some examples, the training method for the image mask generator comprises:
- Step 11: selecting and inputting a sample image from each of the two sets of sample images into the generative adversarial network comprising a generator and a discriminator, each sample image comprising a target object, the target object of each sample image of the first set of sample images among the two sets of sample images being unblocked, and the target object of each sample image of the second set of sample images being blocked;
- Step 13: generating, with the generator, the masks of the two sample images, the mask of each sample image being used to predict a target object in the sample image;
- Step 15: inputting the generated masks of the two sample images into the discriminator and constructing an adversarial loss function according to the discriminator's discrimination result of the generated mask of the two sample images;
- Step 17: updating the parameters of the generator according to the adversarial loss function to train the generator.
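Steps 11 through 17 can be sketched as a single training loop. In this sketch, `generator`, `discriminator` and `update` are placeholders for the actual networks and optimizer step (none of which the patent specifies), and the log-form loss is only one plausible instantiation of the adversarial loss.

```python
import math
import random

def train_generator(generator, discriminator, first_set, second_set,
                    steps, update, eps=1e-7):
    """Sketch of one reading of training steps 11-17."""
    losses = []
    for _ in range(steps):
        img1 = random.choice(first_set)        # Step 11: one sample image
        img2 = random.choice(second_set)       #          from each set
        mask1 = generator(img1)                # Step 13: generate both masks
        mask2 = generator(img2)
        p_unblocked = discriminator(mask1)     # Step 15: discrimination
        p_blocked = discriminator(mask2)       #          results -> loss
        loss = (-math.log(max(p_unblocked, eps))
                - math.log(max(p_blocked, eps)))
        update(loss)                           # Step 17: update generator
        losses.append(loss)
    return losses
```

In a real implementation the discriminator would be updated in alternation with the generator; the loop above shows only the generator side named in Steps 11-17.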
- In some examples of the present disclosure, in
Step 13, the masks of the two sample images are generated with the generator, respectively, comprising: generating, with the generator, a pixel-level mask probability of a target object in the two sample images, respectively. - In some examples of the present disclosure, the adversarial loss function comprises the stack of a first loss item and a second loss item, wherein the first loss item is constructed based on a first discrimination result of the discriminator for a mask of a first sample image generated by the generator, the first sample image being taken from the first set of sample images; and the second loss item is constructed based on a second discrimination result of the discriminator for a mask of a second sample image generated by the generator, the second sample image being taken from the second set of sample images.
- In some examples of the present disclosure, the first discrimination result comprises: the discriminator estimates a probability that a mask of a first sample image generated by the generator is a mask of an image with unblocked target objects; the second discrimination result comprises: the discriminator estimates a probability that a mask of a second sample image generated by the generator is a mask of an image with at least partially blocked target objects.
- In some examples of the present disclosure, the training method further comprises a plurality of iterative training processes that achieve a Nash equilibrium for training purposes by reducing the difference in mask probability distribution of a target object in the two sample images generated by the generator and enhancing the capability of the discriminator to differentiate between the mask categories of the two sample images generated by the generator, repeating the training steps 11, 13, 15 and 17 on the generator; at the beginning of each training process, different sample images are selected from the two sets of sample images and inputted into the generative adversarial network. The mask category comprises a mask category for an image with unblocked target objects or a mask category for an image with at least partially blocked target objects.
- In some examples of the present disclosure, the training method may further comprise constructing the generative adversarial network.
- In some examples of the present disclosure, the training method may further comprise
Step 19 to determine whether the training termination condition is satisfied; if yes, proceed to Step 191 to terminate the training; if no, proceed to Step 11 to repeat the training steps 11, 13, 15 and 17 on the generator. - In some examples of the present disclosure, the training termination condition comprises: terminating the iterative training processes when the loss function value determined according to the adversarial loss function is within the first predetermined threshold range; and/or, obtaining a pixel count distribution map of the mask probability of the target object in the two sample images, calculating the standard deviation of a pixel count distribution of the mask probability according to the pixel count distribution map of the mask probability, and terminating the iterative training processes when the difference in the standard deviation of the pixel count distribution of the mask probability of the target object in the two sample images is within the second predetermined threshold range.
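The second termination condition above can be sketched as follows. The bin count and the exact form of the "pixel count distribution map" (read here as a histogram of mask-probability values) are assumptions, as the patent does not fix them, and the function names are illustrative.

```python
import numpy as np

def pixel_count_distribution_std(mask_prob, bins=10):
    """Standard deviation of the pixel-count distribution over
    mask-probability bins (the 'pixel count distribution map')."""
    counts, _ = np.histogram(mask_prob, bins=bins, range=(0.0, 1.0))
    return counts.std()

def second_termination_met(mask_prob_1, mask_prob_2, threshold, bins=10):
    """Terminate when the standard deviations for the two sample images
    differ by no more than the (hypothetical) second predetermined
    threshold."""
    diff = abs(pixel_count_distribution_std(mask_prob_1, bins)
               - pixel_count_distribution_std(mask_prob_2, bins))
    return diff <= threshold
```

Intuitively, a well-trained generator concentrates the blocked image's mask probability near 1 just as it does for the unblocked image, so the two histograms — and their standard deviations — converge.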
- Any of the above two training termination conditions may be used as the discrimination criteria in
Step 19, or both termination conditions may be used as the discrimination criteria in Step 19. In the latter case, the two termination conditions may be required to be met simultaneously, or meeting either of the two may suffice for the training to be terminated. - In some examples of the present disclosure, each set of sample images may comprise a plurality of sample images, and each sample image may comprise a plurality of objects, wherein a target object is the object whose mask is to be predicted in each sample image. That the target objects in the first set of sample images are unblocked means that one or more target objects in each sample image of the first set are in an unblocked state; that the target objects in the second set of sample images are partially blocked means that there is at least one partially blocked target object in each sample image of the second set. In order to train a robust mask generator, each of the two sets of sample images provided should contain as many categories of target objects as possible, comprising target objects of different features in each set of sample images, with different shapes, different sizes, different categories, different colors, different numbers, and/or different locations.
- Accordingly, in
Step 11, the two sample images may be randomly selected and inputted into the generative adversarial network, namely one from each of the two sets of sample images, such that the target objects in the two sample images have different blocked states. Whether the target objects included in the two selected sample images have the same or similar features may be left undefined; for example, the target objects included in the two randomly selected sample images may differ in features such as size, shape, category, color, number, and location. By randomly selecting and inputting the two sample images into the generative adversarial network, the generator and the discriminator may learn the difference between masks of blocked objects and unblocked objects from the distribution of the generated mask probability of the target objects in the two sample images, rather than merely or primarily learning that difference from feature information such as shape, size, or category. - In the training method for the image mask generator according to the examples of the present disclosure, the training samples for training the generator comprise the sample images with unblocked target objects and the sample images with partially blocked target objects. The generator can generate masks for images with unblocked objects and masks for images with blocked objects; the discriminator discriminates the generated masks of the two sample images, and the adversarial loss function is constructed based on the discrimination results.
By leveraging the dynamic game, or confrontational training, of the generator and the discriminator in the generative adversarial network (GAN), a robust generator can be easily obtained. Even when the trained generator is used for predicting the masks of partially blocked image regions, the predicted mask closely resembles the mask predicted for an unblocked image region, and the blocked portion of the target object's mask can be intelligently filled in to a certain extent, thereby successfully fooling the discriminator or having a very low probability of being recognized by the discriminator, and thus improving the intelligence, accuracy and reliability of the generator for instance segmentation of blocked image regions.
- Referring to
Fig. 2, Fig. 2 shows a flow diagram of a training method for an image mask generator according to another example of the present disclosure, and Fig. 3 shows a flow diagram of processing a first sample image and a second sample image utilizing an instance segmentation model and a generative adversarial network according to an example of the present disclosure. - In some examples of the present disclosure, the training method comprises:
- Step 21: object detection is performed on two sample images to obtain annotated images of the two sample images, respectively. The annotated image of each sample image comprises an annotation result of a bounding box of a target object in that sample image; the target object in one of the two sample images is in an unblocked state, while the target object in the other sample image is in a partially blocked state.
- Step 23: inputting the annotated image of the two sample images into a generative adversarial network comprising a
generator 303 and a discriminator 305; - Step 25: generating, with the
generator 303, a mask of a target object having the annotation result in the two sample images according to the annotated images of the two sample images; - Step 27: inputting the generated masks of the two sample images into the
discriminator 305 to construct an adversarial loss function 307 based on the discrimination result of the discriminator 305 for the generated masks of the two sample images; - Step 29: updating the parameters of the
generator 303 according to the adversarial loss function 307 to train the generator 303. - In some examples of the present disclosure, in
Step 21, the two sample images are taken from two sets of sample images, respectively; the two sets of sample images comprise a first set of sample images and a second set of sample images; the first set of sample images comprises a plurality of first sample images, one or more target objects of each of the first sample images being in an unblocked state; the second set of sample images comprises a plurality of second sample images, at least one target object of each of the second sample images being in a partially blocked state. - As shown in
Figs. 4-5, schematic views of a first sample image 100 and a second sample image 200 according to an example of the present disclosure are shown, respectively. In this example, there is one unblocked target object 101 in the first sample image 100 and at least one partially blocked target object 201 in the second sample image 200. - As shown in
Fig. 3, in some examples of the present disclosure, the object detection of the two sample images in Step 21 comprises: object detection of the first sample image 100 and the second sample image 200 is performed using the same object detector 300, which is a pre-trained model with fixed parameters (e.g., weights). To show more clearly that the first sample image 100 and the second sample image 200 are each processed by the object detector 300 to output their respective annotated images, Fig. 3 illustrates two object detectors 300 drawn with dashed lines, but the two object detectors 300 are essentially the same object detector 300. The bounding box may be a two-dimensional bounding box. Figs. 6 and 7 illustrate the annotated image 110 of the first sample image and the annotated image 210 of the second sample image, respectively; a bounding box 111 of the unblocked target object 101 is given in Fig. 6, and Fig. 7 further illustrates a bounding box 211 of the partially blocked target object 201, wherein the second sample image 200 comprises a plurality of target objects, with only the annotation result of the bounding box 211 of one target object 201 being shown. - In some examples of the present disclosure, in
Step 21, the annotated image of each sample image obtained by object detection further comprises an annotation result for a category of a target object in the sample image. The training method further comprises: using the generator 303 to generate the categories of the target objects having the annotation results in the two sample images. The category information of these target objects may be outputted in Step 25 together with their masks, so that the generator 303 obtained for predicting the image mask outputs not only pixel-level target object masks, but also target object categories for image instance segmentation.
generator 303 to generate a mask for the image region containing the target objects in the sample image, providing valid training samples. Further, the instance segmentation model shown inFig. 3 comprises anobject detector 300 and agenerator 303, thegenerator 303 being specifically amask generator 303; here, apre-trained object detector 300 is adopted to train themask generator 303 with the generative adversarial network (GAN). - In some examples of the present disclosure,
Step 23 further comprises constructing the generative adversarial network (GAN). The GAN in the examples of the present disclosure differs from existing GANs in its training scene and training purpose; the purposes and uses of existing GANs for training will not be repeated here. The generator of the GAN according to the examples of the present disclosure is used to make predictions of image masks, and the two inputs of the discriminator of the GAN come from the two outputs of the generator 303; the two input ends of the discriminator 305 are used to receive the masks of the two sample images generated by the generator 303, the two sample images having different blocked states of objects, thereby accounting for object blocking of images in the training process of the GAN. - The GAN shown in
Fig. 3 comprises a generator 303 and a discriminator 305, i.e., the generator 303 in the instance segmentation model is trained concurrently with the training of the GAN, because the GAN and the instance segmentation model share the generator 303; it may also be considered as adding a discriminator 305 for adversarial training on the generator 303 in the instance segmentation model. - Referring further to
Fig. 2, in conjunction with Figs. 8 and 9, in some examples of the present disclosure, the masks of the target objects having the annotation results in the two sample images are generated with the generator 303 in Step 25, comprising: generating, with the generator 303, a pixel-level mask probability for the target objects having the annotation results in the two sample images, for example, generating, with the generator 303, for each sample image, a pixel-level mask probability within a target region of the target object having the annotation result, the mask probability being greater than or equal to 0 and less than or equal to 1. The target region may be the image region occupied solely by the target object, or may be the image region defined by the bounding box of the target object. Figs. 8 and 9 respectively illustrate a pixel-level mask probability within the image regions defined by the bounding boxes 113 and 213 of the masks. Because the target object 101 in the first sample image 100 is not blocked, the bounding box 113 of the mask of the target object 101 shown in Fig. 8 is clear and the mask probability of the target object 101 is evenly distributed, i.e., the mask probability values of the pixels occupied by the target object 101 within the mask's bounding box 113 are all approximately 1; in contrast, because the target object 201 in the second sample image 200 is blocked, the bounding box 213 of its mask in Fig. 9 is obscured and the mask probability of the target object 201 is unevenly distributed within its target region. The pixel-level mask probability of the target object in Fig. 8 and Fig. 9 may be reflected by the gray scale or color level of the pixels of the target object region; e.g., the pixel-level mask probability may be directly proportional to the gray scale value or color level of the pixel. Fig. 8 and Fig. 9 are shown in gray scale, and the mask probability generated in actual applications may be represented in different colors.
- When the training of the
generator 303 is complete, binarization or thresholding may be applied to the pixel-level image mask probability generated by the generator 303 to acquire a binary mask of an object instance for instance segmentation, so the accuracy of the instance segmentation results depends on the accuracy of the pixel-level image mask probability generated by the generator 303. For instances where there is object blocking in the corresponding image, with untrained image mask generators, there is an uneven distribution of mask probability for the target object: the mask probability values of different pixels may be distributed between 0 and 1, e.g., the mask predictions shown in Fig. 9 assign a higher mask probability to the two end areas of the target object 201 (a relatively bright area within the bounding box 213 of the mask in Fig. 9) and a lower mask probability to the central area of the target object 201 (a relatively dark area within the bounding box 213 of the mask in Fig. 9). If binarization is performed directly on such a mask probability distribution generated by an untrained generator at the initial stage, the mask located in the central area may be filtered out because its probability values are below the threshold, and only the mask portions located in the two end areas are retained, resulting in an inability to fully and correctly segment the instance. Further, because the bounding boxes of the masks of the target objects generated by the generator are unclear at this time, the mask probabilities of a plurality of target objects may be coupled at the boundary into a continuous binary mask after the binary processing, resulting in the misidentification of the plurality of objects as a single instance.
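This failure mode can be sketched numerically; the probability values below are hypothetical and chosen only to illustrate the effect of directly thresholding an uneven mask probability profile (bright at the two ends of the object, dark in the blocked center):

```python
def binarize(probs, threshold=0.5):
    # Direct binarization of a pixel-level mask probability profile.
    return [1 if p >= threshold else 0 for p in probs]

def count_runs(binary):
    # Number of connected runs of mask pixels along a 1-D profile; each run
    # would be segmented as a separate instance fragment.
    runs, prev = 0, 0
    for v in binary:
        if v and not prev:
            runs += 1
        prev = v
    return runs

# Uneven profile of one blocked object: bright ends, dark (blocked) center.
blocked = [0.9, 0.8, 0.2, 0.3, 0.8, 0.9]
# Even profile of an unblocked object.
unblocked = [0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
```

Here `count_runs(binarize(blocked))` gives 2 fragments for a single object, while `count_runs(binarize(unblocked))` gives 1, matching the description of an instance that cannot be fully and correctly segmented.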
To achieve the training purposes of reducing the mask probability distribution difference of a target object in the two sample images generated by the generator 303 and of enhancing the capability of the discriminator 305 to differentiate the mask categories of the two sample images generated by the generator 303, so as to reach the Nash equilibrium, the generator 303 is iterated so that it can be used, for images with blocked objects, to predict masks as if the objects were unblocked, improving the performance of the generator 303. The mask category comprises a mask category for images with unblocked target objects or a mask category for images with at least partially blocked target objects. - It should be noted that, in the field of image processing, the concept of "blocking" (or masking) draws upon the concept of a "mask" in semiconductor manufacturing. In particular, the image to be processed may be partially or fully blocked (or understood to be covered) with a selected graphic or the like to control the area of the image processing. The graphic used for coverage or masking may be referred to as a mask. The mask may generally be used to extract areas of interest in the image or to shield certain areas in the image or the like. In the examples of the present disclosure, the mask of the image may be a mask corresponding to a foreground object in an image frame to predict an area corresponding to the foreground object in the image frame, the mask probability comprising an instance mask probability.
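The adversarial objective behind these training purposes follows the standard GAN formulation, Ladv = Ex[log(D(x))] + Ez[log(1 - D(G(z)))], consistent with the terms Ex, D(x), Ez, and D(G(z)) discussed in this description. A minimal sketch, where `d_real` stands for the discriminator's estimates D(x) on masks of unblocked objects and `d_fake` for its estimates D(G(z)) on masks of partially blocked objects (both names are illustrative):

```python
import math

def adversarial_loss(d_real, d_fake):
    # L_adv = E_x[log(D(x))] + E_z[log(1 - D(G(z)))].
    # d_real: discriminator outputs D(x) for masks of unblocked objects.
    # d_fake: discriminator outputs D(G(z)) for masks of partially blocked
    # objects. The discriminator is trained to maximize this value and the
    # generator to minimize it, driving the two toward the Nash equilibrium.
    e_x = sum(math.log(p) for p in d_real) / len(d_real)
    e_z = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return e_x + e_z
```

A discriminator that confidently separates the two mask categories (D(x) near 1, D(G(z)) near 0) pushes the loss toward its maximum of 0, while the generator counteracts this by making G(z) resemble x.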
- In some examples of the present disclosure, the adversarial loss function 307 constructed in
Step 27 comprises a stack of a first loss item and a second loss item, wherein the first loss item is constructed based on a first discrimination result of the discriminator 305 for a mask of the first sample image 100 generated by the generator 303, the first sample image 100 being taken from a first set of sample images, i.e., the target objects of the first sample image 100 are unblocked; the second loss item is constructed based on a second discrimination result of the discriminator 305 for a mask of the second sample image 200 generated by the generator 303, the second sample image 200 being taken from a second set of sample images, i.e., the target objects of the second sample image 200 are partially blocked. - In some examples of the present disclosure, the first discrimination result comprises: The
discriminator 305 estimates a probability that the mask of the first sample image 100 generated by the generator 303 is a mask of an image with unblocked target objects; the second discrimination result comprises: The discriminator 305 estimates a probability that the mask of the second sample image 200 generated by the generator 303 is a mask of an image with at least partially blocked target objects. - In some examples of the present disclosure, when the
discriminator 305 determines that the mask of the first sample image 100 generated by the generator 303 is the mask of an image with unblocked target objects, the first discrimination result is 1. When the discriminator 305 determines that the mask of the second sample image 200 generated by the generator 303 is the mask of an image with at least partially blocked target objects, the second discrimination result is 1; for the discriminator 305, the training purpose is that the larger the sum of the first discrimination result and the second discrimination result, the better, while for the generator 303, the training purpose is that the smaller the second discrimination result, the better. - As such, the
adversarial loss function 307 reflects both the probability of the discriminator 305 determining the masks of the first set of images generated by the generator 303 to be masks of images with unblocked target objects and the probability of the discriminator 305 determining the masks of the second set of images generated by the generator 303 to be masks of images with at least partially blocked target objects, thereby embodying the total loss of the discriminator 305, wherein the second loss item forms the adversarial item between the discriminator 305 and the generator 303; during the training process, the discriminator 305 and the generator 303 oppose each other, respectively increasing and reducing the loss of this item. - In particular, the adversarial loss function Ladv may be defined as: Ladv = Ex[log(D(x))] + Ez[log(1 - D(G(z)))], wherein x is the mask of the (unblocked) target object with an annotation result from the first sample image 100 generated with the generator 303 in Step 25; specifically, x can be the pixel-level mask probability of the target object 101 from the first sample image 100 generated by the generator 303; D(x) is the probability that the mask x of the target object from the first sample image 100 generated by the generator 303 is, according to the estimation of the discriminator 305, a mask of an image with unblocked target objects; Ex is the expectancy value of the mask discrimination loss function log(D(x)) over all unblocked target objects. G(z) is the mask of the (partially blocked) target object with an annotation result from the second sample image 200 generated by the generator 303 in Step 25; specifically, G(z) can be the pixel-level mask probability of the target object 201 from the second sample image 200 generated by the generator 303; D(G(z)) is the probability that the mask G(z) of the target object from the second sample image 200 generated by the generator 303 is, according to the estimation of the discriminator 305, a mask of an image with unblocked target objects; Ez is the expectancy value of the mask discrimination loss function log(1 - D(G(z))) over all partially blocked target objects. - During training, the
generator 303, i.e., the G(.) item, attempts to minimize the value of the adversarial loss function Ladv, while the discriminator 305, i.e., the D(.) item, attempts to maximize the value of the adversarial loss function Ladv, forming adversarial training. The adversarial loss function Ladv takes into account blocking factors between objects and is a loss function related to blocking. Through adversarial training, G(z) becomes more similar to x. Because x is the mask probability predicted by the generator 303 for an image whose target objects are not blocked, the mask x has a high quality. The GAN-based training method achieves the training purposes of reducing the mask probability distribution difference of the target object in the two sample images generated by the generator 303 and enhancing the capability of the discriminator 305 to differentiate the mask categories of the two sample images generated by the generator 303 until the Nash equilibrium is reached. Thus, G(z) will have a higher quality close to the mask x after the training; also, an image mask generator 303 with higher performance is obtained, which can generate accurate and reliable instance segmentation results even for images with partially blocked target objects. The mask category comprises a mask category that belongs to an image with unblocked target objects or a mask category that belongs to an image with at least partially blocked target objects. In some examples, the training method further comprises a plurality of iterative training processes, i.e., repeating the steps of training the generator 303 in the generative adversarial network, with different sample images selected from the two sets of sample images as inputs at the start of each training process. - In some examples,
Step 21 may be one step within the circulation of the iterative training process, performed before Step 23. In some other examples, Step 21 may be a step outside of the circulation of the iterative training processes, i.e., after object detection is performed on each of the two sets of sample images in Step 21, the loop comprising the subsequent steps is executed repeatedly. - In some examples, the training method further comprises: updating the parameters of the
discriminator 305 according to the adversarial loss function 307. The parameters of the generator 303 and the discriminator 305 may be updated simultaneously upon completion of a single training or at different training stages. For example, the parameters of the discriminator 305 may be fixed first in the first training stage, in which the parameters of the generator 303 are updated according to the adversarial loss function 307; the parameters of the generator 303 are then fixed in the second training stage, in which the parameters of the discriminator 305 are updated according to the adversarial loss function 307. - In some examples, the training method further comprises: Step 31: determine whether the training termination condition is satisfied; if yes, terminate the training; if no, return to perform
Step 21 or Step 23 (depending on whether Step 21 is located within the circulation). - In some examples, the termination conditions of the plurality of iterative training processes comprise: terminating the iterative training processes when the loss function value determined according to the
adversarial loss function 307 is within the first predetermined threshold range; and/or, acquiring a pixel count distribution map for the mask probability of the two sets of images using the masks of the two sets of images generated by the generator 303, calculating the standard deviation of the pixel count distribution for the mask probability according to the pixel count distribution map for the mask probability, and terminating the iterative training processes when the difference of the standard deviations of the pixel count distributions of the mask probability of the two sets of images is within the second predetermined threshold range. Either of the above two training termination conditions may be used as the discrimination criterion in Step 31, or the two termination conditions may be used simultaneously as the discrimination criteria in Step 31. In the latter case, the two termination conditions may be required to be met simultaneously, or the training may be terminated as soon as either of the two termination conditions is met. - Specifically, the iterative training processes may be terminated when the loss function value determined from the
adversarial loss function 307 is less than 0.1; and/or the iterative training processes may be terminated when the difference of the standard deviations of the pixel count distributions of the mask probability of the two sets of images is less than a preset value (e.g., 0.1). The first predetermined threshold range and the second predetermined threshold range may both be adjusted according to actual needs, application scenarios, or prediction effects. - In some examples, the relational graph between the mask probability of the target object in each sample image and the pixel count can be plotted based on the mask probability of the target object in the two sample images generated by the generator 303 (as shown in
Figs. 8 and 9). For example, the horizontal axis in the two-dimensional coordinate system may be the mask probability within 0-1 and the vertical axis may be the pixel count, or the horizontal axis may refer to the pixel count and the vertical axis to the mask probability. For the mask probability of the target object as shown in Fig. 8, a large number of pixels are distributed at a mask probability close to 1, while for the mask probability of the target object as shown in Fig. 9, the mask probabilities of the pixels of the target object may be diffused between 0 and 1. This makes the mask probability distributions of the target object in the two sample images differ greatly. In order to measure this difference, quantified indicators can be used for characterization; for example, in some examples, the difference is characterized by the standard deviation of the pixel count distribution of the mask probabilities of the two sample images. In other examples, other metrics for measuring the difference in mask probability distribution may also be employed. - Based on the training process with the above termination conditions, by achieving the training purposes of reducing the mask probability distribution difference of target objects in the two sample images generated by the
generator 303 and enhancing the capability of the discriminator 305 to discriminate the mask categories of the two sample images generated by the generator 303 for the Nash equilibrium, a generator 303 with robustness can be obtained. - In some examples, in
Step 25, the generator 303 is used to generate the masks of the two sample images, respectively, comprising: using the generator 303 to generate the masks of a plurality of target objects in at least one sample image (e.g., the second sample image) of the two sample images; each training process may further perform a step of filtering the generated masks of the plurality of target objects in the at least one sample image to obtain a mask of one target object in each sample image and input the mask of the one target object of each sample image into the discriminator 305. - In some examples, the two sets of sample images comprise a set of virtual images having the partially blocked target objects, the virtual image forming the partially blocked target objects by constructing a relative location relationship of the blocked and unblocked initial target objects. When a plurality of the partially blocked target objects are present in the virtual image, the training method further comprises: obtaining a mask truth value of one unblocked object among the plurality of partially blocked target objects, and the mask truth value may be automatically generated by the system. In
Step 25, the generator 303 is used to generate the masks of the two sample images, respectively, comprising: generating the masks of the plurality of partially blocked target objects in the virtual image with the generator 303; and using the acquired mask truth value of one of the partially blocked target objects in an unblocked state to filter the generated masks of the plurality of partially blocked target objects, obtaining the mask of the one partially blocked target object generated by the generator 303 and inputting the mask of the one partially blocked target object into the discriminator 305. - In some examples, object detection of the two sample images in
Step 21 comprises: generating a bounding box of the partially blocked target object in the virtual image according to the bounding box of the unblocked initial target object to acquire the annotated images of the set of virtual images. For example, the unblocked initial target object has a bounding box, and a mask covers at least a portion of the initial target object to form a situation where the partially blocked target object is blocked, or the mask covers a portion of the initial target object such that the other, unblocked portions of the initial target object truncated by the mask are not in communication with one another. Employing the bounding box of the unblocked initial target object as the bounding box of the partially blocked target object in the virtual image facilitates obtaining a reliable bounding box, and subsequent mask prediction by the generator 303 based on the reliable bounding box facilitates improving the performance of the trained generator 303. In some examples, the training method further comprises: obtaining a bounding box of the unblocked initial target object, the bounding box of the unblocked initial target object being determined according to a mask truth value of the unblocked initial target object automatically generated by the system. - In some other examples, object detection of the two sample images in
Step 21 comprises: generating a binary mask of a partially blocked target object in the virtual image, and generating a bounding box of the partially blocked target object in the virtual image according to the generated binary mask, for example, where the unblocked area of the partially blocked target object in the virtual image is continuous and does not affect the detection of its bounding box. - Using the virtual image as a training sample facilitates obtaining the mask truth value of the partially blocked target object and facilitates filtering the generated masks of the plurality of target objects with the truth value of one of the target objects when the
generator 303 generates masks of images with the plurality of blocked target objects. During a single training process, each set of images retains a mask of one target object for training, which is conducive to improving the training efficiency of the generative adversarial network. - Using the virtual image as a training sample also facilitates accurate target detection of the partially blocked target object according to the bounding box of the unblocked initial target object before generating the mask of the virtual image, obtaining a reliable bounding box even when the mask covers at least a portion of the unblocked bounding box of the initial target object and the partially blocked target object could not otherwise be accurately detected; and whether the unblocked area of the partially blocked object is continuous can be determined according to the condition of the detection of the bounding box.
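The derivation of a bounding box from a generated binary mask, as described in the examples above, can be sketched as taking the extent of the mask pixels; this is a simplified sketch, and the helper name is hypothetical:

```python
def bbox_from_binary_mask(mask):
    # mask: 2-D list of 0/1 values. Returns (x_min, y_min, x_max, y_max)
    # covering all mask pixels, i.e., the bounding box detected from the
    # generated binary mask; None when the mask is empty.
    coords = [(x, y) for y, row in enumerate(mask)
              for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [c[0] for c in coords]
    ys = [c[1] for c in coords]
    return min(xs), min(ys), max(xs), max(ys)
```

Note that when blocking splits the visible area into disconnected parts, the box of the full extent may still be recoverable this way, but the continuity condition mentioned above is what makes such a detected box reliable.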
- In some examples, the other set of sample images of the two sets of sample images may also contain a plurality of unblocked objects, and the masks of the two sample images generated separately with the
generator 303 in Step 25 further comprises: using the generator 303 to generate the masks of the plurality of unblocked target objects in the other set of sample images, using the mask truth value of one of the unblocked target objects to filter the generated masks of the plurality of unblocked target objects, obtaining the mask of the one unblocked target object generated by the generator 303 and inputting the mask of the one unblocked target object into the discriminator 305. During a single training process, selecting a mask of one of the target objects in each sample image generated by the generator 303 for training facilitates learning the distribution pattern of the mask probability of a single target object, so as to generate an image mask generator for predicting different instances. - In some examples, the two sample images may comprise one real image having the partially blocked target object, and object detection of the two sample images, wherein the annotated images of the two sample images are obtained respectively, comprises: implementing object detection of the one real image by automatic annotation and/or manual annotation to obtain an annotated image of the one real image. The bounding boxes of the plurality of blocked target objects in the real image may not all be successfully recognized by automatic annotation; at this time, the detection success rate and reliability of the target object may be improved by manual annotation.
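The truth-value filtering used in the examples above, for both the blocked and the unblocked sets, can be sketched as keeping the generated mask with the highest overlap against the available mask truth value; intersection-over-union is an assumed matching criterion for illustration, not one fixed by the disclosure:

```python
def iou(a, b):
    # Intersection-over-union of two flat binary masks of equal length.
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0

def filter_masks(generated_masks, truth_mask):
    # Keep the single generated mask that best matches the mask truth value
    # of the chosen object; only this mask is then input into the
    # discriminator for the current training process.
    return max(generated_masks, key=lambda m: iou(m, truth_mask))
```

With a truth mask `[1, 1, 0, 0]` and candidates `[[0, 0, 1, 1], [1, 1, 0, 0]]`, the second candidate is retained.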
- In some examples, each set of sample images from the two sets of sample images comprises a plurality of sample images, each sample image comprising at least one target object region, each target object region comprising at least one target object, the plurality of iterative training processes comprising: in each iterative training process, selecting and inputting a sample image from each of the two sets of sample images as a training sample into the generative adversarial network, and traversing the plurality of sample images in each set of sample images through the plurality of iterative training processes; and/or, where each sample image comprises a plurality of target object regions, using different target object regions of the same sample image as training samples for the generative adversarial network during different iterative training processes, so as to traverse the different target object regions of the same sample image. As such, the utilization of each sample image as a training sample is improved to provide more extensive training data.
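The traversal scheme of the iterative training processes can be sketched as a deterministic pairing that cycles through both sets; this is an illustrative helper (the disclosure also allows random selection), and target-object regions within one image could be cycled through the same way:

```python
def training_pairs(first_set, second_set, iterations):
    # One training sample per iterative training process: one sample image
    # from each of the two sets, cycling so that every image in each set is
    # traversed over enough iterations.
    return [(first_set[i % len(first_set)], second_set[i % len(second_set)])
            for i in range(iterations)]
```

For example, with 2 images in the unblocked set and 3 in the blocked set, 6 iterations traverse every image of both sets.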
- Similar to the example shown in
Fig. 1, in the example shown in Figs. 2-9, the target object of the first set of sample images is not blocked, comprising: one or more target objects present in each first sample image are in an unblocked state. The target object of the second set of sample images is in a partially blocked state, comprising: there is at least one partially blocked target object in each of the second sample images. In order to train a mask generator with robustness, each set of sample images provided should contain as many types of target objects as possible, including target objects having different shapes, different sizes, different categories, different colors, different numbers, and/or different locations. - Accordingly, in
Step 23, the annotated images of the two sample images may be randomly selected from the two sets of sample images, respectively, and inputted into the generative adversarial network, such that the target objects in the two sample images have different blocked states. Whether the target objects contained in the selected two sample images have the same or similar features need not be constrained; for example, the target objects contained in the randomly selected two sample images may have different features such as size, shape, category, color, number, location, etc. By randomly selecting and inputting the two sample images into the generative adversarial network, the generator 303 and the discriminator 305 may learn the difference between the masks of the blocked object and the unblocked object more from the distribution of the generated mask probabilities of the target objects in the two sample images, rather than simply or primarily learning that difference from feature information such as the shape, size, or category of the target object. - Referring to
Fig. 9 , an example of the present disclosure further provides an image instance segmentation method, comprising: - Step 51: implementing object detection of the received image to identify a bounding box of a target object in the received image;
- Step 53: using an image mask generator to generate a mask of the identified target object based on the bounding box, wherein the image mask generator is acquired using a training method of the preceding examples of the present disclosure.
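Steps 51 and 53 can be sketched as a two-stage pipeline; `detect` and `generate_mask` are assumed callables standing in for the trained object detector and the image mask generator of the preceding examples, not interfaces defined by the disclosure:

```python
def segment_instances(image, detect, generate_mask):
    # Step 51: object detection yields one bounding box per target object.
    # Step 53: the image mask generator predicts a mask for each box.
    return [{"box": box, "mask": generate_mask(image, box)}
            for box in detect(image)]
```

Each returned entry pairs a detected bounding box with the instance mask predicted for it, so downstream code can consume boxes and masks together.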
- In some examples, the image instance segmentation method further comprises: implementing object detection of the received image to identify a category of a target object in the received image; outputting the mask and category of the target object with the image mask generator.
- The image instance segmentation method according to the examples of the present disclosure can obtain accurate instance segmentation results not only for images with unblocked objects, but also for images with blocked objects. Using the pre-trained image mask generator, accurate and reliable instance segmentation results can be obtained via the image segmentation method, improving the performance of the instance segmentation method for image content understanding, such as its accuracy and reliability, and expanding the application of the instance segmentation technique to real-world scenes presenting complex image contents.
- The examples of the present disclosure also provide for a computer program product comprising a computer program that, when executed by a processor, implements a training method according to the previous examples of the present disclosure or an image instance segmentation method according to the previous examples of the present disclosure.
- The examples of the present disclosure also provide for a computer device comprising a processor, a memory, and a computer program stored on the memory that when executed by the processor implements a training method according to the previous examples of the present disclosure or an image instance segmentation method according to the previous examples of the present disclosure.
- Embodiments of the present disclosure also provide for a computer-readable storage medium. The computer-readable storage medium may be stored with executable code that, when executed by a computer, causes the computer to implement a training method according to the previous examples of the present disclosure or to implement an image instance segmentation method according to the previous examples of the present disclosure.
- For example, the computer-readable storage medium may include, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Electrically-Erasable Programmable Read-Only Memory (EEPROM), Static Random Access Memory (SRAM), hard disk, flash memory, and the like.
- Specific examples of the present disclosure are described above. Other examples are within the scope of the appended claims. In some instances, the actions or steps recited in the claims may be performed in a different order than in the examples and may still achieve the desired result. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired result. In certain embodiments, multitasking and parallel processing may also be possible or advantageous.
- Not all steps and units in the above-mentioned processes and system structure diagrams are necessary, and some steps or units may be omitted based on actual needs. The device structure described in the above examples may be a physical structure or a logical structure, i.e., some units may be implemented by the same physical entity, some units may be implemented by a plurality of physical entities respectively, or some units may be implemented collectively by certain components of a plurality of independent devices.
- The foregoing explanation of the embodiments describes the present disclosure only within the framework of the examples described. Of course, as long as the various features of the embodiments are technically meaningful, they can be freely combined with one another, and similar parts of the different examples can be referenced to one another, without departing from the framework of the present disclosure.
- The present disclosure is described in detail above with reference to the specific examples. Obviously, the above description and the examples shown in the appended drawings should be understood as exemplary and do not constitute a limitation of the present disclosure. For those skilled in the art, various variants or modifications may be made thereto without departing from the spirit of the present disclosure, all of which fall within the scope of the present disclosure.
Claims (18)
- A training method for an image mask generator, comprising: selecting a sample image (100, 200) from each of two sets of sample images and inputting the selected sample images into a generative adversarial network comprising a generator (303) and a discriminator (305), each sample image comprising a target object, a target object of the first set of sample images among the two sets of sample images being unblocked, and a target object of the second set of sample images being partially blocked; generating, with the generator (303), the masks for the two sample images (100, 200) respectively, the mask for each sample image (100, 200) being used for predicting a target object of the sample image; inputting the generated masks of the two sample images (100, 200) into the discriminator (305), and constructing an adversarial loss function (307) with the discrimination results of the generated masks of the two sample images (100, 200) by the discriminator (305); and updating the parameters of the generator (303) according to the adversarial loss function (307) to train the generator (303).
- The training method according to Claim 1, wherein the adversarial loss function (307) comprises a stack of a first loss item and a second loss item, wherein the first loss item is constructed based on a first discrimination result, by the discriminator (305), of the mask of the first sample image (100) generated by the generator (303), the first sample image (100) being taken from the first set of sample images; and the second loss item is constructed based on a second discrimination result, by the discriminator (305), of the mask of the second sample image (200) generated by the generator (303), the second sample image (200) being taken from the second set of sample images.
- The training method according to Claim 2, wherein the first discrimination result comprises: a probability that a mask of the first sample image (100) generated by the generator (303) is a mask of an image with an unblocked target object, according to the estimation of the discriminator (305);
the second discrimination result comprises: a probability that a mask of the second sample image (200) generated by the generator (303) is a mask of an image with an at least partially blocked target object, according to the estimation of the discriminator (305). - The training method according to Claim 1, wherein the masks of the two sample images (100, 200) are generated separately with the generator (303), comprising:
generating, with the generator (303), a pixel-level mask probability of a target object for the two sample images (100, 200), respectively. - The training method according to any one of Claims 1-4, further comprising: detecting the target objects of the two sample images (100, 200), obtaining the annotated images (110, 210) of the two sample images (100, 200) respectively, the annotated image (110, 210) of each sample image (100, 200) comprising the annotated result of a bounding box (111, 211) of a target object (101, 201) for the sample image (100, 200); inputting the two sample images (100, 200) into the generative adversarial network, comprising: inputting the annotated images (110, 210) of the two sample images (100, 200) into the generative adversarial network; and generating, with the generator (303), the masks for the two sample images (100, 200), respectively, comprising: generating, with the generator (303), the mask of a target object (101, 201) having the annotated results for the two sample images (100, 200), respectively.
- The training method according to Claim 4, wherein the training method comprises a plurality of iterative training processes, repeating the steps of training the generator (303) for the training purposes of reducing the mask probability distribution differences of the target objects of the two sample images (100, 200) generated by the generator (303) and/or enhancing the capability of the discriminator (305) to differentiate the mask categories of the two sample images (100, 200) generated by the generator (303).
- The training method according to Claim 6, wherein generating the masks of the two sample images (100, 200) with the generator (303), respectively, comprises: generating, with the generator (303), masks of a plurality of target objects for at least one of the two sample images (100, 200); and inputting the masks of the two sample images (100, 200) generated during each training process into the discriminator (305) comprises: filtering the masks of the plurality of target objects generated for the at least one sample image to obtain the mask of one target object in each sample image, and inputting the mask of the one target object in each sample image into the discriminator (305).
- The training method according to Claim 6, further comprising: determining whether a training termination condition is satisfied; if so, terminating the training; if not, repeating the steps of training the generator (303);
the training termination condition comprises: terminating the iterative training processes when a loss function value determined according to the adversarial loss function (307) is within a first predetermined threshold range; and/or, obtaining a pixel count distribution map of the mask probability for a target object in each of the two sample images (100, 200), calculating the standard deviation of the pixel count distribution of the mask probability based on the pixel count distribution map, and terminating the iterative training processes when the difference between the standard deviations of the pixel count distributions of the mask probability for the target objects in the two sample images is within a second predetermined threshold range. - The training method according to Claim 6, wherein each of the two sets of sample images comprises a plurality of sample images, each sample image comprising at least one target object region and each target object region comprising at least one target object, the plurality of iterative training processes comprising: selecting one sample image from each of the two sets of sample images (100, 200) per iterative training process as a training sample to be inputted into the generative adversarial network, and traversing the plurality of sample images in each set of sample images through the plurality of iterative training processes; and/or, when each sample image comprises a plurality of target object regions, inputting different target object regions of the same sample image (100, 200) as training samples into the generative adversarial network during different iterative training processes, and traversing the different target object regions of the same sample image (100, 200) through the different iterative training processes, respectively.
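By way of illustration only (not part of the claims), the second termination condition, comparing the standard deviations of the pixel-count distributions of mask probabilities, can be sketched as follows; the bin count and both threshold values are hypothetical choices, not values fixed by the claims:

```python
import numpy as np

def prob_histogram_std(mask_probs: np.ndarray, bins: int = 10) -> float:
    """Standard deviation of the pixel count distribution of mask
    probabilities, i.e. of the histogram of per-pixel probabilities
    over equal-width probability bins on [0, 1]."""
    counts, _ = np.histogram(mask_probs, bins=bins, range=(0.0, 1.0))
    return float(np.std(counts))

def should_terminate(loss, probs_a, probs_b, loss_thresh=0.05, std_thresh=100.0):
    """Terminate when the adversarial loss value falls within the first
    threshold range, and/or when the difference between the two images'
    histogram standard deviations falls within the second."""
    loss_ok = abs(loss) <= loss_thresh
    std_ok = abs(prob_histogram_std(probs_a) - prob_histogram_std(probs_b)) <= std_thresh
    return loss_ok or std_ok
```

Intuitively, when the generator treats blocked and unblocked images alike, the two probability histograms have similar shapes and their standard deviations converge, which is what the second condition detects.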
- The training method according to Claim 1, wherein the second set of sample images comprises virtual images in which the partially blocked target objects are formed by constructing a relative location relationship between blocked and unblocked initial target objects.
- The training method according to Claim 10, wherein there are a plurality of the partially blocked target objects in the virtual image, the training method further comprising: obtaining a mask truth value of an unblocked initial target object corresponding to a partially blocked target object from the plurality of partially blocked target objects;
generating, with the generator (303), the masks for the two sample images (100, 200), respectively, comprising:
generating, with the generator (303), the masks of the plurality of partially blocked target objects in the virtual image; and using the acquired mask truth value of the corresponding unblocked initial target object to filter the generated masks of the plurality of partially blocked target objects, so as to acquire the mask of the partially blocked target object generated by the generator (303). - The training method according to Claim 10, wherein implementing object detection of the two sample images (100, 200) comprises: generating a bounding box of a corresponding partially blocked target object in the virtual image according to a bounding box of the unblocked initial target object, to acquire an annotated image of the set of virtual images; or, generating a binary mask of a partially blocked target object in the virtual image, and generating a bounding box of the partially blocked target object in the virtual image according to the generated binary mask.
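By way of illustration only (not part of the claims), the second alternative, deriving a bounding box from a generated binary mask, can be sketched as follows; the function name and the row-major array layout are assumptions:

```python
import numpy as np

def bbox_from_binary_mask(mask: np.ndarray):
    """Derive an axis-aligned bounding box (x_min, y_min, x_max, y_max)
    from a binary mask by taking the extremes of its foreground pixels."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None  # no foreground pixels, hence no box
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# A 5x5 mask with a 2x3 block of foreground pixels:
m = np.zeros((5, 5), dtype=np.uint8)
m[1:3, 1:4] = 1
print(bbox_from_binary_mask(m))  # (1, 1, 3, 2)
```

Because the box is computed directly from the synthesized occluded mask, the virtual images can be annotated automatically, without manual labeling.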
- The training method according to Claim 5, wherein the two sample images comprise one real image from the second set of sample images, and implementing the object detection of the two sample images to obtain the annotated images of the two sample images respectively comprises:
implementing the object detection of the one real image by automatic annotation and/or manual annotation to obtain the annotated image of the one real image. - The training method according to Claim 5, wherein the annotated image (110, 210) of each sample image (100, 200) further comprises an annotated result for a category of a target object in the sample image (100, 200), the training method further comprising: generating, with the generator (303), a category of a target object in the two sample images (100, 200).
- An image instance segmentation method, comprising: implementing object detection of a received image to identify a bounding box of a target object in the received image; and generating, with an image mask generator, a mask identifying the target object based on the bounding box, wherein the image mask generator is acquired using the training method according to any one of Claims 1-14.
- The image instance segmentation method according to Claim 15, further comprising:
implementing object detection of the received image to identify the category of a target object in the received image; and outputting the mask and the category of the target object with the image mask generator. - A computer program product comprising a computer program that, when executed by a processor, implements a training method of an image mask generator according to any one of Claims 1-14, or an image instance segmentation method according to Claim 15 or Claim 16.
- A computer device comprising a processor, a memory, and a computer program stored on the memory, wherein the computer program is executed by the processor to implement a training method of an image mask generator according to any one of Claims 1-14 or to implement an image instance segmentation method according to Claim 15 or Claim 16.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210914623.4A CN117557790A (en) | 2022-08-01 | 2022-08-01 | Training method of image mask generator and image instance segmentation method |
Publications (1)
Publication Number | Publication Date |
---|---|
EP4318395A1 true EP4318395A1 (en) | 2024-02-07 |
Family
ID=87419069
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP23186126.1A Pending EP4318395A1 (en) | 2022-08-01 | 2023-07-18 | A training method and an image instance segmentation method for an image mask generator |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP4318395A1 (en) |
CN (1) | CN117557790A (en) |
- 2022
- 2022-08-01: CN application CN202210914623.4A, published as CN117557790A, status: active, pending
- 2023
- 2023-07-18: EP application EP23186126.1A, published as EP4318395A1, status: active, pending
Non-Patent Citations (2)
Title |
---|
SALEH KAZIWA ET AL: "Occlusion Handling in Generic Object Detection: A Review", 2021 IEEE 19TH WORLD SYMPOSIUM ON APPLIED MACHINE INTELLIGENCE AND INFORMATICS (SAMI), IEEE, 21 January 2021 (2021-01-21), pages 477 - 484, XP033888741, DOI: 10.1109/SAMI50585.2021.9378657 * |
YAN XIAOSHENG ET AL: "Visualizing the Invisible: Occluded Vehicle Segmentation and Recovery", 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), IEEE, 27 October 2019 (2019-10-27), pages 7617 - 7626, XP033723078, DOI: 10.1109/ICCV.2019.00771 * |
Also Published As
Publication number | Publication date |
---|---|
CN117557790A (en) | 2024-02-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10229346B1 (en) | Learning method, learning device for detecting object using edge image and testing method, testing device using the same | |
CN115601374B (en) | Chromosome image segmentation method | |
CN110310264B (en) | DCNN-based large-scale target detection method and device | |
CN110428432B (en) | Deep neural network algorithm for automatically segmenting colon gland image | |
CN110866455B (en) | Pavement water body detection method | |
CN111311611B (en) | Real-time three-dimensional large-scene multi-object instance segmentation method | |
CN112836625A (en) | Face living body detection method and device and electronic equipment | |
CN109271957B (en) | Face gender identification method and device | |
CN112417955A (en) | Patrol video stream processing method and device | |
CN115187530A (en) | Method, device, terminal and medium for identifying ultrasonic automatic breast full-volume image | |
CN112560584A (en) | Face detection method and device, storage medium and terminal | |
CN113177554B (en) | Thyroid nodule identification and segmentation method, system, storage medium and equipment | |
CN112926667B (en) | Method and device for detecting saliency target of depth fusion edge and high-level feature | |
KR102337687B1 (en) | Artificial neural network-based target region extraction apparatus, method and learning method thereof | |
CN112182269B (en) | Training of image classification model, image classification method, device, equipment and medium | |
CN111753775B (en) | Fish growth assessment method, device, equipment and storage medium | |
EP4318395A1 (en) | A training method and an image instance segmentation method for an image mask generator | |
CN115205855B (en) | Vehicle target identification method, device and equipment integrating multi-scale semantic information | |
CN116740758A (en) | Bird image recognition method and system for preventing misjudgment | |
CN110135382A (en) | A kind of human body detecting method and device | |
CN115937991A (en) | Human body tumbling identification method and device, computer equipment and storage medium | |
CN114663347B (en) | Unsupervised object instance detection method and unsupervised object instance detection device | |
CN116091784A (en) | Target tracking method, device and storage medium | |
CN114926631A (en) | Target frame generation method and device, nonvolatile storage medium and computer equipment | |
CN113554685A (en) | Method and device for detecting moving target of remote sensing satellite, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED |
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC ME MK MT NL NO PL PT RO RS SE SI SK SM TR |