GB2618876A - Lightweight and efficient object segmentation and counting method based on generative adversarial network (GAN) - Google Patents

Lightweight and efficient object segmentation and counting method based on generative adversarial network (GAN)

Info

Publication number
GB2618876A
Authority
GB
United Kingdom
Prior art keywords
feature map
layer
loss
predicted
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB2301554.8A
Other versions
GB2618876B (en)
Inventor
He Xianding
Deng Lijia
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Aeronautic Polytechnic
Original Assignee
Chengdu Aeronautic Polytechnic
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Aeronautic Polytechnic
Publication of GB2618876A
Application granted
Publication of GB2618876B
Legal status: Active

Classifications

    • G06V 10/82 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06F 18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N 3/0475 — Generative networks
    • G06F 18/253 — Fusion techniques of extracted features
    • G06N 3/045 — Combinations of networks
    • G06N 3/0464 — Convolutional networks [CNN, ConvNet]
    • G06N 3/048 — Activation functions
    • G06N 3/08 — Learning methods
    • G06N 3/084 — Backpropagation, e.g. using gradient descent
    • G06N 3/094 — Adversarial learning
    • G06T 7/10 — Segmentation; Edge detection
    • G06T 7/70 — Determining position or orientation of objects or cameras
    • G06V 20/69 — Microscopic objects, e.g. biological cells or cellular parts
    • G06T 2207/20081 — Training; Learning
    • G06T 2207/20084 — Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Used in video image processing, an object segmentation and counting method based on a generative adversarial network (GAN) comprises encoder, decoder and predictor stages. A down-sampling encoder produces a feature map from an image. A counting layer predicts the quantity of objects in the image, and the decoder employs fold beyond-nearest up-sampling (FBU), which reduces calculations, accelerates neural network operation, improves efficiency and optimizes the structure. A predictor produces a predicted segmented feature map, which is passed together with the truth value of the image to a coordination discriminator to obtain a generator discrimination matrix. A total training loss is calculated from the accumulated data and fed back iteratively to obtain a trained generator. A fused feature map is then produced from the input image and processed by feature extraction convolution modules to produce a discriminant feature map, from which predicted fake and real discrimination matrices are produced; the total training loss is obtained and fed back, and the steps of the method are repeated until the generator and discriminator meet a predetermined condition.

Description

LIGHTWEIGHT AND EFFICIENT OBJECT SEGMENTATION AND COUNTING METHOD BASED ON GENERATIVE ADVERSARIAL NETWORK (GAN)
TECHNICAL FIELD
[0001] The present disclosure relates to the technical field of video image processing, and in particular, to a lightweight and efficient object segmentation and counting method based on a generative adversarial network (GAN).
BACKGROUND
[0002] At present, the task of object counting is mainly based on a density map. This method can display a position distribution of the object by using the density map and obtain a total object quantity by calculating a numerical value of the density map. Although this method can both count the object and obtain a distribution of the object, it imposes high requirements on the complexity of a network and the collection of a dataset. When the density map is used, each target point in an image needs to be labeled in the dataset to obtain a point diagram with precise coordinate positions, which is time-consuming and laborious. Then the density map is obtained by mathematically calculating the point diagram. Such a density map is usually generated by using the following methods: 1. expanding the point diagram by using a fixed-size Gaussian kernel; 2. expanding the point diagram by using an adaptive Gaussian kernel; and 3. expanding the point diagram by using a perspective map matching a scenario. However, these methods all have various limitations. For example, when the fixed-size Gaussian kernel is used, it is impossible to reflect a scaling change of the object in the image, resulting in poor overlapping between an expanded region and an actual object. The adaptive Gaussian kernel can only be applied to a high-density scenario. For sparse objects, it is difficult to use the adaptive Gaussian kernel to obtain a reasonable Gaussian kernel size. Moreover, although the perspective map can be used to obtain the most accurate Gaussian kernel size, the perspective map is not collected for most datasets, because it needs to accurately match the perspective relationship between the visual angle of the camera and the ground, and its collection and calculation are complex.
[0003] In addition to the above shortcomings in dataset production for the density map, datasets suitable for the density map are themselves sparse. Currently, most datasets only include a total object quantity in the image or overall segmentation of the object. When the density map is used for counting, the network has fewer datasets to choose from during pre-training and is prone to overfitting during pre-training, which makes it difficult for the network to transfer to other scenes.
[0004] During object counting, a method based on simple regression can directly predict the total object quantity based on a dataset containing a label indicating the total object quantity, expanding selectivity of the dataset. However, this method usually lacks position information of the object, and its prediction reliability is easily called into question.
[0005] Sometimes a complex task can be decomposed into a plurality of simple tasks. For a neural network, if different tasks are strongly correlated, such as object identification and behavior determining, one network can be used to perform feature extraction and complete result prediction. If tasks are poorly correlated, such as prediction of an object quantity and segmentation of an object contour, the tasks are poorly completed if a single simple network is used. Generally, a plurality of targeted neural networks are used to complete the plurality of tasks respectively. However, in this method, a plurality of neural networks need to be used at the same time, which results in a too large overall volume of the model. At a time when distributed computation is popular, such a large-volume model is not conducive to actual deployment. Therefore, it is desirable to use a single network as far as possible to complete a multitask function and save computing resources.
[0006] At present, a multitask generator is mainly trained in an end-to-end direct training manner. Although this training manner is used by most neural networks at present, when it is used to train a multitask model, it is usually required to design a unique multi-column network model to deal with a plurality of tasks and a complex loss function to coordinate a plurality of task goals, and the training speed is usually slow, so a longer time is required to complete training of the tasks.
SUMMARY
[0007] In order to resolve the above problems in the prior art, the present disclosure is intended to provide a lightweight and efficient object segmentation and counting method based on a GAN. The present disclosure refines an object counting task into object quantity prediction and object region segmentation. A dataset containing only a label indicating a total object quantity and a label indicating an object region is used for training, such that the total object quantity and the object region segmentation can be predicted simultaneously. This overcomes a limitation that only a density map dataset can be used in a density map method, and a defect that a simple regression method lacks object position information.
[0008] The present disclosure adopts the following technical solutions: [0009] A lightweight and efficient object segmentation and counting method based on a GAN includes the following steps: [0010] step 1: obtaining an input image: processing all input images based on a same size, and making truth values of the input images have a same size as a training image, such that the input images one-to-one correspond to the truth values of the images; [0011] step 2: sending a processed input image in the step 1 to a down-sampling encoder for feature extraction to obtain a deepest-layer feature map; [0012] step 3: sending the deepest-layer feature map to a counting layer to predict a quantity of objects in the whole input image; [0013] step 4: sending the deepest-layer feature map to a fold beyond-nearest up-sampling (FBU) module to obtain an expanded feature map; [0014] step 5: performing feature fusion on the deepest-layer feature map and the expanded feature map to obtain a first final feature map; [0015] step 6: sending the first final feature map to the FBU module in the step 4 as the deepest-layer feature map, repeatedly performing the steps 4 and 5 until a second final feature map meeting a requirement is obtained, and sending the second final feature map to a predictor to obtain a predicted segmented feature map, where [0016] in the present disclosure, when a size of the second final feature map is 1/2 of the size of the input image, the second final feature map is defined to meet the requirement; this size is only an optimal value adopted in the present disclosure, and is not intended to exclude a second final feature map whose size is, for example, 1/3 or 3/2 or 4/1 of the size of the input image.
[0017] step 7: sending both the predicted segmented feature map and the truth value of the image to a coordination discriminator, and using the coordination discriminator to perform learning and determine a difference between the predicted segmented feature map and a truth value chart of the image, so as to obtain a generator discrimination matrix; [0018] step 8: generating a verified valid matrix with a same size as the generator discrimination matrix and a value of 1, and a verified fake matrix with a value of 0; [0019] step 9: calculating a total training loss of a generator based on a truth value of the quantity of objects in a dataset, the truth value of the input image that is obtained in the step 1, the quantity of objects that is obtained in the step 3, the predicted segmented feature map obtained in the step 6, the generator discrimination matrix obtained in the step 7, and the verified valid matrix generated in the step 8; [0020] step 10: sending the total training loss of the generator back to the generator network for iterative network updating and learning, and performing a round of training on the generator to obtain a trained generator; [0021] step 11: fusing the processed input image in the step 1 and the predicted segmented feature map obtained in the step 6 on an image channel to obtain a fused feature map, and sending the fused feature map to the coordination discriminator; [0022] step 12: processing, by four feature extraction convolution modules in the coordination discriminator, the fused feature map to obtain a first deep-layer discriminant feature map; [0023] step 13: inputting the first deep-layer discriminant feature map obtained in the step 12 into a structural feature determining layer composed of one convolutional layer to obtain a predicted fake discrimination matrix containing a structural difference; [0024] step 14: fusing the processed input image in the step 1 and a truth-value image on the image channel, sending a fused image to the coordination discriminator for processing by the four feature extraction convolution modules, to obtain a second deep-layer discriminant feature map, and then inputting the second deep-layer discriminant feature map into the structural feature determining layer composed of one convolutional layer to obtain a predicted real discrimination matrix; [0025] step 15: calculating a total training loss of the coordination discriminator based on the verified valid matrix and the verified fake matrix that are obtained in the step 8, the predicted fake discrimination matrix obtained in the step 13, and the predicted real discrimination matrix obtained in the step 14; [0026] step 16: sending the total training loss obtained in the step 15 back to the network for network iterative learning, performing a round of training on the coordination discriminator to obtain a trained coordination discriminator, and saving the generator obtained in the step 10 and the coordination discriminator obtained in this step; and [0027] step 17: repeating the steps 2 to 16 until a generator and a coordination discriminator that meet a predetermined condition are obtained.
[0028] When the predicted object quantity in the step 3 and the predicted segmented feature map in the step 6 are very close to or even the same as the truth value of the image, or the two total training losses in the step 9 and the step 15 are no longer reduced, the training can be stopped, that is, the steps 2 to 16 are no longer executed.
[0029] Based on the above technical solutions, the present disclosure constructs a lightweight and efficient multi-scale-feature-fusion multitask generator. This method can directly predict the object quantity through training by using a dataset containing a label indicating a total object quantity, and can directly generate an object distribution range through training by using a dataset containing a label indicating an object position. The generator can directly predict the quantity of objects in the input image by performing the steps 2 and 3, and directly predict position regions of the objects in the input image by performing the steps 2 and 4 to 6. The generator can predict the total object quantity and object position segmentation at the same time, which overcomes a limitation that only a density map dataset can be used in a density map-based counting method, such that the network can use a simple dataset containing only the total object quantity. The generator can predict the object position, which overcomes a defect that a simple regression method lacks object position information.
[0030] In addition, in order to improve training efficiency of a neural network for a multitask object, based on a GAN and mutual confrontation between the generator and the discriminator, this technology proposes a new multitask generator training method to improve the training efficiency of the network. The present disclosure provides a coordination discriminator for providing assistance in coordinating multitask training, so as to improve training efficiency of the multitask generator in generative adversarial learning, improve distribution of attention during multitask training, and lower a design requirement for a loss function in the training process. In addition, the present disclosure proposes a modular and convenient joint hybrid loss function for training the multitask generator of counting and image segmentation tasks. [0031] Preferably, the down-sampling encoder described in the step 2 includes six down-sampling modules; the first five down-sampling modules have a same structure and each include one convolutional layer with a stride of 2, one instance normalization layer, and one leaky rectified linear unit (hereinafter referred to as leaky ReLU) activation layer, and the last down-sampling module includes one convolutional layer, one random dropout layer, and one leaky ReLU.
[0032] The present disclosure uses six down-sampling units whose sizes are less than half of a size of a classical feature extraction model, namely, a visual geometry group 16 (VGG16). This leaves a lot of memory redundancy for further adding a decoder. The down-sampling unit adopts a convolutional layer with a stride of 2, which reduces a size of a feature mapping while extracting a feature, and avoids a feature loss caused by using a pooling layer.
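For illustration, the following is a minimal PyTorch sketch of a down-sampling encoder matching this description: six stride-2 convolution blocks, instance normalization and leaky ReLU in the first five, and dropout in the last. The kernel size, channel widths, dropout rate, and leaky-ReLU slope are assumptions for the sketch, not values taken from the patent.

```python
import torch.nn as nn


class DownBlock(nn.Module):
    """One down-sampling module: stride-2 conv -> instance norm -> leaky ReLU.
    The last module replaces the normalization with random dropout."""
    def __init__(self, in_ch, out_ch, last=False):
        super().__init__()
        layers = [nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)]
        layers.append(nn.Dropout2d(0.5) if last else nn.InstanceNorm2d(out_ch))
        layers.append(nn.LeakyReLU(0.2, inplace=True))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)


class Encoder(nn.Module):
    """Six down-sampling modules: each halves H and W, so 2**6 = 64x reduction."""
    def __init__(self, in_ch=3, widths=(64, 128, 256, 256, 512, 512)):
        super().__init__()
        blocks, prev = [], in_ch
        for i, w in enumerate(widths):
            blocks.append(DownBlock(prev, w, last=(i == len(widths) - 1)))
            prev = w
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        skips = []
        for blk in self.blocks:
            x = blk(x)
            skips.append(x)   # kept for multi-scale feature fusion in the decoder
        return x, skips       # deepest-layer feature map + intermediate features
```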
[0033] Preferably, the counting layer in the step 3 includes one global average pooling layer and one convolutional layer.
[0034] The global average pooling layer is used to collapse the deepest-layer feature map into a fixed-size feature map, such that the fixed-size feature map is predicted by a fixed convolutional layer. This enables the network to adapt to input images of different sizes, and improves universality of a model.
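A minimal sketch of such a counting layer is shown below; the input channel count and the use of a 1x1 convolution as the fixed regression head are assumptions for illustration.

```python
import torch.nn as nn


class CountingLayer(nn.Module):
    """Global average pooling collapses the deepest-layer feature map to 1x1,
    then a fixed 1x1 convolution regresses the object quantity, so the layer
    accepts input images (and hence feature maps) of any size."""
    def __init__(self, in_ch=512):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)        # (N, C, H, W) -> (N, C, 1, 1)
        self.conv = nn.Conv2d(in_ch, 1, kernel_size=1)

    def forward(self, deepest_feat):
        count = self.conv(self.gap(deepest_feat))  # (N, 1, 1, 1)
        return count.flatten(1)                    # predicted object quantity per image
```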
[0035] Preferably, the FBU module in the step 4 includes one convolutional layer, two flatten layers, and two linear layers.
[0036] After the deepest-layer feature map is sent to the FBU module, a newly added pixel needed for image expansion is generated in the image channel through operation at the convolutional layer.
[0037] Horizontal matrix flattening is performed once on the deepest-layer feature map with the newly added pixel to obtain a horizontal linear vector, and a linear mapping matrix reconstruction calculation is performed on the horizontal linear vector to obtain an intermediate image. Then, vertical flattening is performed on the intermediate image to obtain a vertical linear vector, and a linear mapping matrix reconstruction calculation is performed on the vertical linear vector to arrange the linear vector based on an expanded height and the width of the original deepest-layer feature map, such that the newly added pixel is transferred to the height of the original deepest-layer feature map to obtain the expanded feature map.
[0038] Preferably, the predictor described in the step 6 includes a convolutional layer with a size of 4, an FBU module, and a hyperbolic tangent activation function activation layer; the convolutional layer with a size of 4 performs feature prediction on the second final feature map, the final feature map is expanded by the FBU module to generate the predicted segmented feature map with a same size as the input image described in the step 1, and then the hyperbolic tangent activation function activation layer activates the predicted segmented feature map to speed up training convergence and obtain a trained predicted segmented feature map.
[0039] The present disclosure activates the predicted segmented feature map by using the hyperbolic tangent activation function activation layer to accelerate training convergence, so as to output a predicted segmented feature map with better quality after training.
[0040] Preferably, the total training loss of the generator in the step 9 is calculated according to the following steps: [0041] calculating a loss between the predicted segmented feature map and the truth value of the image by using an L1 loss function, to obtain a loss of a segmentation task; [0042] calculating a loss between the quantity of objects in the step 3 and the truth value of the quantity of objects by using an L2 loss function, to obtain a loss of a counting task; [0043] calculating a loss between the generator discrimination matrix and the verified valid matrix by using the L2 loss function, to obtain a discrimination loss of the generator; and [0044] performing weighting and summation on the loss of the segmentation task, the loss of the counting task, and the discrimination loss of the generator to obtain the total training loss of the generator.
[0045] By using the L1 and L2 loss functions, the present disclosure can adjust weights of counting and segmentation based on a specific application environment to improve a training effect.
[0046] Preferably, the step 12 includes the following steps: [0047] step 12.1: supplementing a blank pixel with a size of 4 around the fused feature map, so as to avoid a feature loss caused by an odd side length in a subsequent fused feature map; [0048] step 12.2: sending a processed fused feature map in the step 12.1 to one convolutional layer with a size of 8*8 and a stride of 2 for feature extraction and fusion with a large degree of perception; [0049] step 12.3: supplementing a blank pixel with a size of 3 around a processed fused feature map in the step 12.2, so as to avoid a feature loss caused by an odd side length in a subsequent fused feature map; [0050] step 12.4: sending a processed fused feature map in the step 12.3 to one convolutional layer with a size of 6*6 and a stride of 2 for second feature extraction; [0051] step 12.5: supplementing a blank pixel with a size of 2 around a processed fused feature map in the step 12.4, so as to avoid a feature loss caused by an odd side length in a subsequent fused feature map; and [0052] step 12.6: sending a processed fused feature map in the step 12.5 to two consecutive convolutional layers with a size of 4*4 and a stride of 2 for third feature extraction to obtain the first deep-layer discriminant feature map.
[0053] The step 12 in the present disclosure includes four feature extraction modules, and each of the feature extraction modules is composed of one convolutional layer and a related complex functional layer. Perception domains of these four feature extraction modules gradually shrink. At an early stage, a feature can be quickly extracted to obtain a wider range of structural relevance, and a size of the feature map can be rapidly reduced to reduce calculations. Subsequently, feature extraction is gradually refined to obtain a more accurate eigenvalue.
[0054] Preferably, in the step 13, the first deep-layer discriminant feature map obtained in the step 12 is sent to one convolutional layer with a size of 3*3 and a stride of 1 for structural feature determining, to output the predicted fake discrimination matrix containing the structural difference.
[0055] Further, after the calculation by the convolutional layer, instance normalization and leaky ReLU activation need to be separately performed on a calculation result of the convolutional layer to prevent a gradient loss or gradient explosion during training.
[0056] Preferably, the step 15 includes the following steps: [0057] calculating a loss between the predicted real discrimination matrix and the verified valid matrix and a loss between the predicted real discrimination matrix and the verified fake matrix by using an L2 loss function, and performing summation to obtain a predicted real discrimination loss; [0058] calculating a loss between the predicted fake discrimination matrix and the verified valid matrix and a loss between the predicted fake discrimination matrix and the verified fake matrix by using the L2 loss function, and performing summation to obtain a predicted fake discrimination loss; and [0059] calculating an average value of the predicted real discrimination loss and the predicted fake discrimination loss, and taking the average value as the total training loss of the coordination discriminator. [0060] The present disclosure achieves the following beneficial effects: [0061] 1. The present disclosure provides a lightweight and fast multi-scale-feature-fusion multitask generator (LFMMG), which realizes counting without point labels and clear object position prediction. Compared with a U-shaped network (U-Net), the LFMMG in the present disclosure reduces parameters by more than 50%, and a size of a feature extraction encoder of the LFMMG is only 37% of that of a VGG16. By reducing a quantity of interpolated up-sampling layers and using an FBU method, the present disclosure significantly reduces calculations and memory consumption in a decoder. A global average pooling layer and a convolutional layer are used together, such that a generator in the present disclosure can be compatible with an input image of any size, instead of only being compatible with an input image of a fixed size as when a fully connected layer is used.
[0062] 2. The present disclosure optimizes a classical design of the VGG16, and a volume of an optimized VGG16 is only 37% of a volume of the VGG16.
[0063] 3. In a predictor stage, the present disclosure sets an independent predictor for each task to meet unique needs of different tasks. In addition, a network model of the present disclosure can be compatible with input images of different sizes, which improves universality of a network. The present disclosure also divides an object quantity statistics task based on a density map into two tasks: quantity prediction and position prediction, which reduces difficulty of learning and expands a usable range of a dataset during pre-training.
[0064] 4. In the encoder, the present disclosure optimizes a structural design of the model. The present disclosure uses six down-sampling units whose sizes are less than half of the size of the VGG16. This leaves a lot of memory for further adding the decoder. The down-sampling unit adopts a convolutional layer with a stride of 2, which reduces a size of a feature mapping while extracting a feature, and avoids a feature loss caused by using a pooling layer.
[0065] 5. The present disclosure proposes the FBU method to expand a size of a feature map. After studying and comparing various up-sampling methods, the present disclosure designs the FBU method, which will be described in the next section. Compared with a traditional nearest neighbor interpolation method, the FBU method has a simple calculation process and can speed up a calculation performed by the model. In addition, the FBU method not only expands a feature size, but also reduces an external error of an interpolation up-sampling layer. In addition, compared with traditional non-learning up-sampling methods (such as nearest neighbor up-sampling and bilinear interpolation up-sampling), the FBU method is learnable. The present disclosure adds learnable parameters to enable the FBU method to better magnify a boundary change in an image.
[0066] 6. The present disclosure provides a complete multitask generator training method based on a GAN. This training method can train a multitask generator that can simultaneously generate a predicted image and predict data. This training method improves a training speed of the generator by using a coordination discriminator and a norm joint hybrid loss function, such that the network can be trained more quickly.
[0067] 7. The present disclosure uses the coordination discriminator to compare a difference between a predicted segmented feature map and a truth-value image and a difference between the predicted segmented feature map and an original image, such that the generator can further focus on an overall contour change of the image when learning a feature related to data prediction, so as to make a predicted segmentation feature image further approximate to the truth-value image.
[0068] 8. The present disclosure provides a modular norm joint hybrid loss function for generation of a predicted image and data prediction training, which lowers a requirement for the mathematical ability of a trainer. The loss function is well compatible with training of counting and segmentation tasks, and can adjust weights of the two tasks based on an application scenario to obtain a better training effect.
BRIEF DESCRIPTION OF THE DRAWINGS
[0069] FIG. 1 is a schematic diagram of a lightweight and efficient multi-scale-feature-fusion multitask generator according to the present disclosure; [0070] FIG. 2 is a schematic diagram of an FBU method according to the present disclosure; [0071] FIG. 3 is a schematic diagram of a coordination discriminator according to the present disclosure; [0072] FIG. 4 is a schematic flowchart of iterative network updating in a training process according to the present disclosure; and [0073] FIG. 5 shows a cell microscope image according to an embodiment of the present disclosure.
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0074] In order to make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the following clearly and completely describes the technical solutions in the embodiments of the present disclosure with reference to the accompanying drawings in the embodiments of the present disclosure. Apparently, the described embodiments are some rather than all of the embodiments of the present disclosure. Generally, the components of the embodiments of the present disclosure described and shown in the accompanying drawings may be provided and designed in various manners. Therefore, the detailed description of the embodiments of the present disclosure with reference to the accompanying drawings is not intended to limit the protection scope of the present disclosure, but merely to represent the selected embodiments of the present disclosure. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present disclosure without creative efforts should fall within the protection scope of the present disclosure.
[0075] The following describes the embodiments of the present disclosure in further detail with reference to FIG. 1, FIG. 2, FIG. 3, FIG. 4, and FIG. 5.
[0076] A lightweight and efficient object segmentation and counting method based on a GAN includes the following steps.
[0077] Step 1: Obtain an input image; specifically, process all input images based on a same size, and make truth values of the input images have a same size as a training image, such that the input images one-to-one correspond to the truth values of the images. In the step 1, the size of the input image can be properly reduced to ease the operational burden of a computer.
[0078] Step 2: Send a processed input image in the step 1 to a down-sampling encoder for feature extraction to obtain a deepest-layer feature map.
[0079] The down-sampling encoder includes six down-sampling modules; the first five down-sampling modules have a same structure and each include one convolutional layer with a stride of 2, one instance normalization layer, and one leaky ReLU activation layer, and the last down-sampling module includes one convolutional layer, one random dropout layer, and one leaky ReLU. The input image is successively processed by the six down-sampling modules to obtain a deepest-layer feature map that is reduced by 64 times.
[0080] The present disclosure uses six down-sampling units whose sizes are less than half of a size of a VGG16. This leaves a lot of memory redundancy for further adding a decoder. The down-sampling unit adopts a convolutional layer with a stride of 2, which reduces a size of a feature mapping while extracting a feature, and avoids a feature loss caused by using a pooling layer. [0081] Step 3: Send the deepest-layer feature map to a counting layer to predict a quantity of objects in the whole input image.
[0082] The counting layer includes one global average pooling layer and one convolutional layer.
[0083] The global average pooling layer is used to collapse the deepest-layer feature map into a fixed-size feature map, such that the fixed-size feature map is predicted by a fixed convolutional layer. This enables a network to adapt to input images of different sizes, and improves universality of a model.
[0084] Step 4: Send the deepest-layer feature map to an FBU module to obtain an expanded feature map.
[0085] The FBU module in the step 4 includes one convolutional layer, two flatten layers, and two linear layers.
[0086] After the deepest-layer feature map is sent to the FBU module, a newly added pixel needed for image expansion is generated in an image channel through operation at the convolutional layer.
[0087] Matrix flattening is performed once on a deepest-layer feature map with the newly added pixel, and after a flattened feature map is stretched into a linear vector, a linear mapping matrix reconstruction calculation is performed, in other words, the linear vector is arranged based on an expanded height and a width of the original deepest-layer feature map, such that the newly added pixel is transferred to a height of the original deepest-layer feature map to obtain the expanded feature map.
[0088] During running on the width of the image, the other newly added pixels are reconstructed on the width of the image to increase the size of the image. For an image x whose size is (w, h, c), the following specific steps can be performed to increase the size of the image by d times: [0089] increase the channel quantity c of the image to c × d² by using the convolutional layer; flatten the expanded image into a one-dimensional vector v from the horizontal direction, and then reconstruct the vector v into an image v̂ with a size of (w, h × d, c × d); and flatten v̂ into a one-dimensional vector u from the vertical direction, and then reconstruct the vector u into a new image û with a size of (w × d, h × d, c), such that the image is expanded, where a mathematical expression is as follows:

x̂ = K * x, x̂ ∈ R^{w, h, c×d²}
v = flatten_h(x̂), v̂ = reshape(v) ∈ R^{w, h×d, c×d}
u = flatten_v(v̂), û = reshape(u) ∈ R^{w×d, h×d, c}

[0090] where K represents a convolution kernel, w, h, c respectively represent the width, the height, and the channel quantity of the image, x represents the original image, x̂ represents the image obtained after channel expansion, v represents the vector obtained after the first flattening, v̂ represents the image obtained after the channel in v is transformed to the height of the image, u represents the vector obtained after the second flattening, and û represents the image obtained after the channel in u is transformed to the width of the image.
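The following PyTorch sketch gives one plausible, shape-consistent reading of the FBU operation described above: a convolution expands the channels from c to c·d², a first fold moves one factor of d from the channels into the height, and a second fold moves the remaining factor of d into the width. The exact flattening order and the 3x3 kernel size are assumptions of this sketch; the rearrangement is equivalent in spirit to a learnable depth-to-space (pixel-shuffle) step.

```python
import torch.nn as nn


class FBU(nn.Module):
    """Sketch of a fold beyond-nearest up-sampling step: conv creates the new
    pixels in the channel dimension, then two fold passes move them into the
    height and the width of the feature map, enlarging it by a factor of d."""
    def __init__(self, channels, d=2):
        super().__init__()
        self.d = d
        # c -> c * d^2: generate the pixels needed for a d-fold expansion
        self.conv = nn.Conv2d(channels, channels * d * d, kernel_size=3, padding=1)

    def forward(self, x):
        n, c, h, w = x.shape
        d = self.d
        x = self.conv(x)                                    # (n, c*d*d, h, w)
        # first pass: fold one factor of d from the channels into the height
        x = x.view(n, c * d, d, h, w).permute(0, 1, 3, 2, 4)
        x = x.reshape(n, c * d, h * d, w)                   # (n, c*d, h*d, w)
        # second pass: fold the remaining factor of d into the width
        x = x.view(n, c, d, h * d, w).permute(0, 1, 3, 4, 2)
        x = x.reshape(n, c, h * d, w * d)                   # (n, c, h*d, w*d)
        return x
```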
[0091] Step 5: Perform feature fusion on the deepest-layer feature map and the expanded feature map to obtain a first final feature map, where during feature fusion, a highly abstract feature from a deep network gradually obtains low-level features such as texture and contour, so as to further make the image clearer.
[0092] Step 6: Send the first final feature map to the FBU module in the step 4 as the deepest-layer feature map, repeatedly perform the steps 4 and 5 until a second final feature map meeting a requirement is obtained, and send the second final feature map to a predictor to obtain a predicted segmented feature map.
[0093] The predictor described in the step 6 includes a convolutional layer with a size of 4, a FBU module, and a hyperbolic tangent activation function activation layer, the convolutional layer with a size of 4 performs feature prediction on the second final feature map, the final feature map is expanded by the FBU module to generate the predicted segmented feature map with a same size as the input image described in the step 1, and then the hyperbolic tangent activation function activation layer activates the predicted segmented feature map to speed up training convergence and obtain a trained predicted segmented feature map.
[0094] The present disclosure activates the predicted segmented feature map by using the hyperbolic tangent activation function activation layer to accelerate training convergence, so as to output a predicted segmented feature map with better quality after training.
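A minimal sketch of such a predictor head is given below. The final FBU step is stood in for by a convolution followed by PixelShuffle (a channel-to-space rearrangement); the channel counts and the explicit padding used to keep the spatial size with a 4x4 kernel are assumptions of the sketch.

```python
import torch.nn as nn


class Predictor(nn.Module):
    """Sketch of the predictor: a size-4 convolution for feature prediction,
    one 2x expansion standing in for the FBU module, then a tanh activation."""
    def __init__(self, in_ch=64, out_ch=1, d=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.ZeroPad2d((1, 2, 1, 2)),                       # keep H, W with a 4x4 kernel
            nn.Conv2d(in_ch, out_ch * d * d, kernel_size=4),  # feature prediction
            nn.PixelShuffle(d),                               # expand to the input-image size
            nn.Tanh(),                                        # speeds up training convergence
        )

    def forward(self, second_final_feat):
        return self.head(second_final_feat)  # predicted segmented feature map
```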
[0095] Step 7: Send both the predicted segmented feature map and the truth value of the image to a coordination discriminator, and use the coordination discriminator to perform learning and determine a difference between the predicted segmented feature map and a truth value chart of the image, so as to obtain a generator discrimination matrix.
[0096] Step 8: Generate a verified valid matrix with a same size as the generator discrimination matrix and a value of 1, and a verified fake matrix with a value of 0.
[0097] Step 9: Calculate a total training loss of a generator based on the quantity of objects that is obtained in the step 3, the predicted segmented feature map obtained in the step 6, the generator discrimination matrix obtained in the step 7, and the verified valid matrix generated in the step 8.
[0098] The total training loss of the generator in the step 9 is calculated according to the following steps: [0099] A loss between the predicted segmented feature map and the truth value of the image is calculated by using an L1 loss function, and a loss between the predicted quantity of objects and the truth value of the object quantity is calculated by using an L2 loss function, to obtain a loss of the counting task and a loss of the segmentation task. A specific expression is as follows:

L_pred = (1/R) Σ_{i=1}^{R} (c_gt,i − c_pred,i)²
L_img = (1/R) Σ_{i=1}^{R} |S_gt,i − G_i|
Loss_G = a·L_pred + b·L_img

[0100] where R represents a quantity of images, i represents a serial number of an image, c_gt represents a truth value of the image data, c_pred represents a prediction result generated by the generator, S_gt represents a truth value of the image, G_i represents a predicted segmented feature map of the i-th image, a represents a weight of the loss of the prediction result of the object quantity, and b represents a weight of the loss of the predicted segmented feature map. Since the generator has performed prediction from image feature prediction to object quantity prediction at an early stage, using a data prediction result at the early stage has a greater impact on the feature extraction direction of the generator. In order to balance the weights of the two tasks, a value of a is defaulted to 0.5, and a value of b is defaulted to 100. The weight values can be fine-tuned based on a type and a demand of a task. For example, when an image feature is not obvious, complexity is high, and it is difficult to generate the task, the value of b can be appropriately increased or the value of a can be appropriately decreased.
[0101] Considering an impact of the discriminator on the distribution of task attention of the generator, when the generator is trained, it is necessary to calculate a generated discrimination loss by assuming that the generated image is a completely authentic and reliable segmented image. A difference between pixel values of the image is usually calculated by using an L2 loss function. The loss between the generator discrimination matrix and the verified valid matrix is calculated by using the L2 loss function. A specific expression is as follows:

Loss_dis = (1/R) Σ_{i=1}^{R} (D_g,i − valid_i)²

[0102] where R represents the quantity of images, i represents the serial number of the image, D_g,i represents a generator discrimination matrix of the i-th image, and valid represents the verified valid matrix.
[0103] Weighting and summation are performed on the loss of the segmentation task, the loss of the counting task, and the discrimination loss of the generator to obtain the total training loss of the generator. A specific expression is as follows:

G_loss = Loss_G + Loss_dis = (1/R) Σ_{i=1}^{R} (D_g,i − valid_i)² + a·(1/R) Σ_{i=1}^{R} (c_gt,i − c_pred,i)² + b·(1/R) Σ_{i=1}^{R} |S_gt,i − G_i|

[0104] where R represents the quantity of images, i represents the serial number of the image, D_g,i represents the generator discrimination matrix of the i-th image, valid represents the verified valid matrix, c_gt represents the truth value of the image data, c_pred represents the prediction result generated by the generator, S_gt represents the truth value of the image, G_i represents the predicted segmented feature map of the i-th image, a represents the weight of the loss of the prediction result, and b represents the weight of the loss of the predicted segmented feature map.
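The joint hybrid loss above can be sketched in PyTorch as follows. The reduction over images and pixels is assumed to be a mean (the 1/R average), and the default weights a = 0.5 and b = 100 are the values stated in the description.

```python
import torch
import torch.nn.functional as F


def generator_total_loss(d_gen, count_pred, count_gt, seg_pred, seg_gt,
                         a=0.5, b=100.0):
    """Total generator loss: L2 adversarial term + a * L2 counting term
    + b * L1 segmentation term, following the expressions above."""
    valid = torch.ones_like(d_gen)               # verified valid matrix of step 8
    loss_dis = F.mse_loss(d_gen, valid)          # (D_g - valid)^2, averaged
    loss_pred = F.mse_loss(count_pred, count_gt) # (c_gt - c_pred)^2, averaged
    loss_img = F.l1_loss(seg_pred, seg_gt)       # |S_gt - G|, averaged
    return loss_dis + a * loss_pred + b * loss_img
```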
[0105] By using the L1 and L2 loss functions, the present disclosure can adjust the weights of counting and segmentation based on a specific application environment to improve the training effect.
[0106] Step 10: Send the total training loss of the generator back to the generator network for network iterative learning, and perform a round of training on the generator to obtain a trained generator.
[0107] Step 11: Fuse the processed input image in the step 1 and the predicted segmented feature map obtained in the step 6 on the image channel to obtain a fused feature map, and send the fused feature map to the coordination discriminator.
[0108] Specifically, feature fusion is performed on an input image x_i and a predicted segmented feature map G_i on the image channel to obtain an input feature map D_i of the discrimination matrix:

D_i = [x_i, G_i] ∈ R^{w, h, [c1+c2]}

[0109] where R^{w, h} represents that the widths and heights of the input image x_i and the predicted segmented feature map G_i remain unchanged, and [c1+c2] represents that the channel dimensions of the input image x_i and the predicted segmented feature map G_i are added up.
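A one-line sketch of this channel-wise fusion, with illustrative tensor shapes (a 3-channel input image and a 1-channel predicted segmented feature map are assumptions of the example):

```python
import torch

x_i = torch.randn(1, 3, 960, 960)   # processed input image
g_i = torch.randn(1, 1, 960, 960)   # predicted segmented feature map
d_i = torch.cat([x_i, g_i], dim=1)  # (1, 4, 960, 960): widths/heights unchanged, channels added up
```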
[0110] Step 12: Process, by four feature extraction convolution modules in the coordination discriminator, the fused feature map to obtain a first deep-layer discriminant feature map. [0111] Each of the feature extraction convolution modules is composed of one convolutional layer and a related complex functional layer. Perception domains of these four feature extraction modules gradually shrink. At an early stage, a feature can be quickly extracted to obtain a wider range of structural relevance, and a size of the feature map can be rapidly reduced to reduce calculations. Subsequently, feature extraction is gradually refined to obtain a more accurate eigenvalue.
[0112] Step 12.1: Supplement a blank pixel with a size of 4 around the fused feature map, so as to avoid a feature loss caused by an odd side length in a subsequent fused feature map. [0113] Step 12.2: Send a processed fused feature map in the step 12.1 to one convolutional layer with a size of 8*8 and a stride of 2 for feature extraction and fusion with a large degree of perception.
[0114] Step 12.3: Supplement a blank pixel with a size of 3 around a processed fused feature map in the step 12.2, so as to avoid a feature loss caused by an odd side length in a subsequent fused feature map.
[0115] Step 12.4: Send a processed fused feature map in the step 12.3 to one convolutional layer with a size of 6*6 and a stride of 2 for second feature extraction.
[0116] Step 12.5: Supplement a blank pixel with a size of 2 around a processed fused feature map in the step 12.4, so as to avoid a feature loss caused by an odd side length in a subsequent fused feature map.
[0117] Step 12.6: Send a processed fused feature map in the step 12.5 to two consecutive convolutional layers with a size of 4*4 and a stride of 2 for third feature extraction to obtain the first deep-layer discriminant feature map.
[0118] After the calculation by the convolutional layer, instance normalization and leaky ReLU activation need to be separately performed on the calculation result of the convolutional layer to prevent a gradient loss or gradient explosion during training.
[0119] Step 13: Input the first deep-layer discriminant feature map obtained in the step 12 into a structural feature determining layer to obtain a predicted fake discrimination matrix containing a structural difference.
[0120] In the step 13, the first deep-layer discriminant feature map obtained in the step 12 is sent to one convolutional layer with a size of 3*3 and a stride of 1 for structural feature determining to output the predicted fake discrimination matrix containing the structural difference.
[0121] After the calculation by the convolutional layer, instance normalization and leaky ReLU activation need to be separately performed on the calculation result of the convolutional layer to prevent a gradient loss or gradient explosion during training.
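The four feature-extraction convolution modules of the step 12 and the structural feature determining layer of the step 13 can be sketched as follows. Channel widths are assumptions, the padding of the step 12.5 is applied only once before the two 4x4 convolutions as described, and each convolution is followed by instance normalization and leaky ReLU as noted above.

```python
import torch.nn as nn


class CoordinationDiscriminator(nn.Module):
    """Sketch: four feature-extraction convolutions with shrinking kernels
    (8x8, 6x6, 4x4, 4x4, all stride 2), then a 3x3 stride-1 structural
    feature determining layer that outputs the discrimination matrix."""
    def __init__(self, in_ch=4, widths=(64, 128, 256, 512)):
        super().__init__()

        def norm_act(ch):
            return nn.Sequential(nn.InstanceNorm2d(ch), nn.LeakyReLU(0.2, inplace=True))

        self.features = nn.Sequential(
            nn.ZeroPad2d(4), nn.Conv2d(in_ch, widths[0], 8, stride=2), norm_act(widths[0]),      # steps 12.1-12.2
            nn.ZeroPad2d(3), nn.Conv2d(widths[0], widths[1], 6, stride=2), norm_act(widths[1]),  # steps 12.3-12.4
            nn.ZeroPad2d(2), nn.Conv2d(widths[1], widths[2], 4, stride=2), norm_act(widths[2]),  # steps 12.5-12.6
            nn.Conv2d(widths[2], widths[3], 4, stride=2), norm_act(widths[3]),
        )
        # Step 13: structural feature determining layer.
        self.structural = nn.Conv2d(widths[3], 1, kernel_size=3, stride=1, padding=1)

    def forward(self, fused):
        return self.structural(self.features(fused))  # predicted discrimination matrix
```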
[0122] Step 14: Fuse the processed input image in the step 1 and a truth-value image on the image channel, send a fused image to the coordination discriminator for processing by the four feature extraction convolution modules, to obtain a second deep-layer discriminant feature map, and then input the second deep-layer discriminant feature map into the structural feature determining layer composed of one convolutional layer to obtain a predicted real discrimination matrix.
[0123] Step 15: Calculate a total training loss of the coordination discriminator based on the verified valid matrix and the verified fake matrix that are obtained in the step 8, the predicted fake discrimination matrix obtained in the step 13, and the predicted real discrimination matrix obtained in the step 14.
[0124] Specifically: [0125] A loss between the predicted real discrimination matrix and the verified valid matrix and a loss between the predicted real discrimination matrix and the verified fake matrix are calculated by using the L2 loss function, and are summed to obtain a predicted real discrimination loss. A specific expression is as follows:

Loss_real = (1/R) Σ_{i=1}^{R} (D_real,i − valid_i)² + (1/R) Σ_{i=1}^{R} (D_real,i − fake_i)²

[0126] where R represents the quantity of images, i represents the serial number of the image, D_real,i represents the predicted real discrimination matrix of the i-th image, valid represents the verified valid matrix, and fake represents the verified fake matrix.
[0127] A loss between the predicted fake discrimination matrix and the verified valid matrix and a loss between the predicted fake discrimination matrix and the verified fake matrix are calculated by using the L2 loss function, and are summed to obtain a predicted fake discrimination loss. A specific expression is as follows:

Loss_fake = (1/R) Σ_{i=1}^{R} (D_fake,i − valid_i)² + (1/R) Σ_{i=1}^{R} (D_fake,i − fake_i)²

[0128] where R represents the quantity of images, i represents the serial number of the image, D_fake,i represents the predicted fake discrimination matrix of the i-th image, valid represents the verified valid matrix, and fake represents the verified fake matrix.
[0129] In order to make the coordination discriminator capable of determining the quality of the predicted segmented feature map, the coordination discriminator can neither extract only a feature of the truth value nor only a feature of the predicted segmented feature map, but needs to take both into account, so as to help determine the quality of the predicted segmented feature map during generator training and help correct a learning attention deviation of the generator during data prediction. Therefore, an average value of the predicted real discrimination loss and the predicted fake discrimination loss is calculated and taken as the total training loss of the coordination discriminator. A specific expression is as follows:

D_loss = avg(Loss_real + Loss_fake)

[0130] where Loss_real represents the sum of the L2 loss between the predicted real discrimination matrix and the verified valid matrix and the L2 loss between the predicted real discrimination matrix and the verified fake matrix, and Loss_fake represents the sum of the L2 loss between the predicted fake discrimination matrix and the verified valid matrix and the L2 loss between the predicted fake discrimination matrix and the verified fake matrix.
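A sketch of this total discriminator loss in PyTorch, mirroring the expressions above; the mean squared error is assumed to stand in for the 1/R-averaged L2 terms, and the valid and fake matrices are the all-ones and all-zeros matrices of the step 8.

```python
import torch
import torch.nn.functional as F


def discriminator_total_loss(d_real, d_fake):
    """Each predicted discrimination matrix is compared (L2) against both the
    verified valid matrix and the verified fake matrix; the two sums are averaged."""
    valid = torch.ones_like(d_real)   # verified valid matrix (value 1)
    fake = torch.zeros_like(d_real)   # verified fake matrix (value 0)
    loss_real = F.mse_loss(d_real, valid) + F.mse_loss(d_real, fake)
    loss_fake = F.mse_loss(d_fake, valid) + F.mse_loss(d_fake, fake)
    return 0.5 * (loss_real + loss_fake)
```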
[0131] Step 16: Send the total training loss obtained in the step 15 back to the network for iterative network updating and learning, perform a round of training on the coordination discriminator to obtain a trained coordination discriminator, and save the generator obtained in the step 10 and the coordination discriminator obtained in this step.
[0132] Step 17: Repeat the steps 2 to 16 until a generator and a coordination discriminator that meet a predetermined condition are obtained. [0133] Based on the above technical solutions, the present disclosure constructs a lightweight and efficient multi-scale-feature-fusion multitask generator. The generator can be trained by using a dataset containing only a label indicating a total object quantity and a label indicating an object region, such that the total object quantity and the object region segmentation can be predicted simultaneously. This overcomes a limitation that only a density map dataset can be used in a density map method, and a defect that a simple regression method lacks object position information.
[0134] The method in the present disclosure uses a model framework of a codec. Although multi-scale feature fusion based on a segmentation model of the codec can well generate a segmented image with a good low-level feature, this structure is complex, and the model takes up a large amount of memory, which cannot meet a requirement of the present disclosure for a light weight. However, without an encoder-decoder architecture, performance of this type of model cannot meet a requirement of the present disclosure. The present disclosure discovers that a VGG16-based feature extraction network has a good feature extraction capability, and there is a lot of redundancy in the network structure.
[0135] Therefore, in the encoder, the present disclosure optimizes a structural design of the model. The present disclosure uses the six down-sampling units whose sizes are less than half of the size of the VGG16. This leaves a lot of memory redundancy for further adding the decoder. The down-sampling unit adopts the convolutional layer with a stride of 2, which reduces the size of the feature mapping while extracting the feature, and avoids the feature loss caused by using the pooling layer.
[0136] In the decoder, the present disclosure enlarges the image by 32 times through five FBU blocks in the steps 4 and 6, and each FBU block enlarges the image by a factor of two. Each FBU block contains one FBU layer, one instance normalization layer, and one leaky ReLU activation layer. However, using the up-sampling method alone easily blurs the generated image, and a deep feature map loses many low-level features related to the image contour and lines because of its highly abstract features. Based on the successful experience of the U-Net, the present disclosure alleviates this problem through multi-scale feature fusion. After the FBU block enlarges a deep feature mapping, the enlarged deep feature mapping is fused with a feature mapping of the same size in the encoder. By fusing the feature map in down-sampling level by level, the present disclosure gradually recovers the low-level features in the feature map, making the texture and the contour of the image more accurate.
[0137] The present disclosure designs another independent output layer to predict a quantity of cells. Based on an idea of a fully convolutional network, the present disclosure uses a 1x1 convolutional layer. In addition, the present disclosure establishes the global average pooling layer, such that the network overcomes a defect that a traditional network can only use a fixed-size input image when using a fully connected layer, and can accommodate input images of different sizes. This direct prediction method overcomes a limitation of dataset counting based on a point label, such that the network in the present disclosure can be trained using a dataset with only a label indicating the total object quantity.
[0138] In order to enable those skilled in the art to better understand the present disclosure, the following describes a specific use process of this embodiment in combination with FIG. 1 and FIG. 5.
[0139] Cell counting and segmentation in a cell microscope image: [0140] Step 1: Perform image preprocessing.
[0141] Training images are processed based on a same size. A size of the image can be appropriately reduced to ease operational burden of a computer. Likewise, a truth value of the image is processed to have a same size as the training image, such that input images one-to-one correspond to truth values of the images. In order to simplify operation, the present disclosure processes an image of a dataset based on a size of 960x960.
[0142] Step 2: Input the training image into an encoder for feature extraction.
[0143] Step 3: In the encoder, process the image successively by using six down-sampling modules to obtain a deepest-layer feature map that is reduced by 64 times.
[0144] Step 4: Send the deepest-layer feature map to a counting layer to predict a quantity of cells in the whole input image.
[0145] Step 5: Send the deepest-layer feature map to an FBU module.
[0146] Step 6: Perform feature fusion on a feature map expanded by the up-sampling module and the feature map with a same size generated in a down-sampling process.
[0147] Step 7: Send the final feature map generated after processing by the five FBU modules and feature fusion to a predictor. In the predictor, one convolutional layer performs feature prediction on the final feature map; then one FBU module expands the final feature map to generate a predicted segmented feature map with the same size as the original image; and finally, a predicted cell segmentation image is output by a hyperbolic tangent activation function activation layer.
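Putting the pieces together, steps 2 to 7 can be sketched end to end as follows, reusing the Encoder, FBUBlock, fuse, and CountingHead modules from the earlier sketches; the channel plan and kernel sizes are illustrative assumptions rather than the disclosed configuration.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """End-to-end sketch of steps 2 to 7, reusing Encoder, FBUBlock, fuse, and CountingHead
    from the earlier sketches. The channel plan below is an illustrative assumption."""
    def __init__(self):
        super().__init__()
        widths = (32, 64, 128, 256, 256, 512)            # assumed encoder channel widths
        self.encoder = Encoder(in_ch=3, widths=widths)
        self.counter = CountingHead(in_ch=widths[-1])
        # Five decoder FBU blocks; each output is fused (concatenated) with the encoder
        # feature map of the same spatial size, which widens the next block's input.
        fbu_in = (512, 512, 512, 256, 128)
        fbu_out = (256, 256, 128, 64, 32)
        self.fbu_blocks = nn.ModuleList(FBUBlock(i, o) for i, o in zip(fbu_in, fbu_out))
        # Predictor: one convolution, one final FBU block back to full resolution, and tanh.
        self.pred_conv = nn.Conv2d(64, 16, kernel_size=3, padding=1)
        self.pred_fbu = FBUBlock(16, 1)
        self.tanh = nn.Tanh()

    def forward(self, image):
        deepest, skips = self.encoder(image)               # steps 2-3: 1/64-size feature map
        count = self.counter(deepest)                      # step 4: predicted cell quantity
        x = deepest
        for fbu, skip in zip(self.fbu_blocks, reversed(skips[:-1])):
            x = fuse(fbu(x), skip)                         # steps 5-6: enlarge, then fuse
        x = self.pred_conv(x)                              # step 7: feature prediction
        segmentation = self.tanh(self.pred_fbu(x))         # expand to input size, activate
        return segmentation, count

# Quick shape check for the 960x960 embodiment:
# seg, cnt = Generator()(torch.randn(1, 3, 960, 960))  ->  seg: (1, 1, 960, 960), cnt: (1, 1)
```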
[0148] The above embodiments merely illustrate specific implementations of the present disclosure, and the description thereof is more specific and detailed, but is not to be construed as a limitation to the patentable scope of the present disclosure. It should be pointed out that those of ordinary skill in the art can further make variations and improvements without departing from the conception of technical solutions in the present disclosure. These variations and improvements all fall within the protection scope of the present disclosure.

Claims (10)

WHAT IS CLAIMED IS:
1. A lightweight and efficient object segmentation and counting method based on a generative adversarial network (GAN), comprising the following steps:
step 1: obtaining an input image: processing all input images based on a same size, and making truth values of the input images have a same size as a training image, such that the input images one-to-one correspond to the truth values of the images;
step 2: sending a processed input image in the step 1 to a down-sampling encoder for feature extraction to obtain a deepest-layer feature map;
step 3: sending the deepest-layer feature map to a counting layer to predict a quantity of objects in the whole input image;
step 4: sending the deepest-layer feature map to a fold beyond-nearest up-sampling (FBU) module to obtain an expanded feature map;
step 5: performing feature fusion on the deepest-layer feature map and the expanded feature map to obtain a first final feature map;
step 6: sending the first final feature map to the FBU module in the step 4 as the deepest-layer feature map, repeatedly performing the steps 4 and 5 until a second final feature map meeting a requirement is obtained, and sending the second final feature map to a predictor to obtain a predicted segmented feature map;
step 7: sending both the predicted segmented feature map and the truth value of the image to a coordination discriminator, and using the coordination discriminator to perform learning and determine a difference between the predicted segmented feature map and the truth value of the image, so as to obtain a generator discrimination matrix;
step 8: generating a verified valid matrix with a same size as the generator discrimination matrix and a value of 1, and a verified fake matrix with a value of 0;
step 9: calculating a total training loss of a generator based on a truth value of the quantity of objects in a dataset, the truth value of the input image that is obtained in the step 1, the quantity of objects that is obtained in the step 3, the predicted segmented feature map obtained in the step 6, the generator discrimination matrix obtained in the step 7, and the verified valid matrix generated in the step 8;
step 10: sending the total training loss of the generator back to the generator network for iterative network updating and learning, and performing a round of training on the generator to obtain a trained generator;
step 11: fusing the processed input image in the step 1 and the predicted segmented feature map obtained in the step 6 on an image channel to obtain a fused feature map, and sending the fused feature map to the coordination discriminator;
step 12: processing, by four feature extraction convolution modules in the coordination discriminator, the fused feature map to obtain a first deep-layer discriminant feature map;
step 13: inputting the first deep-layer discriminant feature map obtained in the step 12 into a structural feature determining layer composed of one convolutional layer to obtain a predicted fake discrimination matrix containing a structural difference;
step 14: fusing the processed input image in the step 1 and a truth-value image on the image channel, sending a fused image to the coordination discriminator for processing by the four feature extraction convolution modules, to obtain a second deep-layer discriminant feature map, and then inputting the second deep-layer discriminant feature map into the structural feature determining layer composed of one convolutional layer to obtain a predicted real discrimination matrix;
step 15: calculating a total training loss of the coordination discriminator based on the verified valid matrix and the verified fake matrix that are obtained in the step 8, the predicted fake discrimination matrix obtained in the step 13, and the predicted real discrimination matrix obtained in the step 14;
step 16: sending the total training loss obtained in the step 15 back to the network for iterative network updating and learning, performing a round of training on the coordination discriminator to obtain a trained coordination discriminator, and saving the generator obtained in the step 10 and the coordination discriminator obtained in this step; and
step 17: repeating the steps 2 to 16 until a generator and a coordination discriminator that meet a predetermined condition are obtained.
2. The lightweight and efficient object segmentation and counting method based on a GAN according to claim 1, wherein the down-sampling encoder described in the step 2 comprises six down-sampling modules, the first five down-sampling modules have a same structure and each comprise one convolutional layer with a stride of 2, one instance normalization layer, and one linear rectifier function with leak (hereinafter referred to as leaky ReLU) activation layer, and the last down-sampling module comprises one convolutional layer, one random dropout layer, and one leaky ReLU activation layer.
3. The lightweight and efficient object segmentation and counting method based on a GAN according to claim 1, wherein the counting layer in the step 3 comprises one global average pooling layer and one convolutional layer.
4. The lightweight and efficient object segmentation and counting method based on a GAN according to claim 1, wherein the FBU module in the step 4 comprises one convolutional layer, two flatten layers, and two linear layers; after the deepest-layer feature map is sent to the FBU module, newly added pixels needed for image expansion are generated in the image channel through operation at the convolutional layer; matrix flattening is performed once on the deepest-layer feature map with the newly added pixels, and after the flattened feature map is stretched into a linear vector, a linear mapping matrix reconstruction calculation is performed, in other words, the linear vector is arranged based on an expanded height and the width of the original deepest-layer feature map, such that the newly added pixels are transferred to the height of the original deepest-layer feature map to obtain the expanded feature map.
5. The lightweight and efficient object segmentation and counting method based on a GAN according to claim 1, wherein the predictor described in the step 6 comprises a convolutional layer with a size of 4, an FBU module, and a hyperbolic tangent activation function activation layer; the convolutional layer with a size of 4 performs feature prediction on the second final feature map, the second final feature map is expanded by the FBU module to generate the predicted segmented feature map with a same size as the input image described in the step 1, and then the hyperbolic tangent activation function activation layer activates the predicted segmented feature map to speed up training convergence and obtain a trained predicted segmented feature map.
6. The lightweight and efficient object segmentation and counting method based on a GAN according to claim 1, wherein the total training loss of the generator in the step 9 is calculated according to the following steps:
calculating a loss between the predicted segmented feature map and the truth value of the image by using an L1 loss function, to obtain a loss of a segmentation task;
calculating a loss between the quantity of objects in the step 3 and the truth value of the quantity of objects by using an L2 loss function, to obtain a loss of a counting task;
calculating a loss between the generator discrimination matrix and the verified valid matrix by using the L2 loss function, to obtain a discrimination loss of the generator; and
performing weighting and summation on the loss of the segmentation task, the loss of the counting task, and the discrimination loss of the generator to obtain the total training loss of the generator.
7. The lightweight and efficient object segmentation and counting method based on a GAN according to claim 1, wherein the step 12 comprises the following steps:
step 12.1: supplementing a blank pixel with a size of 4 around the fused feature map;
step 12.2: sending a processed fused feature map in the step 12.1 to one convolutional layer with a size of 8*8 and a stride of 2 for feature extraction and fusion with a large degree of perception;
step 12.3: supplementing a blank pixel with a size of 3 around a processed fused feature map in the step 12.2;
step 12.4: sending a processed fused feature map in the step 12.3 to one convolutional layer with a size of 6*6 and a stride of 2 for second feature extraction;
step 12.5: supplementing a blank pixel with a size of 2 around a processed fused feature map in the step 12.4; and
step 12.6: sending a processed fused feature map in the step 12.5 to two consecutive convolutional layers with a size of 4*4 and a stride of 2 for third feature extraction to obtain the first deep-layer discriminant feature map.
8. The lightweight and efficient object segmentation and counting method based on a GAN according to claim 1, wherein in the step 13, the first deep-layer discriminant feature map obtained in the step 12 is sent to one convolutional layer with a size of 3*3 and a stride of 1 for structural feature determining to output the predicted fake discrimination matrix containing the structural difference.
9. The lightweight and efficient object segmentation and counting method based on a GAN according to claim 7 or 8, wherein after the calculation by the convolutional layer, instance normalization and leaky ReLU activation need to be separately performed on a calculation result of the convolutional layer.
10. The lightweight and efficient object segmentation and counting method based on a GAN according to claim 1, wherein the step 15 comprises the following steps:
calculating a loss between the predicted real discrimination matrix and the verified valid matrix and a loss between the predicted real discrimination matrix and the verified fake matrix by using an L2 loss function, and performing summation to obtain a predicted real discrimination loss;
calculating a loss between the predicted fake discrimination matrix and the verified valid matrix and a loss between the predicted fake discrimination matrix and the verified fake matrix by using the L2 loss function, and performing summation to obtain a predicted fake discrimination loss; and
calculating an average value of the predicted real discrimination loss and the predicted fake discrimination loss, and taking the average value as the total training loss of the coordination discriminator.
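As a non-limiting illustration of the loss calculations recited in claims 6 and 10 above, the following PyTorch sketch may help. The loss weights are assumptions (the disclosure states only that a weighted summation is used), and the discriminator loss is written in the conventional least-squares GAN pairing (predicted real versus valid, predicted fake versus fake) rather than the exact combination of matrix pairs recited in claim 10.

```python
import torch
import torch.nn.functional as F

# Loss weights for the generator's multi-task objective (illustrative assumptions only).
W_SEG, W_COUNT, W_ADV = 100.0, 1.0, 1.0

def generator_loss(pred_seg, true_seg, pred_count, true_count, g_disc_matrix, valid):
    """Total generator training loss per claim 6: L1 segmentation loss + L2 counting loss
    + L2 adversarial loss against the verified valid matrix, weighted and summed."""
    seg_loss = F.l1_loss(pred_seg, true_seg)
    count_loss = F.mse_loss(pred_count, true_count)
    adv_loss = F.mse_loss(g_disc_matrix, valid)
    return W_SEG * seg_loss + W_COUNT * count_loss + W_ADV * adv_loss

def discriminator_loss(pred_real, pred_fake, valid, fake):
    """Coordination-discriminator loss, here in the conventional least-squares GAN form
    (pred_real vs. valid, pred_fake vs. fake, averaged); claim 10 recites a slightly
    different combination of the four matrix pairs."""
    real_loss = F.mse_loss(pred_real, valid)
    fake_loss = F.mse_loss(pred_fake, fake)
    return 0.5 * (real_loss + fake_loss)
```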
GB2301554.8A 2022-05-18 2023-02-03 Lightweight and efficient object segmentation and counting method based on generative adversarial network (GAN) Active GB2618876B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210538605.0A CN114648724B (en) 2022-05-18 2022-05-18 Lightweight efficient target segmentation and counting method based on generation countermeasure network

Publications (2)

Publication Number Publication Date
GB2618876A true GB2618876A (en) 2023-11-22
GB2618876B GB2618876B (en) 2024-06-12

Family

ID=81997383

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2301554.8A Active GB2618876B (en) 2022-05-18 2023-02-03 Lightweight and efficient object segmentation and counting method based on generative adversarial network (GAN)

Country Status (2)

Country Link
CN (1) CN114648724B (en)
GB (1) GB2618876B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117854009A (en) * 2024-01-29 2024-04-09 南通大学 Cross-collaboration fusion light-weight cross-modal crowd counting method
CN117893413A (en) * 2024-03-15 2024-04-16 博创联动科技股份有限公司 Vehicle-mounted terminal man-machine interaction method based on image enhancement

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117408893B (en) * 2023-12-15 2024-04-05 青岛科技大学 Underwater image enhancement method based on shallow neural network

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107563274A (en) * 2017-07-10 2018-01-09 安徽四创电子股份有限公司 A kind of vehicle checking method and method of counting of the video based on confrontation e-learning
CN109543740B (en) * 2018-11-14 2022-07-15 哈尔滨工程大学 Target detection method based on generation countermeasure network
CN111191667B (en) * 2018-11-15 2023-08-18 天津大学青岛海洋技术研究院 Crowd counting method based on multiscale generation countermeasure network
CN110807762B (en) * 2019-09-19 2021-07-06 温州大学 Intelligent retinal blood vessel image segmentation method based on GAN
CN111144243B (en) * 2019-12-13 2022-07-08 江苏艾佳家居用品有限公司 Household pattern recognition method and device based on counterstudy
CN111402118B (en) * 2020-03-17 2023-03-24 腾讯科技(深圳)有限公司 Image replacement method and device, computer equipment and storage medium
KR102286455B1 (en) * 2020-03-31 2021-08-04 숭실대학교산학협력단 Method for generating fake iris using artificial intelligence, recording medium and device for performing the method
CN111583109B (en) * 2020-04-23 2024-02-13 华南理工大学 Image super-resolution method based on generation of countermeasure network
CN111723693B (en) * 2020-06-03 2022-05-27 云南大学 Crowd counting method based on small sample learning
CN111754446A (en) * 2020-06-22 2020-10-09 怀光智能科技(武汉)有限公司 Image fusion method, system and storage medium based on generation countermeasure network
CN111738230B (en) * 2020-08-05 2020-12-15 深圳市优必选科技股份有限公司 Face recognition method, face recognition device and electronic equipment
CN112184654A (en) * 2020-09-24 2021-01-05 上海电力大学 High-voltage line insulator defect detection method based on generation countermeasure network
CN112597941B (en) * 2020-12-29 2023-01-06 北京邮电大学 Face recognition method and device and electronic equipment
CN112862792B (en) * 2021-02-21 2024-04-05 北京工业大学 Wheat powdery mildew spore segmentation method for small sample image dataset
CN113077471B (en) * 2021-03-26 2022-10-14 南京邮电大学 Medical image segmentation method based on U-shaped network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Yao Hai-Yan, Wan Wang-Gen, Li Xiang, "Mask Guided GAN for Density Estimation and Crowd Counting", IEEE Access, vol. 8, 11 February 2020, pp. 31432-31443 *
Wang Y, Ma L, Jiu M, Jiang H, "Using GAN for segmentation and counting of conductive particles in thin film transistor LCD images", IEEE Access, vol. 8, 26 May 2020, pp. 101338-101350 *
Deng L, Wang S-H, Zhang Y-D, "ELMGAN: A GAN-based efficient lightweight multi-scale-feature-fusion multi-task model", Knowledge-Based Systems, vol. 252, 2022, pp. 1-12 *

Also Published As

Publication number Publication date
CN114648724A (en) 2022-06-21
CN114648724B (en) 2022-08-12
GB2618876B (en) 2024-06-12

Similar Documents

Publication Publication Date Title
CN111652321B (en) Marine ship detection method based on improved YOLOV3 algorithm
GB2618876A (en) Lightweight and efficient object segmentation and counting method based on generative adversarial network (GAN)
CN110738697A (en) Monocular depth estimation method based on deep learning
CN112507777A (en) Optical remote sensing image ship detection and segmentation method based on deep learning
CN116665176B (en) Multi-task network road target detection method for vehicle automatic driving
CN110569851B (en) Real-time semantic segmentation method for gated multi-layer fusion
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
CN113313180B (en) Remote sensing image semantic segmentation method based on deep confrontation learning
CN114463492B (en) Self-adaptive channel attention three-dimensional reconstruction method based on deep learning
CN113888505B (en) Natural scene text detection method based on semantic segmentation
CN112686233B (en) Lane line identification method and device based on lightweight edge calculation
CN116740527A (en) Remote sensing image change detection method combining U-shaped network and self-attention mechanism
CN115482518A (en) Extensible multitask visual perception method for traffic scene
CN115035371A (en) Borehole wall crack identification method based on multi-scale feature fusion neural network
CN116580184A (en) YOLOv 7-based lightweight model
CN113836319B (en) Knowledge completion method and system for fusion entity neighbors
CN115272670A (en) SAR image ship instance segmentation method based on mask attention interaction
Sarker et al. A comprehensive overview of deep learning techniques for 3D point cloud classification and semantic segmentation
CN114494284B (en) Scene analysis model and method based on explicit supervision area relation
CN114972851A (en) Remote sensing image-based ship target intelligent detection method
CN113065547A (en) Character supervision information-based weak supervision text detection method
CN115909045B (en) Two-stage landslide map feature intelligent recognition method based on contrast learning
CN118429389B (en) Target tracking method and system based on multiscale aggregation attention feature extraction network
CN117475155B (en) Lightweight remote sensing image segmentation method based on semi-supervised learning
Wan et al. Siamese Attentive Convolutional Network for Effective Remote Sensing Image Change Detection