CN117058369A - Weakly supervised pig image instance segmentation method, device and equipment based on BoxInst - Google Patents

Weakly supervised pig image instance segmentation method, device and equipment based on BoxInst

Info

Publication number
CN117058369A
CN117058369A (application CN202310845480.0A)
Authority
CN
China
Prior art keywords
box
pig
inst
image
weak supervision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310845480.0A
Other languages
Chinese (zh)
Inventor
王海燕
江烨皓
赵书红
李新云
刘小磊
马云龙
付玉华
杜小勇
黎煊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong Agricultural University
Original Assignee
Huazhong Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong Agricultural University filed Critical Huazhong Agricultural University
Priority to CN202310845480.0A priority Critical patent/CN117058369A/en
Publication of CN117058369A publication Critical patent/CN117058369A/en
Pending legal-status Critical Current

Links

Classifications

    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 2201/07 Target detection
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/048 Activation functions
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06N 3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a BoxInst-based weakly supervised pig image instance segmentation method that uses a trained BoxInst segmentation model to segment pig instances in images. Training the BoxInst segmentation model further comprises: constructing a weakly supervised data set; and training on the weakly supervised data set with the built BoxInst segmentation model to obtain a trained BoxInst segmentation model. The built BoxInst segmentation model consists of a CotSENet backbone extraction network, an FPN layer, a Controller Head layer and a Mask Branch layer. The feature maps extracted by the CotSENet backbone extraction network are passed to the FPN layer for multi-scale fusion, target detection is performed by the Controller Head layer, final pixel prediction is performed by the Mask Branch layer, and gradient back-propagation is driven by the Projection Loss and the Pairwise Affinity Loss, finally realizing bounding-box-based weakly supervised instance segmentation. The method is used to segment pigs in pig images, reduces the cost of producing the data set, and can separate pigs from the image.

Description

Weakly supervised pig image instance segmentation method, device and equipment based on BoxInst
Technical Field
The application belongs to the field of artificial intelligence, and particularly relates to a BoxInst-based weakly supervised pig image instance segmentation method, device and equipment.
Background
Instance segmentation is an image segmentation technique that assigns each pixel in an image to a particular object instance. Unlike semantic segmentation, instance segmentation gives different labels to different object instances of the same class, whereas semantic segmentation assigns a class to every pixel but does not distinguish between objects of the same class. Instance segmentation combines target detection and pixel-wise semantic classification and can segment a specific target in an image. In practical applications, pigs can be extracted from images by training an instance segmentation model, which has important research significance for tasks such as pig counting, behavior recognition, and body weight and body size measurement.
The attention mechanism is a signal-processing mechanism that imitates human vision and can help a machine learning model automatically learn and compute the contribution of input data to output data. Its essence is to locate the information of interest and suppress useless information, and the result is usually presented as a probability map or a probability feature vector. In computer vision research, the attention mechanism learns a weight distribution and then applies it to the original features to obtain more detailed information about the target of interest while suppressing other useless information.
In the field of image segmentation based on deep learning, a large-scale, finely labeled data set is usually required to train a segmentation model so as to ensure the accuracy and generalization of the final model. However, the field of pig segmentation still lacks large, authoritative, well-annotated data sets for research use, mainly because producing such a data set requires a great deal of manual labor. To cope with this, most researchers currently collect and manually label pig images and build small data sets for a single downstream task, so the trained models generalize poorly and remain far from the application level.
Disclosure of Invention
In order to overcome the defects in the prior art, the application provides a BoxInst-based weakly supervised pig image instance segmentation method, device and equipment, which are used to segment pigs in pig images and can separate pigs from the image while reducing the cost of producing a data set.
According to an aspect of the present application, a BoxInst-based weakly supervised pig image instance segmentation method is provided, which uses a trained BoxInst segmentation model to segment pig image instances, wherein training of the BoxInst segmentation model further includes:
constructing a weakly supervised data set;
training on the weakly supervised data set with the built BoxInst segmentation model to obtain the trained BoxInst segmentation model; the established BoxInst segmentation model consists of a CotSENet backbone extraction network, an FPN layer, a Controller Head layer and a Mask Branch layer, wherein the feature maps extracted by the CotSENet backbone extraction network are passed to the FPN layer for multi-scale fusion, target detection is performed by the Controller Head layer, final pixel-level prediction is performed by the Mask Branch layer, gradient back-propagation of the parameters in the segmentation model is driven by the Projection Loss (mask horizontal and vertical projection loss function) and the Pairwise Affinity Loss (pixel color similarity loss function), and finally bounding-box-based weakly supervised instance segmentation is realized.
As a further technical solution, constructing the weakly supervised data set further comprises:
1.1) collecting pig-farm surveillance video data and screening the images obtained by frame extraction to construct a pig image segmentation data set;
1.2) performing weakly supervised labeling on the data set constructed in step 1.1): labeling all pigs in the same pig house in an image with rough outline annotations, giving each pig a different number, storing the json files produced by labeling and the pictures in the same folder with the same naming scheme; randomly selecting 50 images from the test set and labeling them at pixel level for verifying the model training result;
1.3) after the labeling in step 1.2) is completed, converting all labeled json files and the original images into the COCO data set format, and dividing the roughly labeled data set into a training set, a validation set and a test set in the ratio 8:1:1, forming three json files, train.json, test.json and valid.json.
As a further technical solution, the method further includes: verifying the effectiveness of the trained model with the pixel-level labeled images in the data set, and measuring the performance of the trained model weights on the test set with the Semg_mAP index.
As a further technical solution, the rough outline labeling is executed as follows:
2.1) opening the image storage directory with the LabelMe labeling software;
2.2) labeling the pigs in the image in a point-line format, ensuring that the points lie on the minimum bounding rectangle of each pig and roughly cover the outline of the pig body, with only 5 points used per pig;
2.3) generating the labeled json file and placing the image in the same file directory.
As a further technical solution, constructing CotSENet as the backbone network for feature extraction includes the following steps:
4.1) dividing feature extraction into 5 stages, the input image being preprocessed in the first stage with a 7×7 convolution and 3×3 max pooling;
4.2) constructing CotSE residual feature extraction blocks, and connecting 3, 4, 6 and 3 CotSE residual feature extraction blocks in series in the second, third, fourth and fifth stages respectively as the feature extraction units of the backbone network.
As a further technical solution, constructing the CotSE residual feature extraction block includes the following steps:
5.1) using a 1×1 convolution kernel to reduce the number of feature channels of the initial input map by a factor of 4;
5.2) learning the feature map with the CotSE module;
5.3) using a 1×1 convolution kernel to restore the dimension of the learned feature map to the input size;
5.4) adding the initial input and the learned feature map, now of the same dimension, and outputting the final feature map.
As a further technical solution, the operation on the feature map with the CotSE module comprises the following steps:
6.1) using average pooling and max pooling to reduce the H×W×C feature map to feature vectors of size 1×1×C;
6.2) learning the feature vectors from step 6.1) with a fully connected layer, adding the two feature vectors, obtaining the final channel attention weight matrix with a sigmoid() activation function, multiplying it with the initially input feature channels, and outputting a feature map;
6.3) from the feature map output in 6.2), obtaining the k and v matrices of the self-attention mechanism with 1×1 convolutions, extracting the context information of each point of the feature map with a 3×3 convolution, concatenating this context information with the initially input feature map, obtaining the q matrix of the self-attention mechanism by a 1×1 convolution on the concatenated feature map, and multiplying the q matrix with the k matrix and then with the v matrix to obtain a new feature map;
6.4) adding the feature maps obtained in 6.2) and 6.3), and outputting the final feature map.
According to an aspect of the present application, a BoxInst-based weakly supervised pig image instance segmentation device is provided, which uses a trained BoxInst segmentation model to segment pig image instances, wherein the training of the BoxInst segmentation model further includes:
constructing a weakly supervised data set;
training on the weakly supervised data set with the built BoxInst segmentation model to obtain the trained BoxInst segmentation model; the established BoxInst segmentation model consists of a CotSENet backbone extraction network, an FPN layer, a Controller Head layer and a Mask Branch layer, the feature maps extracted by the CotSENet backbone extraction network are passed to the FPN layer for multi-scale fusion, target detection is performed by the Controller Head layer, final pixel-level prediction is performed by the Mask Branch layer, back-propagation is performed through the Projection Loss and the Pairwise Affinity Loss, and finally bounding-box-based weakly supervised instance segmentation is realized.
According to an aspect of the present application, an electronic device is provided, comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the BoxInst-based weakly supervised pig image instance segmentation method.
According to an aspect of the present application, a computer readable storage medium is provided, having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the BoxInst-based weakly supervised pig image instance segmentation method.
Compared with the prior art, the application has the beneficial effects that:
1. The application provides a method for performing instance segmentation training with a weakly supervised data set, constructs the CotSENet backbone extraction network and the BoxInst segmentation model for training on this data set, and judges the final number of training rounds and the training effect through the Semg_mAP value, thereby addressing the problem that producing a training data set in the instance segmentation field, which is dominated by deep learning methods, is quite troublesome and costly. Meanwhile, the complete pig image instance segmentation method provided by the application extracts pigs from images at low cost.
2. From the viewpoint of image feature extraction, the application combines a channel attention mechanism with the CoT attention module to propose the CotSE module, and constructs the CotSENet backbone network on the ResNet-50 backbone feature extraction framework; CotSENet learns the image over a larger range: it establishes global, local and long-range perception for each feature extraction point, strengthens important feature channels and increases the accuracy of feature extraction, finally realizing the learning of different pigs in the image and solving the problem of poor segmentation of pigs under complex conditions such as aggregation and occlusion.
Drawings
Fig. 1 is a flowchart of BoxInst-based weakly supervised pig image instance segmentation according to an embodiment of the application.
FIG. 2 is a schematic diagram of coarse labeling and fine labeling according to an embodiment of the application.
FIG. 3 is a detailed schematic diagram of the rough labeling according to an embodiment of the application.
Fig. 4 is a schematic diagram of a CotSE module architecture according to an embodiment of the application.
FIG. 5 is a schematic diagram of a channel attention mechanism according to an embodiment of the present application.
FIG. 6 is a schematic diagram of a training segmentation model according to an embodiment of the present application.
Fig. 7 is a schematic diagram of a segmentation result according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made more apparent and fully hereinafter with reference to the accompanying drawings, in which some, but not all embodiments of the application are shown. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
Currently, research in the field of image instance segmentation has focused on training neural network models without relying on pixel-level labeling. By drawing on traditional image-processing algorithms and the concept of gradient back-propagation in deep learning, a deep learning model can achieve training results not inferior to those obtained with pixel-level labeling using only rough labels (such as points, bounding boxes and the like). On this basis, the application improves the BoxInst instance segmentation model by establishing a weakly supervised data set and finally realizes pig image instance segmentation.
The application uses a channel attention mechanism which can help a Convolutional Neural Network (CNN) to better learn image characteristics, and adaptively adjusts the weight of each channel by learning the importance of each channel, thereby improving the performance of the model. In the image segmentation task, the channel attention mechanism can help the model to better identify important features in the image, reduce redundant features and improve the robustness of the model.
The technical conception of the application is as follows: firstly, a roughly labeled weakly supervised pig instance segmentation data set is produced; secondly, a CotSENet backbone network is constructed to extract image information and the BoxInst segmentation model is used for training; finally, the Semg_mAP index is used to evaluate the training result of the model.
The BoxInst-based weakly supervised pig image instance segmentation method comprises data set production and model training.
FIG. 1 is a schematic overall flow diagram of the method, mainly comprising two modules, data set production and model training; the data set production module comprises image acquisition, image annotation, format conversion and fine annotation of pigs in images; the model training module comprises segmentation model construction and judgment of the training end condition.
The method comprises the following specific implementation steps:
step I, data set construction
The data set is produced here with a rough labeling scheme for training with BoxInst; the specific steps are as follows:
S1, pig-farm surveillance video data are collected, and the images obtained by frame extraction are screened to construct a pig image segmentation data set;
S2, pigs in the images are labeled with the LabelMe labeling software in a point-line format, ensuring that the points lie on the minimum bounding rectangle of each pig;
S3, the labeled json files are divided into a training set, a validation set and a test set in the ratio 8:1:1 and converted into the COCO data set format;
S4, 50 pig images are randomly selected from the test set and finely labeled at pixel level for the final check of the model effect.
Specifically, the constructed data set is subjected to weak supervision labeling, all pigs in the same pig house in the image are labeled in a rough outline labeling mode shown in fig. 2 and 3, different numbers are given to each pig, json files and pictures formed by labeling are stored in the same folder, and the same naming mode (only the file format suffixes are different) is adopted.
After the labeling is completed, all labeled json files and the original images are converted into the COCO data set format; as shown in fig. 3, the upper-left corner coordinates and the width and height of the bounding box field in the COCO data set can be obtained from the maximum and minimum values of the points in the rough labeling, as sketched below. The roughly labeled data set is divided into a training set, a validation set and a test set in the ratio 8:1:1, forming three json files, train.json, test.json and valid.json, together with a folder containing all pig images; finally, 50 images are randomly selected from the test set for pixel-level fine labeling, as shown in fig. 2.
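For illustration only (this sketch is not part of the claimed method), the conversion just described, in which the COCO bounding box of each pig is derived from the minimum and maximum coordinates of its rough labeling points, could be written roughly as follows; the LabelMe field names follow that tool's usual json layout, and the category name and file handling here are assumptions.

```python
# Sketch: 5-point LabelMe rough annotations -> COCO-style bounding boxes.
import glob
import json
import os

def labelme_points_to_coco_bbox(points):
    """COCO bbox = [x_min, y_min, width, height] from the rough contour points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_min = min(xs), min(ys)
    return [x_min, y_min, max(xs) - x_min, max(ys) - y_min]

def convert_folder(labelme_dir):
    coco = {"images": [], "annotations": [], "categories": [{"id": 1, "name": "pig"}]}
    ann_id = 0
    for img_id, json_path in enumerate(sorted(glob.glob(os.path.join(labelme_dir, "*.json")))):
        with open(json_path, encoding="utf-8") as f:
            data = json.load(f)
        coco["images"].append({"id": img_id, "file_name": data["imagePath"],
                               "height": data["imageHeight"], "width": data["imageWidth"]})
        for shape in data["shapes"]:          # one shape per pig, roughly 5 points each
            bbox = labelme_points_to_coco_bbox(shape["points"])
            coco["annotations"].append({"id": ann_id, "image_id": img_id, "category_id": 1,
                                        "bbox": bbox, "area": bbox[2] * bbox[3], "iscrowd": 0})
            ann_id += 1
    return coco
```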
Step II, training and judging by using a BoxInst segmentation model
Here BoxInst consists of a Backbone (backbone extraction network), an FPN layer, a Controller Head layer, a Mask Branch layer and the loss functions; the model is shown in fig. 6. The whole segmentation model is divided into the following modules:
1) Constructing a CotSENet backbone network
The CotSENet backbone network mainly takes ResNet-50 as its framework and uses the CotSE module to replace the original 3×3 convolution operation, so that the proposed CotSE optimizes the extraction of pig image features; specifically, the CoT self-attention mechanism, which connects context information, and a channel attention mechanism are introduced, a long-range relationship between pixels is established during feature extraction, redundant features are suppressed, and finally the accuracy of feature extraction is improved.
The CotSE includes a channel domain attention mechanism and a Cot module.
(1) Channel attention mechanism
The channel attention mechanism learns and assigns a weight distribution to different channels according to their importance, focuses on important feature channels and weakens the influence of other features, thereby improving network performance; the weight distribution is assigned to the obtained feature map on a per-channel basis through three operations.
The method comprises the following steps:
The first step: squeeze operation (Squeeze). The two-dimensional feature map (H×W) of each channel is compressed into a single real number by global pooling; this is feature compression along the spatial dimensions, and because this real number is computed from all values of the two-dimensional feature it has, to some extent, a global receptive field; the number of channels remains unchanged, so the output after the squeeze operation is 1×1×C. The specific formula is:

$$z_c = F_{sq}(u_c) = \frac{1}{W \times H}\sum_{i=1}^{W}\sum_{j=1}^{H} u_c(i, j)$$

where $F_{sq}(\cdot)$ is the squeeze function, $W$, $H$ and $C$ are the width, height and number of channels of the feature map to be processed, $u_c(i, j)$ is the element of channel $c$ of the feature map at coordinates $(i, j)$, and $z_c$ is the squeezed output feature of channel $c$. After the squeeze operation, a one-dimensional tensor with length equal to the number of channels is formed.
The second step: excitation operation (Excitation). A weight is generated for each feature channel through the parameter W, and the number of output weights equals the number of input features. The specific formula is:

$$s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma\!\left(W_2\,\delta(W_1 z)\right)$$

where $F_{ex}$ denotes the excitation operation and $z$ is the output of the squeeze operation, a tensor of size 1×1×C. $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ are weights, where $r$ is a scaling parameter used to reduce the number of channels and thus the amount of computation. $\delta$ denotes the ReLU activation function and $\sigma$ the Sigmoid activation function. Reading from the innermost term: multiplying $z$ by $W_1$ is a fully connected operation whose result has dimension $1 \times 1 \times \frac{C}{r}$; it then passes through a ReLU layer with unchanged dimension, is multiplied by $W_2$ in a second fully connected operation so that the output dimension becomes 1×1×C, and finally passes through a Sigmoid function to obtain $s$. This $s$ is the core of the SE module: it characterizes the weights of the feature map, and these weights are learned through the fully connected and nonlinear layers above.
The third step: feature recalibration (Scale). The normalized weights obtained in the previous step are applied to the features of each channel by channel-wise multiplication, completing the introduction of the attention mechanism in the channel dimension. The specific formula is:

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c$$

where $F_{scale}(\cdot)$ denotes the recalibration function, $\tilde{x}_c$ is the channel-$c$ feature in the result, $s_c$ is the channel-$c$ weight, and $u_c$ is the channel-$c$ feature of the input feature map.
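As a minimal, non-authoritative sketch of the three steps formalized above (squeeze by global average pooling, excitation by two fully connected layers with ReLU and Sigmoid, and channel-wise recalibration), a standard SE-style block could look as follows in PyTorch; the reduction ratio r = 16 is an assumption, not taken from the patent.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),  # W1: C -> C/r
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),  # W2: C/r -> C
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (N, C, H, W)
        n, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))                   # squeeze: global average pooling -> (N, C)
        s = self.fc(z).view(n, c, 1, 1)          # excitation: channel weights in (0, 1)
        return x * s                             # scale: channel-wise recalibration
```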
(2) Cot module
Most existing Transformer-based architectures act directly on 2D feature maps, obtaining an attention matrix from independent query points (queries) and all keys, but they do not fully exploit the rich context between adjacent keys. To make full use of the context information of the pixels, CoT proposes concatenating the convolved feature map with the original feature map before running the self-attention mechanism; this design fully exploits the context information among the input keys to guide the learning of a dynamic attention matrix and thereby enhances the visual representation capability. Technically, CoT first context-encodes the input keys with a 3×3 convolution, producing a static context representation of the input. The encoded keys are then concatenated with the input queries, and a dynamic multi-head attention matrix is learned by two consecutive 1×1 convolutions. Finally, the learned attention matrix is multiplied with the input values to obtain a dynamic context representation of the input, and the static and dynamic context representations are fused as the final output. The computation details are as follows:
Given an input feature map X of size H×W×C, the keys, queries and values are defined as $K = X$, $Q = X$ and $V = X W_v$ respectively. CoT first spatially context-encodes all adjacent keys within a k×k grid to obtain the contextual key $K^1 \in \mathbb{R}^{H \times W \times C}$, which reflects the static context information between local neighboring keys. Next, the contextual key $K^1$ is concatenated (Concat) with $Q$, and the attention matrix is obtained by two consecutive 1×1 convolutions:

$$A = [K^1, Q]\, W_\theta W_\delta$$

where A denotes the aggregated contextual attention matrix; the local attention matrix of each spatial location is learned from the query features and the contextual key features rather than from simple query-key pairs. Next, a weighted feature map $K^2$ is obtained by aggregating A with the values V:

$$K^2 = V \times A$$

The final output of CoT is the fusion of the static context $K^1$ and the dynamic context $K^2$.
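A simplified PyTorch sketch of this computation is given below for illustration: the static context K1 comes from a 3×3 convolution on the keys, the attention is learned from the concatenation [K1, Q] by two consecutive 1×1 convolutions, and the aggregation with V is reduced here to an element-wise gating, which is a simplification of the local attention matrix used in the CoT paper; it is not the patented implementation, and the normalization layers are assumptions.

```python
import torch
import torch.nn as nn

class CoTBlock(nn.Module):
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.key_embed = nn.Sequential(                      # static context K1 via 3x3 conv
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True))
        self.value_embed = nn.Conv2d(dim, dim, 1, bias=False)  # V = X Wv (1x1 convolution)
        self.attn = nn.Sequential(                            # two consecutive 1x1 convolutions
            nn.Conv2d(2 * dim, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 1))

    def forward(self, x):                        # x: (N, C, H, W); K = Q = X
        k1 = self.key_embed(x)                   # static contextual keys
        v = self.value_embed(x)
        a = self.attn(torch.cat([k1, x], dim=1)) # A = [K1, Q] W_theta W_delta
        k2 = v * torch.sigmoid(a)                # dynamic context (element-wise gating,
                                                 # a simplification of the local attention)
        return k1 + k2                           # fuse static and dynamic context
```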
(3) CotSE module
The application constructs the CotSE module, which combines the two mechanisms: the feature map passes first through the channel attention module and then through CoT, after which the feature map is reconstructed. CotSE adopts an idea similar to human attention: through continuous self-learning it re-assigns weights to the feature maps, emphasizing features with large weights and suppressing useless ones; it also introduces the self-attention idea, establishing long-range relationships for individual pixels and analyzing their association with other pixels, so that context information guides the network to learn more accurate features. The channel attention module of CotSE applies max pooling and average pooling to the input feature map over width and height, keeping the maximum and the average value of each channel and compressing the input feature map in the width and height dimensions; the pooled features are then fed to a block consisting of two fully connected layers and one sigmoid layer, which assigns a weight to each channel, and the weights are finally multiplied with the channels of the original input, so that important feature channels are enhanced and unimportant channels are suppressed. The output of the channel attention module is then used as the input of the CoT module, which first context-encodes the input keys with a 3×3 convolution to obtain a static context representation of the input; the encoded keys are concatenated with the input queries, a dynamic multi-head attention matrix is learned by two consecutive 1×1 convolutions, the learned attention matrix is multiplied with the input values to obtain a dynamic context representation, and the static and dynamic context representations are fused to produce the final output feature map.
In one embodiment, constructing CotSENet as the backbone network for feature extraction specifically includes the following steps: feature extraction is divided into 5 stages; in the first stage the input image is preprocessed with a 7×7 convolution and 3×3 max pooling; CotSE residual feature extraction blocks are constructed, and 3, 4, 6 and 3 CotSE residual blocks are connected in series in the second, third, fourth and fifth stages respectively as the feature extraction units of the backbone network.
The construction of the CotSE module is shown in fig. 4, and specifically includes the following steps:
A 1×1 convolution kernel is used to reduce the number of feature channels of the initial input map by a factor of 4; as shown in fig. 5, the reduced feature map is pooled with average pooling and max pooling into feature vectors of size 1×1×C; the feature vectors are learned with a fully connected layer, the two vectors are added, a sigmoid() activation function yields the final channel attention weight matrix, which is multiplied with the initially input feature channels; from the feature map output in the previous step, 1×1 convolutions produce the k and v matrices of the self-attention mechanism, a 3×3 convolution extracts the context information of each point of the feature map, this context information is concatenated with the initially input feature map, a 1×1 convolution on the concatenated feature map produces the q matrix of the self-attention mechanism, and the q matrix is multiplied with the k matrix and then with the v matrix to obtain a new feature map; finally, a 1×1 convolution kernel restores the feature map dimension to the initial input size, the initial input and the learned feature map are added in the same dimension, and the feature map is output.
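Putting the above flow together, one CotSE residual block could be sketched as follows, reusing the ChannelAttention and CoTBlock sketches shown earlier. This is a hedged illustration of the 1×1 reduction, attention, 1×1 expansion and residual addition, not the exact patented module; in particular, the patent's channel attention uses both average and max pooling, whereas the earlier sketch uses only average pooling.

```python
import torch.nn as nn

class CotSEBottleneck(nn.Module):
    def __init__(self, channels):
        super().__init__()
        mid = channels // 4
        self.reduce = nn.Sequential(nn.Conv2d(channels, mid, 1, bias=False),
                                    nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.channel_attn = ChannelAttention(mid)   # patent variant also uses max pooling
        self.cot = CoTBlock(mid)
        self.expand = nn.Sequential(nn.Conv2d(mid, channels, 1, bias=False),
                                    nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.reduce(x)                         # 1x1 reduction: C -> C/4
        out = self.cot(self.channel_attn(out))       # CotSE: channel attention, then CoT
        out = self.expand(out)                       # 1x1 expansion back to C channels
        return self.relu(out + x)                    # residual addition in the same dimension
```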
2) Fpn (characteristic pyramid)
As shown in fig. 6, the feature maps extracted in the third, fourth and fifth stages of the CotSENet backbone network are taken as the inputs of the FPN and are denoted C1, C2 and C3. In the top-down path, C3 and C2 are upsampled by a factor of 2 and the feature channels are learned and scaled with 1×1 convolution kernels; the upsampled C3 is added to C2 and a further 1×1 convolution learns and scales the feature channels to obtain P2, the upsampled C2 branch is added to C1 and a 1×1 convolution learns and scales the feature channels to obtain P1, and C3 itself passes through two 1×1 convolutions for feature channel learning and scaling to obtain P3; the obtained P3 is then downsampled by a factor of 2 twice to obtain P4 and P5 respectively.
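For orientation, a standard FPN-style fusion consistent with the description above could be sketched as follows; the channel sizes and the use of stride-2 convolutions to produce the extra levels P4/P5 are assumptions based on common FPN/FCOS practice rather than the patent's exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.extra = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)
                                    for _ in range(2)])        # produce P4 and P5 from P3

    def forward(self, c1, c2, c3):
        p3 = self.lateral[2](c3)                               # lateral 1x1 on C3
        p2 = self.lateral[1](c2) + F.interpolate(p3, size=c2.shape[-2:], mode="nearest")
        p1 = self.lateral[0](c1) + F.interpolate(p2, size=c1.shape[-2:], mode="nearest")
        p4 = self.extra[0](p3)                                 # downsample P3 -> P4
        p5 = self.extra[1](p4)                                 # downsample P4 -> P5
        return p1, p2, p3, p4, p5
```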
3) Mask Branch layer
As shown in fig. 6, the P2 and P3 feature maps of the FPN are bilinearly interpolated to the resolution of P1 and added to P1 as the input of the mask branch; this fused feature map is passed through four 3×3 convolutions with 128 channels and then a 1×1 convolution with 8 channels, giving a feature map of 1/8 the original image size with 8 channels, denoted F_mask.
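A minimal sketch of such a mask branch (fuse P1-P3, four 3×3 convolutions with 128 channels, then a 1×1 convolution with 8 output channels) is shown below for illustration; the normalization layers and the assumption that P1 has stride 8 are choices made here, not taken from the patent.

```python
import torch.nn as nn
import torch.nn.functional as F

class MaskBranch(nn.Module):
    def __init__(self, in_channels=256, mid_channels=128, out_channels=8):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(4):                                     # four 3x3 convs, 128 channels
            layers += [nn.Conv2d(c, mid_channels, 3, padding=1, bias=False),
                       nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True)]
            c = mid_channels
        self.tower = nn.Sequential(*layers)
        self.proj = nn.Conv2d(mid_channels, out_channels, 1)   # final 1x1 conv, 8 channels

    def forward(self, p1, p2, p3):
        # bilinearly upsample the higher levels to P1's resolution and sum
        x = p1 + F.interpolate(p2, size=p1.shape[-2:], mode="bilinear", align_corners=False) \
               + F.interpolate(p3, size=p1.shape[-2:], mode="bilinear", align_corners=False)
        return self.proj(self.tower(x))                        # F_mask: (N, 8, H/8, W/8)
```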
4) Controller Head layer
As shown in fig. 6, P1, P2, P3, P4 and P5 of the FPN are used as shared feature parameters, and convolution operations are used to obtain the position and shape parameters of the different instances in the feature maps; convolutions are also used to obtain the relative coordinate maps, i.e. the relative position coordinates of F_mask in the Mask Branch layer with respect to the location (x, y) of the current instance, and these relative coordinate maps are finally concatenated with F_mask. The feature map output by the Mask Branch layer is then processed by a lightweight FCN consisting of three 1×1 convolutions with 8 channels, each followed by ReLU(), and finally a sigmoid (two-class) prediction produces the mask of each instance.
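The dynamic mask head can be illustrated with the following simplified sketch for a single instance: relative coordinates are concatenated to F_mask and three 1×1 convolutions, whose weights are generated by the Controller Head, are applied in a CondInst/BoxInst style. The parameter layout, the normalization of the relative coordinates and the single-channel final layer are illustrative assumptions rather than the patent's exact parameterization.

```python
import torch
import torch.nn.functional as F

def dynamic_mask_head(f_mask, center_xy, params, channels=8):
    """f_mask: (1, 8, H, W) mask-branch output; center_xy: (x, y) of the instance in
    F_mask coordinates; params: flat tensor of dynamically generated weights and biases."""
    _, c, h, w = f_mask.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=f_mask.device),
                            torch.arange(w, device=f_mask.device), indexing="ij")
    rel = torch.stack([(xs - center_xy[0]) / w, (ys - center_xy[1]) / h]).float()
    x = torch.cat([f_mask, rel.unsqueeze(0)], dim=1)               # (1, 10, H, W)

    sizes = [(channels, c + 2), (channels, channels), (1, channels)]  # three 1x1 convolutions
    idx = 0
    for i, (out_c, in_c) in enumerate(sizes):
        w_i = params[idx:idx + out_c * in_c].view(out_c, in_c, 1, 1); idx += out_c * in_c
        b_i = params[idx:idx + out_c]; idx += out_c
        x = F.conv2d(x, w_i, b_i)
        x = F.relu(x) if i < len(sizes) - 1 else torch.sigmoid(x)   # sigmoid -> instance mask
    return x
```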
5) Loss function
As shown in fig. 6, the loss functions are the key to weakly supervised learning with the BoxInst segmentation model; back-propagation is driven mainly by two loss functions, the Projection Loss (mask horizontal and vertical projection loss) and the Pairwise Affinity Loss (pixel color similarity loss). The Projection Loss explicitly constrains the predicted mask within the bounding box of the object by matching the projections of the predicted instance mask and the labeled bounding box in both directions. Since the output of the Mask Branch layer can be regarded as the probability that each pixel is foreground, the color similarity between each point in the bounding box and the 8 points around it (within the bounding box) is calculated, the pairs with similarity greater than τ are taken as positive samples, and the BCE loss is computed with the obtained probabilities.
Of the above 5 modules, the loss functions and Semg_mAP are described further below:
1) Projection Loss (mask horizontal and vertical projection loss function)
During data set production, the minimum bounding rectangle of each instance can be obtained from the point-line annotation and is denoted as the box; by matching the projections of the predicted instance mask and the labeled bounding box in both directions, the predicted mask is explicitly constrained within the bounding box of the target, and this labeling is mainly used to locate the pigs in the image. To introduce this constraint from a projection perspective, let b be the box mask; its projections on the x-axis and y-axis can be expressed as:

$$\mathrm{Proj}_x(b) = \max_y(b) = l_x,\qquad \mathrm{Proj}_y(b) = \max_x(b) = l_y$$

In code form, $\max_{*}$ can be expressed as max pooling of b along the x-axis or the y-axis. Let $\tilde{m}$ denote the mask of a predicted instance output by the network. Constraining the projection distance, the final loss can be expressed as:

$$L_{proj} = L\big(\mathrm{Proj}_x(\tilde{m}),\; l_x\big) + L\big(\mathrm{Proj}_y(\tilde{m}),\; l_y\big)$$

where $\mathrm{Proj}_x(\tilde{m})$ denotes the projection of the predicted mask on the x-axis, $\mathrm{Proj}_y(\tilde{m})$ the projection of the predicted mask on the y-axis, and L may be either the Dice Loss or the BCE cross entropy.
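A hedged sketch of this projection term follows: the predicted mask scores and the box mask are projected onto the x- and y-axes by taking maxima, and the projections are compared with a Dice loss, one of the two options named above.

```python
import torch

def dice_loss(pred, target, eps=1e-5):
    inter = (pred * target).sum()
    return 1.0 - (2 * inter + eps) / (pred.pow(2).sum() + target.pow(2).sum() + eps)

def projection_loss(mask_scores, box_mask):
    """mask_scores: (H, W) predicted foreground probabilities; box_mask: (H, W) binary box mask."""
    loss_x = dice_loss(mask_scores.max(dim=0).values, box_mask.max(dim=0).values)  # x-axis projection
    loss_y = dice_loss(mask_scores.max(dim=1).values, box_mask.max(dim=1).values)  # y-axis projection
    return loss_x + loss_y
```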
2) Pairwise Affinity Loss (pixel color similarity loss function)
The projection constraint of the Projection Loss alone is not sufficient to obtain a good mask, so a further constraint, the pairwise affinity relationship, is introduced from the similarity between pixels. In other words, since the output of the mask head in the previous step can be regarded as the probability that each pixel is foreground, the color similarity between each point in the bounding box and the 8 points around it (within the bounding box) is calculated, pairs with similarity greater than τ are taken as positive samples, and the BCE loss is computed with the obtained probabilities. Similar to methods commonly used in weakly supervised semantic segmentation, a simple topology graph G = (V, E) is built for each feature point within its eight-neighborhood. Each graph edge e is given a label $y_e = 1$ when the nodes on the edge have the same label, and 0 otherwise.
In prediction, the probability of $y_e = 1$ is obtained by the following formula:

$$P(y_e = 1) = m_{i,j}\, m_{k,l} + (1 - m_{i,j})(1 - m_{k,l})$$

where $m_{i,j}$ is the foreground probability of a pixel on the mask and $m_{k,l}$ is that of its neighboring pixel.
Further, the pairwise loss is as follows:

$$L_{pairwise} = -\frac{1}{N}\sum_{e \in E_{in}} \mathbb{1}_{\{y_e = 1\}} \log P(y_e = 1)$$

where $E_{in}$ is the set of graph edges whose two endpoint pixels both lie inside the box, and N is the number of such edges.
During back-propagation in training, the labels are constructed from color similarity in the LAB color space to determine the label of each graph edge; when the colors of two pixels meet the similarity criterion, they can be judged to belong to the same target.
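The pairwise term can be illustrated with the simplified sketch below, which assumes that the color similarity of each pixel to its eight neighbors (e.g. computed in the LAB color space) has already been precomputed; the edge handling via torch.roll and the threshold value are simplifications for illustration, not the patent's exact implementation.

```python
import torch

def pairwise_loss(mask_scores, color_sim, box_mask, tau=0.3):
    """mask_scores: (H, W) foreground probabilities; color_sim: (8, H, W) precomputed
    color similarity to the 8 neighbors; box_mask: (H, W) binary in-box mask."""
    log_p = []
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    for k, (dy, dx) in enumerate(shifts):
        neigh = torch.roll(mask_scores, shifts=(dy, dx), dims=(0, 1))
        p_same = mask_scores * neigh + (1 - mask_scores) * (1 - neigh)   # P(y_e = 1)
        # positive edges: similar color and both endpoints inside the box
        pos = (color_sim[k] >= tau) & (box_mask > 0)
        if pos.any():
            log_p.append(-torch.log(p_same[pos].clamp(min=1e-5)))
    return torch.cat(log_p).mean() if log_p else mask_scores.sum() * 0.0
```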
3)Semg_mAP
The use of Semg_mAP specifically comprises the following procedures:
1) The original images are labeled with fine annotations, and the labeled masks are taken as ground truth; IoU (intersection-over-union) thresholds from 0.5 to 0.95 with a step of 0.05 are taken, and the IoU between the prediction and the ground-truth annotation of each instance is determined;
2) The TP, FP and FN values are calculated, and precision and recall are computed from these three values;
3) The P-R curve is drawn from the precision and recall values under different thresholds, the area under the curve is computed by integration to obtain the AP value, and the AP values of all instances are averaged to obtain Semg_mAP;
4) Verification uses the 50 pixel-level labeled images: the Semg_mAP value on the finely labeled test set is checked after each training round, and training stops when Semg_mAP exceeds 65%, giving the final model. A test image of the final model is shown in fig. 7.
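For reference, the Semg_mAP measurement on the 50 finely annotated test images can be sketched with pycocotools as follows; the json file names are assumptions, and COCOeval with iouType="segm" averages AP over IoU thresholds 0.50:0.05:0.95 as described above.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

def evaluate_segm_map(gt_json="test_fine.json", dt_json="predictions.json"):
    coco_gt = COCO(gt_json)                       # pixel-level ground-truth annotations
    coco_dt = coco_gt.loadRes(dt_json)            # model predictions in COCO result format
    ev = COCOeval(coco_gt, coco_dt, iouType="segm")
    ev.evaluate(); ev.accumulate(); ev.summarize()
    return ev.stats[0]                            # AP@[0.50:0.95], i.e. the Semg_mAP value
```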
Through the above steps, the pig image instance segmentation task can be completed at low cost, which facilitates downstream analysis.
The application also provides a BoxInst-based weakly supervised pig image instance segmentation device, which uses a trained BoxInst segmentation model to segment pig image instances, wherein the training of the BoxInst segmentation model further includes:
constructing a weakly supervised data set;
training on the weakly supervised data set with the built BoxInst segmentation model to obtain the trained BoxInst segmentation model; the established BoxInst segmentation model consists of a CotSENet backbone extraction network, an FPN layer, a Controller Head layer and a Mask Branch layer, the feature maps extracted by the CotSENet backbone extraction network are passed to the FPN layer for multi-scale fusion, target detection is performed by the Controller Head layer, final pixel-level prediction is performed by the Mask Branch layer, back-propagation is performed through the Projection Loss and the Pairwise Affinity Loss, and finally bounding-box-based weakly supervised instance segmentation is realized.
The device may be implemented by adopting the embodiment of the foregoing method, which is not described herein.
The application also provides an electronic device, which may be an industrial personal computer, a server or a computer terminal. The electronic device comprises a processor, a memory and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the BoxInst-based weakly supervised pig image instance segmentation method.
The electronic device comprises a processor, a memory and a network interface connected via a system bus, wherein the memory may include a non-volatile storage medium and an internal memory. The non-volatile storage medium may store an operating system and a computer program. The computer program comprises program instructions which, when executed, cause the processor to perform the steps of any of the BoxInst-based weakly supervised pig image instance segmentation methods.
The processor is used to provide computing and control capabilities to support the operation of the entire electronic device. The internal memory provides an environment for running the computer program in the non-volatile storage medium; when the computer program is executed by the processor, it causes the processor to perform the steps of any of the BoxInst-based weakly supervised pig image instance segmentation methods.
The network interface is used for network communication, such as transmitting assigned tasks. It is to be appreciated that the processor may be a Central Processing Unit (CPU), and may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The application also provides a computer readable storage medium on which a computer program is stored; the computer program, when executed by a processor, implements the steps of the BoxInst-based weakly supervised pig image instance segmentation method.
The innovation point of the application is that:
1) A weakly supervised data set based on rough labeling and a BoxInst-based weakly supervised pig image instance segmentation method
In the field of instance segmentation, where deep learning methods predominate, producing a training data set is quite cumbersome and costly. To solve this problem, the application proposes performing instance segmentation training with a weakly supervised data set, constructs the CotSENet backbone extraction network and the BoxInst segmentation model for training on this data set, and judges the final number of training rounds and the training effect through the Semg_mAP value. A complete pig image instance segmentation method is provided, and pigs are extracted from images at low cost.
2) A CotSE self-attention module integrating context information and channel information
Pigs in images often aggregate and occlude one another, and different pigs in an image have only a certain spatial position relationship; both present significant challenges for training deep-learning-based instance segmentation models. Therefore, from the viewpoint of image feature extraction, the application combines a channel attention mechanism with the CoT attention module to propose the CotSE module, which replaces the 3×3 convolution operation in ResNet-50 to construct the CotSENet backbone network; CotSENet learns the image over a larger range: it establishes global, local and long-range perception for each feature extraction point, strengthens important feature channels and increases the accuracy of feature extraction, finally realizing the learning of different pigs in the image and solving the problem of poor segmentation of pigs under complex conditions such as aggregation and occlusion.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; these modifications or substitutions do not depart from the essence of the corresponding technical solutions from the technical solutions of the embodiments of the present application.

Claims (10)

1. A BoxInst-based weakly supervised pig image instance segmentation method, characterized by segmenting pig image instances with a trained BoxInst segmentation model, wherein the training of the BoxInst segmentation model further comprises:
constructing a weakly supervised data set;
training on the weakly supervised data set with the built BoxInst segmentation model to obtain the trained BoxInst segmentation model; the established BoxInst segmentation model consists of a CotSENet backbone extraction network, an FPN layer, a Controller Head layer and a Mask Branch layer, wherein the feature maps extracted by the CotSENet backbone extraction network are passed to the FPN layer for multi-scale fusion, target detection is performed by the Controller Head layer, final pixel prediction is performed by the Mask Branch layer, back-propagation is performed through the mask horizontal and vertical projection loss function (Projection Loss) and the pixel color similarity loss function (Pairwise Affinity Loss), and finally bounding-box-based weakly supervised instance segmentation is realized.
2. The BoxInst-based weakly supervised pig image instance segmentation method according to claim 1, characterized in that constructing the weakly supervised data set further comprises the steps of:
2.1) collecting pig-farm surveillance video data and screening the images obtained by frame extraction to construct a pig image segmentation data set;
2.2) performing weakly supervised labeling on the data set constructed in step 2.1): labeling all pigs in the same pig house in an image with rough outline annotations, giving each pig a different number, storing the json files produced by labeling and the pictures in the same folder with the same naming scheme; randomly selecting 50 images from the test set and labeling them at pixel level for verifying the model training result;
2.3) after the labeling in step 2.2) is completed, converting all labeled json files and the original images into the COCO data set format, and dividing the roughly labeled data set into a training set, a validation set and a test set in the ratio 8:1:1, forming three json files, train.json, test.json and valid.json.
3. The BoxInst-based weakly supervised pig image instance segmentation method according to claim 2, further comprising: verifying the effectiveness of the trained model with the pixel-level labeled images in the data set, and measuring the performance of the trained model weights on the test set with the Semg_mAP index.
4. The BoxInst-based weakly supervised pig image instance segmentation method according to claim 2, characterized in that the rough outline labeling is executed as follows:
4.1) opening the image storage directory with the LabelMe labeling software;
4.2) labeling the pigs in the image in a point-line format, ensuring that the points lie on the minimum bounding rectangle of each pig and roughly cover the outline of the pig body, with only 5 points used per pig;
4.3) generating the labeled json file and placing the image in the same file directory.
5. The BoxInst-based weakly supervised pig image instance segmentation method according to claim 1, characterized in that constructing CotSENet as the backbone network for feature extraction comprises the following steps:
5.1) dividing feature extraction into 5 stages, the input image being preprocessed in the first stage with a 7×7 convolution and 3×3 max pooling;
5.2) constructing CotSE residual feature extraction blocks, and connecting 3, 4, 6 and 3 CotSE residual feature extraction blocks in series in the second, third, fourth and fifth stages respectively as the feature extraction units of the backbone network.
6. The BoxInst-based weakly supervised pig image instance segmentation method according to claim 1, characterized in that constructing the CotSE residual feature extraction block comprises the following steps:
6.1) using a 1×1 convolution kernel to reduce the number of feature channels of the initial input map by a factor of 4;
6.2) learning the feature map with the CotSE module;
6.3) using a 1×1 convolution kernel to restore the dimension of the learned feature map to the input size;
6.4) adding the initial input and the learned feature map, now of the same dimension, and outputting the final learned feature map.
7. The BoxInst-based weakly supervised pig image instance segmentation method according to claim 1, characterized in that the operation on the feature map with the CotSE module comprises the following steps:
7.1) using average pooling and max pooling to reduce the H×W×C feature map to feature vectors of size 1×1×C;
7.2) learning the feature vectors from step 7.1) with a fully connected layer, adding the two feature vectors, obtaining the final channel attention weight matrix with a sigmoid() activation function, multiplying it with the initially input feature channels, and outputting a feature map;
7.3) from the feature map output in 7.2), obtaining the k and v matrices of the self-attention mechanism with 1×1 convolutions, extracting the context information of each point of the feature map with a 3×3 convolution, concatenating this context information with the initially input feature map, obtaining the q matrix of the self-attention mechanism by a 1×1 convolution on the concatenated feature map, and multiplying the q matrix with the k matrix and then with the v matrix to obtain a new feature map;
7.4) adding the feature maps obtained in 7.2) and 7.3), and outputting the final feature map.
8. A BoxInst-based weakly supervised pig image instance segmentation device, characterized by segmenting pig image instances with a trained BoxInst segmentation model, wherein the training of the BoxInst segmentation model further comprises:
constructing a weakly supervised data set;
training on the weakly supervised data set with the built BoxInst segmentation model to obtain the trained BoxInst segmentation model; the established BoxInst segmentation model consists of a CotSENet backbone extraction network, an FPN layer, a Controller Head layer and a Mask Branch layer, the feature maps extracted by the CotSENet backbone extraction network are passed to the FPN layer for multi-scale fusion, target detection is performed by the Controller Head layer, final pixel prediction is performed by the Mask Branch layer, back-propagation is performed through the mask horizontal and vertical projection loss function (Projection Loss) and the pixel color similarity loss function (Pairwise Affinity Loss), and finally bounding-box-based weakly supervised instance segmentation is realized.
9. An electronic device comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the BoxInst-based weakly supervised pig image instance segmentation method according to any one of claims 1 to 7.
10. A computer readable storage medium, characterized in that a computer program is stored on the computer readable storage medium, wherein the computer program, when executed by a processor, implements the steps of the BoxInst-based weakly supervised pig image instance segmentation method according to any one of claims 1 to 7.
CN202310845480.0A 2023-07-11 2023-07-11 Weakly supervised pig image instance segmentation method, device and equipment based on BoxInst Pending CN117058369A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310845480.0A CN117058369A (en) 2023-07-11 2023-07-11 Weakly supervised pig image instance segmentation method, device and equipment based on BoxInst

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310845480.0A CN117058369A (en) 2023-07-11 2023-07-11 Weakly supervised pig image instance segmentation method, device and equipment based on BoxInst

Publications (1)

Publication Number Publication Date
CN117058369A true CN117058369A (en) 2023-11-14

Family

ID=88654334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310845480.0A Pending CN117058369A (en) 2023-07-11 2023-07-11 Weakly supervised pig image instance segmentation method, device and equipment based on BoxInst

Country Status (1)

Country Link
CN (1) CN117058369A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination