CN111259936B - Image semantic segmentation method and system based on single pixel annotation - Google Patents

Image semantic segmentation method and system based on single pixel annotation

Info

Publication number
CN111259936B
CN111259936B (application CN202010023166.0A)
Authority
CN
China
Prior art keywords
category
similarity
image
pixel
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010023166.0A
Other languages
Chinese (zh)
Other versions
CN111259936A (en)
Inventor
马惠敏
李熹
储华珍
陈衍先
易生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
University of Science and Technology Beijing USTB
Original Assignee
Tsinghua University
University of Science and Technology Beijing USTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University and University of Science and Technology Beijing (USTB)
Priority to CN202010023166.0A
Publication of CN111259936A
Application granted
Publication of CN111259936B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Abstract

The invention provides an image semantic segmentation method and system based on single-pixel labeling, wherein the method comprises the following steps: coding each category with apparent features and semantic features respectively, based on the label of a single pixel for each category; calculating the similarity between each superpixel of the training image and each category based on the feature expression of each category; updating the similarity results with image context information and a driving-scene position prior to generate initial supervision seeds; training a semantic segmentation network with the initial supervision seeds, learning shared same-class features across different instances, and updating the similarity of each superpixel to each category; iteratively executing the supervision-seed generation and similarity-update process until convergence; and storing the converged semantic segmentation network. The invention provides a feasible strategy for the weakly supervised semantic segmentation task in driving scenes and has broad application prospects in automatic driving and related scenarios.

Description

Image semantic segmentation method and system based on single pixel annotation
Technical Field
The invention relates to the technical field of pattern recognition, in particular to a method and a system for image semantic segmentation based on single pixel labeling.
Background
In the field of artificial intelligence and computer vision, image semantic segmentation is an important research area; the task aims to assign a pixel-level class label to every pixel of an image and thereby realize image understanding.
In recent years, the image understanding task for driving scenes has attracted extensive research attention in China and abroad, and increasingly impressive performance has been achieved under fully supervised conditions. These methods rely on a large number of high-precision pixel-level manual labels to train deep neural networks. However, because they depend on such extensive data labeling, model performance is limited by the collected data and often lacks sufficient generalization. When a new scene is encountered, new data must be collected and labeled, which further limits the application of these methods to driving scenes.
On the other hand, weakly supervised learning offers a lightweight alternative: a semantic segmentation network can be trained without a large number of pixel-level image labels, which gives it broad application prospects in many fields, with automatic driving as a representative example. Existing weak supervision labeling schemes mainly provide image-level or bounding-box-level labels for each category; such labels make the semantic segmentation task solvable for natural scene images that contain only a small number of object categories. However, for complex driving scenes containing a large number of categories, the existing weak supervision labeling schemes are not lightweight enough and provide little help for learning each category. Therefore, a lighter and more reasonable weak supervision labeling scheme for complex driving scenes is of great significance.
Under the combined constraints of weak supervision and complex driving scenes, the difficulty of algorithm design and training rises significantly. How to encode optimal features for each category, and how to achieve reliable pixel-level segmentation by exploiting the prior on where each category appears in a driving scene together with the corresponding features, are the urgent problems of weakly supervised semantic segmentation for driving scenes.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an image semantic segmentation method and system based on single-pixel labeling, so as to address weak supervision labeling for complex driving scenes and the feature coding of each category, and to achieve reliable pixel-level semantic segmentation under the constraints of weak supervision and complex driving scenes.
In order to solve the technical problems, the invention provides the following technical scheme:
a method for semantic segmentation of an image based on single-pixel labeling, the method comprising:
step one, coding each category with apparent features and semantic features respectively, based on the label of a single pixel for each category, and establishing a feature expression for each category;
step two, performing superpixel division on the training image, and calculating the similarity of each superpixel of the training image to each category based on the feature expression of each category;
step three, taking the similarity of each super pixel and each category as an initial condition, updating a similarity calculation result by using image context information and driving scene position prior, and generating an initial supervision seed;
step four, training a semantic segmentation network by using the initial supervision seeds, learning shared same-class features across different instances, and providing an image semantic segmentation result for updating the similarity of each superpixel to each category;
step five, iteratively executing the step three to the step four until the semantic segmentation performance of the semantic segmentation network is converged; and storing the semantic segmentation network obtained by the final training for semantic segmentation of the new image.
Further, the labels of single pixels for each category are provided as follows: for each category, only one training image containing that category is selected from the training image set, and only one pixel belonging to the category is labeled in it.
Further, the attributes of the categories include object and scene; when the categories are coded, object categories are represented by semantic features and scene categories are represented by apparent features.
Further, the semantic features are obtained by segmenting the image to be processed into a preset number of fragments, extracting features from each fragment with a pre-trained category activation mapping network model, and finally assembling a semantic feature map of preset dimensions covering the full image, so that each object is represented as one semantic feature vector; the apparent features encode color features and texture features into 96 and 32 dimensions respectively, and represent each scene as several groups of color features and texture features.
Further, when the attribute of the category is an object, the encoding process of the category includes:
dividing the image to be processed into 15 equally sized fragments, encoding each fragment into a 16 × 16 × 1000 feature map with the mapping network model, and normalizing along the 1000-dimensional feature axis; for each pixel in the image, calculating the distance between the pixel coordinate and the center coordinates of the 15 fragments, and using the 1000-dimensional feature at the corresponding position in the nearest fragment as the semantic heat-map response of that pixel;
dividing the image to be processed into several superpixels with a superpixel segmentation method, and using the average semantic heat-map response of all pixels contained in a superpixel as its semantic feature F(sp_i);
for the pixel labeled with the category, taking its 1000-dimensional feature vector as the initial class center F_c of the category;
computing the similarity between F(sp_i) and F_c;
selecting the superpixels whose similarity to F_c is within the top 1% as the set Ω_g;
alternately updating F_c and Ω_g with an E-M method until they are stable;
recording the final F_c as the coding feature of the object category.
Further, when the attribute of the category is a scene, the encoding process of the category includes:
calculating the three-channel color features of the image to be processed and the texture features from local binary pattern coding, and normalizing them; dividing the image to be processed into several superpixels with a superpixel segmentation method; for each superpixel, dividing [0,1] into 32 equal intervals in each feature channel and counting the pixel values falling into each interval; each superpixel thus obtains a 96-dimensional color feature and a 32-dimensional texture feature;
calculating the edge features and saliency features of the image to be processed, and calculating the similarity of every superpixel pair;
recording the edge distance measure between every pair of superpixels;
determining the superpixel that contains the labeled pixel of the category, calculating the similarity between every other superpixel in the image and that superpixel, and recording all superpixels whose similarity is greater than 0.5;
for the recorded super pixels, calculating the color feature similarity and the texture feature similarity of every two super pixels, and recording the product of the color feature similarity and the texture feature similarity as the apparent feature similarity of the super pixel pair; dividing the superpixels into G groups by taking 0.5 as a threshold value, and taking the average characteristic of the superpixels in each group as the class center of the group;
alternately updating the grouping of class centers and superpixels by an E-M method until the grouping is stable;
and recording the G finally obtained class centers as the coding feature group of the scene class.
Further, calculating the similarity of each superpixel of the training image to each category includes:
calculating semantic features and apparent features of the image, dividing the image into a plurality of super pixel regions, and generating the semantic features and the apparent features of each super pixel for each super pixel region;
respectively calculating the similarity of each super pixel and each category; for each category belonging to the object, calculating the similarity between the semantic features of the superpixel and the semantic features of the category;
and for each category belonging to the scene, calculating the similarity between the apparent feature of the superpixel and each feature vector in the category coding feature group, and recording the maximum value of the similarity as the similarity between the superpixel and the category.
Further, the initial generation process of the supervision seeds comprises:
calculating edge features and saliency features of the image; wherein the saliency features comprise global saliency features and local saliency features of two partition modes, and the edge features and the three saliency features are used for coding the context similarity measurement of each pair of super pixels, so as to form image context information;
generating a saliency feature of each super pixel and an edge distance metric of each pair of super pixels for each super pixel region based on the super pixel division result obtained in the step two;
recording the similarity vectors of all superpixels in the image to the object categories in matrix form as M_obj, and simultaneously recording the similarity vectors of all superpixels in the image to the scene categories in matrix form as M_sce;
Dividing a driving scene image into four equal areas from top to bottom, and specifying an appearance position range of each category; for each super pixel, only retaining the characteristic dimensions corresponding to the categories in the two similarity vectors according to the object and scene categories defined and contained in the region where the super pixel is located;
recording the maximum value in the two similarity vectors and the category corresponding to the maximum value; for two vectors corresponding to each super pixel, if the maximum value corresponding to the object type is greater than 0.05, recording the super pixel type as the object type; if the maximum value corresponding to the object type is not more than 0.05 and the maximum value corresponding to the scene type is more than 0.05, recording the superpixel type as the scene type; if the two conditions are not met, the super pixel area is not used in training;
recording the category of the superpixel, corresponding the superpixel to the original image, setting all pixels belonging to the position of the superpixel to the same category as the superpixel, obtaining label information of the whole image, and recording the label information as an initial supervision seed.
Further, when training the semantic segmentation network in the fourth step, counting the semantic segmentation result of the pixels in each super-pixel region, and taking the ratio belonging to each category as the similarity of the super-pixel and each category;
and in the iteration process of the step five, replacing the previous similarity with the new similarity obtained in the step four, and alternately iterating the step three to the step four until the semantic segmentation performance of the semantic segmentation network is converged.
Accordingly, in order to solve the above technical problems, the present invention further provides the following technical solutions:
a single-pixel annotation based image semantic segmentation system, the system comprising:
the category coding module is used for respectively coding each category by utilizing the apparent characteristics and the semantic characteristics based on the label of each category single pixel and establishing the characteristic expression of each category;
the similarity calculation module is used for performing superpixel division on the training images and calculating the similarity between each superpixel of the training images and each category based on the feature expression of each category;
the initial supervision seed generation module is used for updating a similarity calculation result by using the similarity of each super pixel and each category as an initial condition and using image context information and driving scene position prior to generate an initial supervision seed;
the semantic segmentation network training module is used for training a semantic segmentation network with the initial supervision seeds, learning shared same-class features across different instances, providing image semantic segmentation results, and updating the similarity of each superpixel to each category;
an iteration module for iteratively executing the initial supervision seed generation module and the semantic segmentation network training module until the semantic segmentation performance of the semantic segmentation network converges;
and the semantic segmentation network storage module is used for storing the semantic segmentation network obtained from the final training iteration, for performing semantic segmentation on new images.
The technical scheme of the invention has the following beneficial effects:
the image semantic segmentation method based on single pixel labeling provides a lightweight labeling condition, and only one pixel point is labeled for each category; the image semantic segmentation performance is iteratively optimized by alternately realizing the region category similarity calculation based on the image context and a large number of example similarity feature learning processes based on a semantic segmentation network, so that the high-precision segmentation of objects in the image in a driving scene is realized under the condition that each category is only labeled by a single pixel; a feasible strategy is provided for the weak supervision semantic segmentation task in the driving scene, and the application of the feasible strategy in the scenes such as automatic driving and the like has wide prospect.
Drawings
FIG. 1 is a schematic flow chart of a single-pixel labeling-based image semantic segmentation method according to the present invention;
FIG. 2 is a sample diagram for obtaining semantic features of a driving scenario according to the present invention;
FIG. 3 is a schematic flow chart of the present invention for generating the encoding feature or feature group of each category from the single pixel label of the category;
FIG. 4 is a sample diagram of the present invention for extracting edge features, local and global saliency features of an image;
FIG. 5 is a schematic diagram of the area division of the driving scene and the occurrence position range of each category;
FIG. 6 is a graph showing how the segmentation performance of the image semantic segmentation method based on single-pixel labeling grows with the number of iterations.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
First embodiment
Referring to fig. 1 to fig. 6, this embodiment provides an image semantic segmentation method based on single-pixel labeling, which includes the following steps:
Step one, coding each category with apparent features and semantic features respectively, based on the label of a single pixel for each category, and establishing a feature expression for each category.
The category labels are provided as follows: for a complex driving scene, each category is labeled with a single pixel; only one training image containing the category is selected from the training image set and only one pixel belonging to the category is labeled in it, which constitutes the labeling condition of the task.
Specifically, for an image semantic segmentation task in a driving scene, suppose the labels to be provided cover C categories. For a training set containing N images, this embodiment selects a subset of the images such that every category is covered by at least one sample in the selection; then, for each category, one pixel in the selected samples is given a category label. That is, for an image semantic segmentation task with C categories and a training set of N images, this embodiment selects only one pixel per category for labeling, so that in the entire training set only C pixels, belonging to K images (K ≤ C, K ≤ N), carry category labels.
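A hypothetical illustration of how light this supervision is: the entire annotation for the training set can be held in a small mapping from category name to one labeled pixel. All file names and coordinates below are made up for illustration and do not come from the patent.

```python
# Hypothetical annotation container: for C categories spread over K images,
# the whole weak supervision is just C labeled pixels.
single_pixel_annotations = {
    "car":    {"image": "train_0001.png", "pixel": (412, 655)},   # (row, col)
    "person": {"image": "train_0017.png", "pixel": (300, 120)},
    "road":   {"image": "train_0001.png", "pixel": (680, 512)},
    "sky":    {"image": "train_0093.png", "pixel": (40, 900)},
    # ... one entry per category, C entries in total
}
```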
Based on this labeling condition, each category is assigned one of two attributes, object or scene, and the optimal feature expression of each category is coded. For a driving scene containing C categories, this embodiment first defines, based on the characteristics of each category, instances such as vehicles and pedestrians as objects, and non-instance regions such as roads and sky as scenes. By analyzing the feature expression of the single labeled pixel of each category, the two attribute types are coded with different, suitable features, and the feature expressions of the C categories are established. Object categories are represented by semantic features; scene categories are represented by apparent features. The semantic features are obtained by segmenting the image to be processed into a preset number of fragments, extracting features from each fragment with a pre-trained category activation mapping network model, and finally assembling a semantic feature map of preset dimensions covering the full image, so that each object is represented as one semantic feature vector. The apparent features encode color features and texture features into 96 and 32 dimensions respectively, and represent each scene as several groups of color and texture features.
Specifically, the coding flow adopted in this embodiment is shown in fig. 3. First, for each image containing labeled pixels, the attribute of the category of each labeled pixel it contains is determined. If the category belongs to an object, the following steps are performed to calculate the coding feature representing the category:
S1, a category activation mapping (CAM) network pre-trained on ImageNet is used with a multi-overlapping-fragment fusion strategy to obtain a heat map encoding the semantic features of the image; an example is shown in FIG. 2. First, this embodiment divides the image into 15 regions of equal size. For an image of length L and width W, each region has length L/2 and width W/3; the top-left pixels of the 15 regions have longitudinal coordinates (0, L/4, L/2) and transverse coordinates (0, W/6, W/3, W/2, 2W/3). Each region is encoded by the CAM into a 16 × 16 × 1000 feature map, which this embodiment normalizes along the 1000-dimensional feature axis. Then, for each pixel in the image, the distance between the pixel coordinate and the center coordinates of the 15 region fragments is calculated, and the 1000-dimensional feature at the corresponding position in the nearest fragment is used as the semantic heat-map response of that pixel (a sketch follows after these steps);
S2, the image is divided into K superpixels with a superpixel segmentation method. For each superpixel sp_i, the average of the semantic heat-map responses of all pixels it contains is used as its semantic feature F(sp_i);
S3, for the labeled pixel, the corresponding 1000-dimensional feature vector is taken as the initial class center of the category, denoted F_c. Using (formula 1), with F(sp_i) and F_c in the roles of X_i and X_j, the similarity between them is calculated;
(formula 1: similarity measure between two feature vectors X_i and X_j; the original expression appears only as an image and is not reproduced here)
S4, the superpixels with the highest similarity to the class center (the top 1%) are selected as the set Ω_g;
S5, using (formula 2), the class center F_c and the selected set Ω_g are alternately updated with an E-M method until they are stable;
(formula 2: E-M update of the class center from the selected superpixels; the original expression appears only as an image and is not reproduced here)
S6, the finally obtained class center F_c is recorded as the coding feature of the object category.
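The two sketches below illustrate, in Python, the fragment tiling and per-pixel heat-map assignment of S1 and the E-M estimation of the object class center of S3 to S6. They are minimal sketches rather than the patent's implementation: the CAM network is passed in as an opaque cam_forward function, and cosine similarity stands in for (formula 1), whose exact form is not reproduced in this text.

```python
import numpy as np

def fragment_origins(L, W):
    """Top-left corners of the 15 overlapping fragments (3 vertical x 5 horizontal starts),
    each of size (L/2) x (W/3), as described in S1."""
    ys = [0, L // 4, L // 2]
    xs = [0, W // 6, W // 3, W // 2, 2 * W // 3]
    return [(y, x) for y in ys for x in xs]

def per_pixel_semantic_feature(image, cam_forward):
    """For every pixel, use the 1000-d CAM feature of the nearest fragment
    (nearest by distance between the pixel and the fragment center)."""
    L, W = image.shape[:2]
    fh, fw = L // 2, W // 3
    origins = fragment_origins(L, W)
    centers = np.array([(y + fh / 2.0, x + fw / 2.0) for y, x in origins])

    frag_feats = []
    for y, x in origins:
        f = cam_forward(image[y:y + fh, x:x + fw])              # assumed to return (16, 16, 1000)
        f = f / (np.abs(f).sum(axis=-1, keepdims=True) + 1e-8)  # normalise the 1000-d axis
        frag_feats.append(f)

    heat = np.zeros((L, W, 1000), dtype=np.float32)
    ys, xs = np.mgrid[0:L, 0:W]
    d2 = (ys[..., None] - centers[:, 0]) ** 2 + (xs[..., None] - centers[:, 1]) ** 2
    nearest = d2.argmin(axis=-1)                                 # closest fragment index per pixel
    for k, (y, x) in enumerate(origins):
        py, px = np.nonzero(nearest == k)
        gy = np.clip((py - y) * 16 // fh, 0, 15)                 # map into the 16 x 16 feature grid
        gx = np.clip((px - x) * 16 // fw, 0, 15)
        heat[py, px] = frag_feats[k][gy, gx]
    return heat
```

```python
import numpy as np

def cosine_sim(a, b):
    """Stand-in for (formula 1); the patent's exact similarity measure is not reproduced."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def em_object_class_center(sp_feats, f_init, top_ratio=0.01, max_iter=50):
    """S3-S6: alternately update the class center F_c and the top-1% set Omega_g until stable.
    sp_feats: (K, 1000) superpixel semantic features; f_init: feature of the labeled pixel."""
    f_c, omega = f_init.copy(), None
    k_sel = max(1, int(round(top_ratio * len(sp_feats))))
    for _ in range(max_iter):
        sims = np.array([cosine_sim(f, f_c) for f in sp_feats])
        new_omega = set(np.argsort(-sims)[:k_sel].tolist())      # E-step: pick the top-1% superpixels
        if new_omega == omega:                                   # stable: the selection stops changing
            break
        omega = new_omega
        f_c = sp_feats[sorted(omega)].mean(axis=0)               # M-step: re-estimate the class center
    return f_c, omega
```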
If the category belongs to a scene, the following steps are performed to calculate the coding feature group representing the category:
S1, the three-channel color features of the image and the texture features from local binary pattern (LBP) coding are calculated and normalized. The image is divided into K superpixels with a superpixel segmentation method. For each superpixel sp_i, in each feature channel, this embodiment divides [0,1] into 32 equal intervals and counts the pixel values falling into each interval. Thus, each superpixel obtains a 96-dimensional color feature and a 32-dimensional texture feature;
S2, the edge features and saliency features of the image are calculated. Specifically, when calculating the saliency features, this embodiment on the one hand extracts saliency features directly from the original image and, on the other hand, adopts two different modes to segment the image into several fragments, extracts saliency features for each fragment, and stitches the fragment feature maps back into a complete map, as shown in the example of fig. 4. In the first mode, the image is divided into 5 regions: the upper 1/4 region, the lower 1/4 region, and the remaining middle region trisected from left to right. In the second mode, the image is divided into two equal parts along its length and four equal parts along its width, giving 8 equal-sized fragment regions. Further, this embodiment uses (formula 1) to calculate the similarity of every superpixel pair sp_i and sp_j, recorded as Sim_g(i, j) for the global saliency feature and as Sim_l1(i, j) and Sim_l2(i, j) for the two local saliency features;
S3, this embodiment records an edge distance measure between every two superpixels, defined as E(i, j) = exp(-3.5 · s), where s is the sum of the values, on the edge feature map, of the pixels lying on the line connecting the centers of the two superpixels. Note that on the edge feature map a larger pixel value indicates a more prominent edge; therefore, under this definition, the larger the edge distance measure of two superpixels, the weaker the edge between them;
S4, the superpixel containing the labeled pixel is denoted sp_anno. The similarity between every other superpixel in the image and sp_anno is calculated with (formula 3), and all superpixels with similarity greater than 0.5 are recorded;
(formula 3: similarity between sp_anno and another superpixel; the original expression appears only as an image and is not reproduced here)
S5, for the recorded superpixels, the color feature similarity and texture feature similarity of every pair are calculated with (formula 1), and their product is recorded as the apparent feature similarity of the superpixel pair. Using 0.5 as a threshold, the superpixels are divided into G groups, and the average feature of the superpixels in each group is taken as the initial class center F_g of the group;
S6, using (formula 2), the class centers F_g and the grouping of the superpixels are alternately updated with an E-M method until they are stable;
S7, the finally obtained G class centers {F_1, ..., F_G} are recorded as the coding feature group of the scene category.
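The sketch below shows, in Python, the per-superpixel color and texture histograms of S1 and the edge distance measure of S3. It is a minimal sketch under stated assumptions: the color image and LBP map are assumed to be normalized to [0, 1], and the line between superpixel centers is rasterized by uniform sampling, which the patent does not specify.

```python
import numpy as np

def superpixel_histograms(color, lbp, sp_labels, n_sp, bins=32):
    """S1: a 96-d color histogram (32 bins x 3 channels) and a 32-d LBP texture histogram
    per superpixel; `color` (H, W, 3) and `lbp` (H, W) are assumed normalised to [0, 1]."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    color_feat = np.zeros((n_sp, 3 * bins))
    tex_feat = np.zeros((n_sp, bins))
    for i in range(n_sp):
        mask = sp_labels == i
        for c in range(3):
            color_feat[i, c * bins:(c + 1) * bins], _ = np.histogram(color[..., c][mask], bins=edges)
        tex_feat[i], _ = np.histogram(lbp[mask], bins=edges)
    return color_feat, tex_feat

def edge_distance(edge_map, center_i, center_j, scale=3.5):
    """S3: E(i, j) = exp(-3.5 * sum of edge-map values along the line between the two centers)."""
    (yi, xi), (yj, xj) = center_i, center_j
    n = int(max(abs(yj - yi), abs(xj - xi))) + 1
    ys = np.linspace(yi, yj, n).round().astype(int)
    xs = np.linspace(xi, xj, n).round().astype(int)
    return float(np.exp(-scale * edge_map[ys, xs].sum()))
```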
Step two, performing superpixel division on the training image, and calculating the similarity of each superpixel of the training image to each category based on the feature expression of each category;
specifically, for each image in the training set, the following steps are taken in the embodiment:
s1, calculating semantic features and apparent features of the image, dividing the image into a plurality of super-pixel regions, and generating the semantic features and the apparent features of each super-pixel in the embodiment by adopting the same method as the step I for each super-pixel region;
s2, calculating each superpixel sp respectivelyiSimilarity to objects and scene categories. For each class belonging to the object, the similarity between the semantic feature of the super pixel and the semantic feature of each class is calculated by using (formula 1), and the similarity is compared with all the semantic features of all the classesThe similarity of object classes is recorded as a vector
Figure BDA0002361522860000091
S3, for each category belonging to the scene, calculating the similarity between the apparent feature of the super pixel and each feature vector in the feature group of the category by using (formula 1), recording the maximum value as the similarity between the super pixel and the category, and finally recording the similarity between the super pixel and all the scene categories as vectors
Figure BDA0002361522860000092
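A minimal sketch of S2 and S3, assuming the class centers produced in step one are available as plain dictionaries; the (formula 1) similarity is passed in as a function because its exact form is not reproduced in this text.

```python
def category_similarity_vectors(sp_sem, sp_app, obj_centers, scene_groups, sim):
    """Similarity of one superpixel to every object and scene category (S2-S3).
    obj_centers: {name: 1000-d semantic center}; scene_groups: {name: list of apparent centers};
    sim(a, b): the (formula 1) similarity between two feature vectors."""
    sim_obj = {c: sim(sp_sem, f_c) for c, f_c in obj_centers.items()}
    # for a scene category, keep the maximum similarity over its G class centers
    sim_sce = {c: max(sim(sp_app, f_g) for f_g in group) for c, group in scene_groups.items()}
    return sim_obj, sim_sce
```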
Step three, taking the similarity of each super pixel and each category as an initial condition, updating a similarity calculation result by using image context information and driving scene position prior, and generating an initial supervision seed;
specifically, for each image in the training set, the following steps are taken in the embodiment:
and S1, calculating the edge characteristics and the salient characteristics of the image. Here, the way of calculating the edge, salient feature and step one are exactly the same.
S2, based on the super-pixel division result in the step two, for each super-pixel area, generating the significance characteristic of each super-pixel and the edge distance measurement of each pair of super-pixels by adopting the same method in the step one;
s3, recording similarity vectors of all K superpixels in the image to the object as a matrix form MobjEach column of the matrix represents the similarity of one superpixel to each object class, and each row represents the similarity of all superpixels to one object class. Processing the scene similarity in the same way to obtain a matrix Msce. And (3) calculating to obtain a context similarity matrix between every two super pixels.
S4, using (formula 4), updating matrix MobjAnd MsceTo obtain a matrix
Figure BDA0002361522860000093
And
Figure BDA0002361522860000094
Figure BDA0002361522860000095
s5, for each super pixel spiThe embodiment records the updated object similarity vector and scene similarity vector;
s6, in the driving scene, each category has its area in which it appears intensively, and for one image, the present embodiment divides it into four equal areas from top to bottom, and specifies the range of the appearance position of each category, as shown in the schematic diagram 5. For each super-pixel, the present embodiment only retains the feature dimensions corresponding to the object and scene categories in the two similarity vectors according to the object and scene categories defined and contained in the region where the super-pixel is located.
S7, for the two processed vectors, the present embodiment records the maximum value and the object (or scene) class corresponding to the maximum value, and defines the object class of the super pixel as Clsobj(spi) The scene category is Clssce(spi). Here, for two vectors corresponding to each superpixel, if the maximum value corresponding to the object class is greater than 0.05, recording the superpixel class as the object class; if the maximum value corresponding to the object type is not more than 0.05 and the maximum value corresponding to the scene type is more than 0.05, recording the superpixel type as the scene type; if the two conditions are not met, the super pixel area is not used in training; that is, if the maximum value corresponding to the object (or scene) category is less than 0.05, the present embodiment records the object (or scene) category as 255, that is, it is not used during training.
S8, the present embodiment records the final classification of the superpixel using (equation 5). Further, the superpixels are corresponded to the positions of the superpixels in the image, all pixels belonging to the superpixel positions are set as class labels the same as the superpixels, label information of the whole image is obtained and recorded as initial supervision seeds, and the initial supervision seeds are used for training of the semantic segmentation network.
Figure BDA0002361522860000101
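A minimal Python sketch of the decision rule of S6 to S8, assuming the per-superpixel similarities and the per-region allowed category sets are given as plain dictionaries; names such as allowed_obj are illustrative and not from the patent.

```python
IGNORE = 255  # superpixels that satisfy neither condition are ignored during training (S7)

def assign_superpixel_label(sim_obj, sim_sce, allowed_obj, allowed_sce, thr=0.05):
    """S6-S8: apply the position prior, then prefer an object category over a scene
    category using the 0.05 threshold; otherwise mark the superpixel as ignored.
    sim_obj / sim_sce: {category: similarity}; allowed_*: categories allowed in this region."""
    obj = {c: s for c, s in sim_obj.items() if c in allowed_obj}
    sce = {c: s for c, s in sim_sce.items() if c in allowed_sce}
    best_obj = max(obj, key=obj.get) if obj else None
    best_sce = max(sce, key=sce.get) if sce else None
    if best_obj is not None and obj[best_obj] > thr:
        return best_obj
    if best_sce is not None and sce[best_sce] > thr:
        return best_sce
    return IGNORE
```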
Step four, training a semantic segmentation network with the initial supervision seeds, learning shared same-class features across different instances, and providing image semantic segmentation results for updating the similarity of each superpixel to each category;
when the semantic segmentation network is trained, the process is realized by utilizing all training images and labels generated by corresponding images in a consistent manner by adopting a semantic segmentation method under the condition of full supervision learning.
The specific training process is as follows:
given N training images IiAnd label G correspondingly generated by the N pieces obtained in the step threeiTraining a segmentation network f (theta, I) with theta as a parameter, the output of the network representing the probability that the label y of the pixel pix belongs to the class c, i.e. fpix(θ,I)=Ppix(y ═ c | I). The loss function adopted in the training is a cross entropy function.
Step five, iteratively executing step three to step four until the segmentation performance of the semantic segmentation network converges;
and in the iteration process of the step five, replacing the previous similarity with the new similarity obtained in the step four, and alternately iterating the step three to the step four until the semantic segmentation performance of the semantic segmentation network is converged.
The specific method is as follows: for each superpixel region, the semantic segmentation results of the pixels inside the superpixel are counted, and the ratio belonging to each category is taken as the similarity between the superpixel and that category. It should be noted that, thanks to the context-information correction in step three and the same-class feature learning of the semantic segmentation network in step four, the accuracy of the class similarity provided here is significantly higher than that of the initial class similarity provided in step two.
Steps three and four are executed repeatedly and alternately, fully fusing the information of the same object based on the image context and across different instances between images, so that increasingly accurate semantic segmentation results are obtained. Fig. 6 shows the growth curve of the segmentation performance of this embodiment on the Cityscapes semantic segmentation data set as the number of iterations increases.
Step six, storing the segmentation network obtained by the final training iteration for semantic segmentation of new images.
It should be noted that, after the training process of steps one to five, only the semantic segmentation network obtained at the end of step five is needed for semantic segmentation inference on new images in practical applications, so the algorithm is highly efficient in practice.
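A minimal inference sketch, under the same PyTorch-style assumptions as the training sketch above.

```python
import torch

def segment_image(model, image_tensor, device="cuda"):
    """Step six: per-pixel argmax over the stored network's class scores."""
    model.eval().to(device)
    with torch.no_grad():
        logits = model(image_tensor.unsqueeze(0).to(device))   # (1, C, H, W)
    return logits.argmax(dim=1).squeeze(0).cpu()               # (H, W) predicted class map
```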
The image semantic segmentation method based on single-pixel labeling provides a lightweight labeling condition: only one pixel is labeled for each category. By alternately performing region-category similarity calculation based on image context and large-scale instance feature learning based on a semantic segmentation network, the image semantic segmentation performance is iteratively optimized, so that high-precision segmentation of objects in driving-scene images is achieved even though each category is labeled with only a single pixel. This provides a feasible strategy for the weakly supervised semantic segmentation task in driving scenes and has broad application prospects in automatic driving and related scenarios.
Second embodiment
The embodiment provides an image semantic segmentation system based on single pixel annotation, which includes:
the category coding module is used for respectively coding each category by utilizing the apparent characteristics and the semantic characteristics based on the label of each category single pixel and establishing the characteristic expression of each category;
the similarity calculation module is used for performing superpixel division on the training images and calculating the similarity between each superpixel of the training images and each category based on the feature expression of each category;
the initial supervision seed generation module is used for updating a similarity calculation result by using the similarity of each super pixel and each category as an initial condition and using image context information and driving scene position prior to generate an initial supervision seed;
the semantic segmentation network training module is used for training a semantic segmentation network with the initial supervision seeds, learning shared same-class features across different instances, providing image semantic segmentation results, and updating the similarity of each superpixel to each category;
an iteration module for iteratively executing the initial supervision seed generation module and the semantic segmentation network training module until the semantic segmentation performance of the semantic segmentation network converges;
and the semantic segmentation network storage module is used for storing the semantic segmentation network obtained from the final training iteration, for performing semantic segmentation on new images.
The image semantic segmentation system based on single-pixel labeling of this embodiment corresponds to the image semantic segmentation method based on single-pixel labeling of the first embodiment; the functions realized by the functional modules of the system correspond one to one to the steps of the method; therefore, they are not described again here.
Furthermore, it should be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied in the medium.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
Finally, it should be noted that while the above describes a preferred embodiment of the invention, it will be appreciated by those skilled in the art that, once the basic inventive concepts have been learned, numerous changes and modifications may be made without departing from the principles of the invention, which shall be deemed to be within the scope of the invention. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Claims (6)

1. A semantic segmentation method for an image based on single pixel labeling is characterized by comprising the following steps:
step one, respectively coding each category with apparent features and semantic features, based on the label of a single pixel for each category, and establishing a feature expression for each category;
step two, performing superpixel division on the training image, and calculating the similarity of each superpixel of the training image to each category based on the feature expression of each category;
step three, taking the similarity of each super pixel and each category as an initial condition, updating a similarity calculation result by using image context information and driving scene position prior, and generating an initial supervision seed;
step four, training a semantic segmentation network by using the initial supervision seeds, learning shared same-class features across different instances, and providing an image semantic segmentation result for updating the similarity of each superpixel to each category;
step five, iteratively executing the step three to the step four until the semantic segmentation performance of the semantic segmentation network is converged; storing the semantic segmentation network obtained by the final training for carrying out semantic segmentation on the new image;
attributes of the categories include object and scene; when the categories are coded, object categories are represented by semantic features and scene categories are represented by apparent features; the semantic features are obtained by segmenting the image to be processed into a preset number of fragments, extracting features from each fragment with a pre-trained category activation mapping network model, and finally assembling a semantic feature map of preset dimensions covering the full image, so that each object is represented as one semantic feature vector; the apparent features encode color features and texture features into 96 and 32 dimensions respectively, and represent each scene as several groups of color features and texture features;
when the attribute of the category is the object, the encoding process of the category comprises the following steps:
dividing the image to be processed into 15 equally sized fragments, encoding each fragment into a 16 × 16 × 1000 feature map with the mapping network model, and normalizing along the 1000-dimensional feature axis; for each pixel in the image, calculating the distance between the pixel coordinate and the center coordinates of the 15 fragments, and using the 1000-dimensional feature at the corresponding position in the nearest fragment as the semantic heat-map response of that pixel;
dividing the image to be processed into several superpixels with a superpixel segmentation method, and using the average semantic heat-map response of all pixels contained in a superpixel as its semantic feature F(sp_i);
for the pixel labeled with the category, taking its 1000-dimensional feature vector as the initial class center F_c of the category;
computing the similarity between F(sp_i) and F_c;
selecting the superpixels whose similarity to F_c is within the top 1% as the set Ω_g;
alternately updating F_c and Ω_g with an E-M method until they are stable;
recording the final F_c as the coding feature of the object category;
when the attribute of the category is a scene, the encoding process of the category comprises the following steps:
calculating the three-channel color features of the image to be processed and the texture features from local binary pattern coding, and normalizing them; dividing the image to be processed into several superpixels with a superpixel segmentation method; for each superpixel, dividing [0,1] into 32 equal intervals in each feature channel and counting the pixel values falling into each interval; each superpixel thus obtains a 96-dimensional color feature and a 32-dimensional texture feature;
calculating the edge features and saliency features of the image to be processed, and calculating the similarity of every superpixel pair;
recording the edge distance measure between every pair of superpixels;
determining the superpixel that contains the labeled pixel of the category, calculating the similarity between every other superpixel in the image and that superpixel, and recording all superpixels whose similarity is greater than 0.5;
for the recorded super pixels, calculating the color feature similarity and the texture feature similarity of every two super pixels, and recording the product of the color feature similarity and the texture feature similarity as the apparent feature similarity of the super pixel pair; dividing the superpixels into G groups by taking 0.5 as a threshold value, and taking the average characteristic of the superpixels in each group as the class center of the group;
alternately updating the grouping of class centers and superpixels by an E-M method until the grouping is stable;
and recording the G finally obtained class centers as the coding feature group of the scene class.
2. The method for semantic segmentation of images based on single-pixel labeling according to claim 1, wherein the labels of single pixels for each category are provided as follows: for each category, only one training image containing that category is selected from the training image set, and only one pixel belonging to the category is labeled.
3. The method for semantic segmentation of images based on single-pixel labeling according to claim 1, wherein calculating the similarity of each superpixel of the training images to each category comprises:
calculating semantic features and apparent features of the image, dividing the image into a plurality of super pixel regions, and generating the semantic features and the apparent features of each super pixel for each super pixel region;
respectively calculating the similarity of each super pixel and each category; for each category belonging to the object, calculating the similarity between the semantic features of the superpixel and the semantic features of the category;
and for each category belonging to the scene, calculating the similarity between the apparent feature of the superpixel and each feature vector in the category coding feature group, and recording the maximum value of the similarity as the similarity between the superpixel and the category.
4. The method for semantic segmentation of images based on single-pixel labeling according to claim 1, wherein the generation process of the initial supervision seed comprises:
calculating edge features and saliency features of the image; wherein the saliency features comprise global saliency features and local saliency features of two partition modes, and the edge features and the three saliency features are used for coding the context similarity measurement of each pair of super pixels, so as to form image context information;
generating a saliency feature of each super pixel and an edge distance metric of each pair of super pixels for each super pixel region based on the super pixel division result obtained in the step two;
recording the similarity vectors of all superpixels in the image to the object categories in matrix form as M_obj, and simultaneously recording the similarity vectors of all superpixels in the image to the scene categories in matrix form as M_sce;
Dividing a driving scene image into four equal areas from top to bottom, and specifying an appearance position range of each category; for each super pixel, only retaining the characteristic dimensions corresponding to the categories in the two similarity vectors according to the object and scene categories defined and contained in the region where the super pixel is located;
recording the maximum value in the two similarity vectors and the category corresponding to the maximum value; for two vectors corresponding to each super pixel, if the maximum value corresponding to the object type is greater than 0.05, recording the super pixel type as the object type; if the maximum value corresponding to the object type is not more than 0.05 and the maximum value corresponding to the scene type is more than 0.05, recording the superpixel type as the scene type; if the two conditions are not met, the super pixel area is not used in training;
recording the category of the superpixel, corresponding the superpixel to the original image, setting all pixels belonging to the position of the superpixel to the same category as the superpixel, obtaining label information of the whole image, and recording the label information as an initial supervision seed.
5. The image semantic segmentation method based on single pixel labeling according to claim 1, wherein when training the semantic segmentation network in the fourth step, the semantic segmentation result of the pixels in each super-pixel region is counted, and the ratio belonging to each category is used as the similarity between the super-pixel and each category;
and in the iteration process of the step five, replacing the previous similarity with the new similarity obtained in the step four, and alternately iterating the step three to the step four until the semantic segmentation performance of the semantic segmentation network is converged.
6. An image semantic segmentation system based on single pixel labeling, comprising:
the category coding module is used for respectively coding each category by utilizing the apparent characteristics and the semantic characteristics based on the label of each category single pixel and establishing the characteristic expression of each category;
the similarity calculation module is used for performing superpixel division on the training images and calculating the similarity between each superpixel of the training images and each category based on the feature expression of each category;
the initial supervision seed generation module is used for updating a similarity calculation result by using the similarity of each super pixel and each category as an initial condition and using image context information and driving scene position prior to generate an initial supervision seed;
the semantic segmentation network training module is used for training a semantic segmentation network with the initial supervision seeds, learning shared same-class features across different instances, providing image semantic segmentation results, and updating the similarity of each superpixel to each category;
an iteration module for iteratively executing the initial supervision seed generation module and the semantic segmentation network training module until the semantic segmentation performance of the semantic segmentation network converges;
the semantic segmentation network storage module is used for storing a semantic segmentation network obtained by the final training and is used for carrying out semantic segmentation on a new image;
attributes of the categories include object and scene; when the categories are coded, object categories are represented by semantic features and scene categories are represented by apparent features; the semantic features are obtained by segmenting the image to be processed into a preset number of fragments, extracting features from each fragment with a pre-trained category activation mapping network model, and finally assembling a semantic feature map of preset dimensions covering the full image, so that each object is represented as one semantic feature vector; the apparent features encode color features and texture features into 96 and 32 dimensions respectively, and represent each scene as several groups of color features and texture features;
when the attribute of the category is the object, the encoding process of the category comprises the following steps:
the image to be processed is divided into 15 fragments with equal size, each fragment is coded into a characteristic diagram with dimensions of 16 multiplied by 1000 through a mapping network model, and the characteristic dimension with dimensions of 1000 is normalized; for each pixel in the image, calculating the distance between the coordinate of the pixel and the central coordinates of 15 fragments, and using the 1000-dimensional feature at the position corresponding to the pixel in the fragment closest to the pixel as the semantic heat map response of the pixel;
dividing the image to be processed into a plurality of superpixels with a superpixel segmentation method, and taking the mean of the semantic heat map responses of all pixels contained in each superpixel as the semantic feature of that superpixel;
for the pixel labeled with the category, taking its corresponding 1000-dimensional feature vector as the initial class center of the category;
computing the similarity between the semantic feature of each superpixel and the class center;
selecting the superpixels whose similarity to the class center ranks in the top 1% as the set Ω_g;
alternately updating the class center and Ω_g by an E-M method until they are stable;
recording the final class center as the coded feature of the object category;
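For illustration only, a minimal sketch of the E-M style alternation described above, assuming cosine similarity on L2-normalized 1000-dimensional CAM features; `sp_features` (one semantic feature per superpixel) and `seed_feature` (the feature at the single labeled pixel) are hypothetical inputs, and the similarity measure itself is an assumption, since the claim does not fix it.

```python
# Minimal sketch of refining an object class center by alternating between
# the class center and the set of most similar superpixels (Omega_g).
import numpy as np

def refine_object_class_center(sp_features: np.ndarray,
                               seed_feature: np.ndarray,
                               top_ratio: float = 0.01,
                               max_iters: int = 20) -> np.ndarray:
    feats = sp_features / (np.linalg.norm(sp_features, axis=1, keepdims=True) + 1e-12)
    center = seed_feature / (np.linalg.norm(seed_feature) + 1e-12)
    top_k = max(1, int(round(top_ratio * len(feats))))
    members = None
    for _ in range(max_iters):
        sims = feats @ center                    # E-step: similarity of every superpixel to the center
        new_members = np.argsort(-sims)[:top_k]  # keep the top 1% most similar superpixels (Omega_g)
        if members is not None and set(new_members) == set(members):
            break                                # membership is stable -> stop
        members = new_members
        center = feats[members].mean(axis=0)     # M-step: recompute the class center
        center /= (np.linalg.norm(center) + 1e-12)
    return center
```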
when the attribute of the category is a scene, the encoding process of the category comprises the following steps:
computing the three-channel color features of the image to be processed and the texture features obtained by local binary pattern (LBP) coding, and normalizing them; dividing the image to be processed into a plurality of superpixels with a superpixel segmentation method; for each superpixel, dividing [0,1] into 32 equal intervals in each feature channel and counting the pixel values falling into each interval; thus, each superpixel obtains a 96-dimensional color feature and a 32-dimensional texture feature;
computing the edge features and the saliency features of the image to be processed, and computing the similarity of every pair of superpixels;
recording the edge distance measure between every two superpixels;
determining the superpixels that contain the labeled pixels of the category, computing the similarity between every other superpixel in the image and these superpixels, and recording all superpixels whose similarity is greater than 0.5;
for the recorded superpixels, computing the color feature similarity and the texture feature similarity of every two superpixels, and recording their product as the apparent feature similarity of the superpixel pair; dividing the superpixels into G groups with 0.5 as the threshold, and taking the average feature of the superpixels in each group as the class center of that group;
alternately updating the class centers and the grouping of the superpixels by an E-M method until the grouping is stable;
and recording the G class centers finally obtained as the coded feature group of the scene category.
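For illustration only, a minimal sketch of the scene appearance encoding above: per-superpixel 32-bin histograms over three normalized color channels (96-dimensional color feature) and over a normalized local binary pattern map (32-dimensional texture feature), with the apparent feature similarity of a superpixel pair taken as the product of its color and texture similarities. The histogram-intersection similarity used for each cue is an assumption (the claim only fixes the product), and `color`, `lbp`, and `sp_labels` are hypothetical inputs.

```python
# Minimal sketch of per-superpixel appearance features and pairwise similarity
# (not the claimed implementation).
import numpy as np

def superpixel_appearance_features(color, lbp, sp_labels, bins=32):
    """color: (H, W, 3) image with channels normalized to [0, 1].
    lbp: (H, W) local binary pattern map normalized to [0, 1].
    sp_labels: (H, W) superpixel index per pixel."""
    num_sp = int(sp_labels.max()) + 1
    color_feat = np.zeros((num_sp, 3 * bins))
    tex_feat = np.zeros((num_sp, bins))
    edges = np.linspace(0.0, 1.0, bins + 1)       # 32 equal intervals on [0, 1]
    for sp in range(num_sp):
        mask = sp_labels == sp
        n = max(int(mask.sum()), 1)
        for c in range(3):                        # 3 channels x 32 bins = 96-d color feature
            hist, _ = np.histogram(color[..., c][mask], bins=edges)
            color_feat[sp, c * bins:(c + 1) * bins] = hist / n
        hist, _ = np.histogram(lbp[mask], bins=edges)
        tex_feat[sp] = hist / n                   # 32-d texture feature
    return color_feat, tex_feat

def appearance_similarity(color_feat, tex_feat, i, j):
    # Histogram-intersection similarity for each cue, combined by product.
    color_sim = np.minimum(color_feat[i], color_feat[j]).sum()
    tex_sim = np.minimum(tex_feat[i], tex_feat[j]).sum()
    return color_sim * tex_sim
```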
CN202010023166.0A 2020-01-09 2020-01-09 Image semantic segmentation method and system based on single pixel annotation Active CN111259936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010023166.0A CN111259936B (en) 2020-01-09 2020-01-09 Image semantic segmentation method and system based on single pixel annotation

Publications (2)

Publication Number Publication Date
CN111259936A (en) 2020-06-09
CN111259936B (en) 2021-06-01

Family

ID=70953930

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010023166.0A Active CN111259936B (en) 2020-01-09 2020-01-09 Image semantic segmentation method and system based on single pixel annotation

Country Status (1)

Country Link
CN (1) CN111259936B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112085739A (en) * 2020-08-20 2020-12-15 深圳力维智联技术有限公司 Semantic segmentation model training method, device and equipment based on weak supervision
CN112241959A (en) * 2020-09-23 2021-01-19 天津大学 Attention mechanism generation semantic segmentation method based on superpixels
CN112800265B (en) * 2021-02-01 2022-03-08 中国科学院空天信息创新研究院 Image segmentation data annotation method and system based on unsupervised or weakly supervised mode
CN112927244A (en) * 2021-03-31 2021-06-08 清华大学 Three-dimensional scene segmentation method and device under weak supervision
CN113361363B (en) * 2021-05-31 2024-02-06 北京百度网讯科技有限公司 Training method, device, equipment and storage medium for face image recognition model
CN113344012B (en) * 2021-07-14 2022-08-23 马上消费金融股份有限公司 Article identification method, device and equipment
CN113780532B (en) * 2021-09-10 2023-10-27 平安科技(深圳)有限公司 Training method, device, equipment and storage medium of semantic segmentation network
CN114758128B (en) * 2022-04-11 2024-04-16 西安交通大学 Scene panorama segmentation method and system based on controlled pixel embedding characterization explicit interaction
CN116012843B (en) * 2023-03-24 2023-06-30 北京科技大学 Virtual scene data annotation generation method and system
CN116740360A (en) * 2023-08-10 2023-09-12 荣耀终端有限公司 Image processing method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529508A (en) * 2016-12-07 2017-03-22 西安电子科技大学 Local and non-local multi-feature semantics-based hyperspectral image classification method
CN109063723A (en) * 2018-06-11 2018-12-21 清华大学 The Weakly supervised image, semantic dividing method of object common trait is excavated based on iteration
CN110084136A (en) * 2019-04-04 2019-08-02 北京工业大学 Context based on super-pixel CRF model optimizes indoor scene semanteme marking method
CN110163239A (en) * 2019-01-25 2019-08-23 太原理工大学 A kind of Weakly supervised image, semantic dividing method based on super-pixel and condition random field

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9396546B2 (en) * 2014-01-21 2016-07-19 Adobe Systems Incorporated Labeling objects in image scenes
US10635927B2 (en) * 2017-03-06 2020-04-28 Honda Motor Co., Ltd. Systems for performing semantic segmentation and methods thereof
CN107169487B (en) * 2017-04-19 2020-02-07 西安电子科技大学 Salient object detection method based on superpixel segmentation and depth feature positioning
CN110378911B (en) * 2019-07-11 2022-06-21 太原科技大学 Weak supervision image semantic segmentation method based on candidate region and neighborhood classifier
CN110598705B (en) * 2019-09-27 2022-02-22 腾讯科技(深圳)有限公司 Semantic annotation method and device for image

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Weakly-Supervised Image Semantic Segmentation Based on Superpixel Region Merging; Quanchun Jiang et al.; Big Data and Cognitive Computing; 2019-06-10; Vol. 3, No. 31; pp. 1-20 *
A Survey of Content-Based Image Segmentation Methods; Jiang Feng et al.; Journal of Software; 2017-01-31; Vol. 28, No. 1; pp. 160-183 *

Also Published As

Publication number Publication date
CN111259936A (en) 2020-06-09

Similar Documents

Publication Publication Date Title
CN111259936B (en) Image semantic segmentation method and system based on single pixel annotation
He et al. Svga-net: Sparse voxel-graph attention network for 3d object detection from point clouds
Adarsh et al. YOLO v3-Tiny: Object Detection and Recognition using one stage improved model
Melekhov et al. Dgc-net: Dense geometric correspondence network
CN106875406B (en) Image-guided video semantic object segmentation method and device
Li et al. A deep learning method of water body extraction from high resolution remote sensing images with multisensors
Dornaika et al. Building detection from orthophotos using a machine learning approach: An empirical study on image segmentation and descriptors
CN109325484B (en) Flower image classification method based on background prior significance
CN113449594B (en) Multilayer network combined remote sensing image ground semantic segmentation and area calculation method
Huang et al. GraNet: Global relation-aware attentional network for semantic segmentation of ALS point clouds
CN114255238A (en) Three-dimensional point cloud scene segmentation method and system fusing image features
He et al. Robust road detection from a single image using road shape prior
CN106557579A (en) A kind of vehicle model searching system and method based on convolutional neural networks
Luo et al. Cross-spatiotemporal land-cover classification from VHR remote sensing images with deep learning based domain adaptation
CN105046714A (en) Unsupervised image segmentation method based on super pixels and target discovering mechanism
CN105740915A (en) Cooperation segmentation method fusing perception information
CN113988147B (en) Multi-label classification method and device for remote sensing image scene based on graph network, and multi-label retrieval method and device
Li et al. MVF-CNN: Fusion of multilevel features for large-scale point cloud classification
CN109961437A (en) A kind of conspicuousness fabric defect detection method under the mode based on machine teaching
CN114283162A (en) Real scene image segmentation method based on contrast self-supervision learning
CN113592894A (en) Image segmentation method based on bounding box and co-occurrence feature prediction
CN113657414B (en) Object identification method
CN114529832A (en) Method and device for training preset remote sensing image overlapping shadow segmentation model
CN112446417B (en) Spindle-shaped fruit image segmentation method and system based on multilayer superpixel segmentation
CN104408158A (en) Viewpoint tracking method based on geometrical reconstruction and semantic integration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant