CN111259936B - Image semantic segmentation method and system based on single pixel annotation - Google Patents

Image semantic segmentation method and system based on single pixel annotation

Info

Publication number
CN111259936B
CN111259936B (application CN202010023166.0A)
Authority
CN
China
Prior art keywords
category
similarity
image
pixel
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010023166.0A
Other languages
Chinese (zh)
Other versions
CN111259936A (en)
Inventor
马惠敏
李熹
储华珍
陈衍先
易生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
University of Science and Technology Beijing USTB
Original Assignee
Tsinghua University
University of Science and Technology Beijing USTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University and University of Science and Technology Beijing (USTB)
Priority to CN202010023166.0A
Publication of CN111259936A
Application granted
Publication of CN111259936B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Abstract

The invention provides an image semantic segmentation method and system based on single-pixel labeling, wherein the method comprises the following steps: coding each category with apparent features and semantic features respectively, based on the label of a single pixel for each category; calculating the similarity between each superpixel of the training image and each category based on the feature expression of each category; updating the similarity results with image context information and a driving-scene position prior to generate initial supervision seeds; training a semantic segmentation network with the initial supervision seeds, learning shared same-class features across different instances, and updating the similarity of each superpixel to each category; iteratively executing the supervision-seed generation and similarity-update process until convergence; and storing the converged semantic segmentation network. The invention provides a feasible strategy for the weakly supervised semantic segmentation task in driving scenes and has broad application prospects in automatic driving and related scenarios.

Description

Image semantic segmentation method and system based on single pixel annotation
Technical Field
The invention relates to the technical field of pattern recognition, in particular to a method and a system for image semantic segmentation based on single pixel labeling.
Background
In the field of artificial intelligence and computer vision, image semantic segmentation is an important research area; the task aims to assign a pixel-level class label to every pixel of an image and thereby realize image understanding.
In recent years, the image understanding task for driving scenes has attracted extensive research attention in China and abroad, and increasingly impressive performance has been achieved under fully supervised conditions. These methods rely on a large number of high-precision pixel-level manual labels to train deep neural networks. However, because they depend on such extensive data labeling, model performance is limited by the collected data and often lacks sufficient generalization. When a new scene is encountered, new data must be collected and labeled, which further limits the application of these methods to driving scenes.
On the other hand, weakly supervised learning offers a lightweight alternative: a semantic segmentation network can be trained without a large number of pixel-level image labels, which gives it broad application prospects in many fields, with automatic driving as a representative example. Existing weak supervision labeling schemes mainly provide image-level or bounding-box-level labels for each category; such labels make the semantic segmentation task solvable for natural scene images that contain only a small number of object categories. However, for complex driving scenes containing a large number of categories, the existing weak supervision labeling schemes are not lightweight enough and provide little help for learning each category. Therefore, a lighter and more reasonable weak supervision labeling scheme for complex driving scenes is of great significance.
Under the combined constraints of weak supervision and complex driving scenes, the difficulty of algorithm design and training rises significantly. How to encode optimal features for each category, and how to achieve reliable pixel-level segmentation by exploiting the prior on where each category appears in a driving scene together with the corresponding features, are the urgent problems of weakly supervised semantic segmentation for driving scenes.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an image semantic segmentation method and system based on single-pixel labeling, so as to address weak supervision labeling for complex driving scenes and the feature coding of each category, and to achieve reliable pixel-level semantic segmentation under the constraints of weak supervision and complex driving scenes.
In order to solve the technical problems, the invention provides the following technical scheme:
a method for semantic segmentation of an image based on single-pixel labeling, the method comprising:
step one, coding each category with apparent features and semantic features respectively, based on the label of a single pixel for each category, and establishing a feature expression for each category;
step two, performing superpixel division on the training image, and calculating the similarity of each superpixel of the training image to each category based on the feature expression of each category;
step three, taking the similarity of each super pixel and each category as an initial condition, updating a similarity calculation result by using image context information and driving scene position prior, and generating an initial supervision seed;
step four, training a semantic segmentation network by using the initial supervision seeds, learning shared same-class features across different instances, and providing an image semantic segmentation result for updating the similarity of each superpixel to each category;
step five, iteratively executing the step three to the step four until the semantic segmentation performance of the semantic segmentation network is converged; and storing the semantic segmentation network obtained by the final training for semantic segmentation of the new image.
Further, the labels of single pixels for each category are provided as follows: for each category, only one training image containing that category is selected from the training image set, and only one pixel belonging to the category is labeled in it.
Further, the attributes of the categories include object and scene; when the categories are coded, object categories are represented by semantic features and scene categories are represented by apparent features.
Further, the semantic features are obtained by segmenting the image to be processed into a preset number of fragments, extracting features from each fragment with a pre-trained category activation mapping network model, and finally assembling a semantic feature map of preset dimensions covering the full image, so that each object is represented as one semantic feature vector; the apparent features encode color features and texture features into 96 and 32 dimensions respectively, and represent each scene as several groups of color features and texture features.
Further, when the attribute of the category is an object, the encoding process of the category includes:
dividing the image to be processed into 15 equally sized fragments, encoding each fragment into a 16 × 16 × 1000 feature map with the mapping network model, and normalizing along the 1000-dimensional feature axis; for each pixel in the image, calculating the distance between the pixel coordinate and the center coordinates of the 15 fragments, and using the 1000-dimensional feature at the corresponding position in the nearest fragment as the semantic heat-map response of that pixel;
dividing the image to be processed into several superpixels with a superpixel segmentation method, and using the average semantic heat-map response of all pixels contained in a superpixel as its semantic feature F(sp_i);
for the pixel labeled with the category, taking its 1000-dimensional feature vector as the initial class center F_c of the category;
computing the similarity between F(sp_i) and F_c;
selecting the superpixels whose similarity to F_c is within the top 1% as the set Ω_g;
alternately updating F_c and Ω_g with an E-M method until they are stable;
recording the final F_c as the coding feature of the object category.
Further, when the attribute of the category is a scene, the encoding process of the category includes:
calculating the three-channel color features of the image to be processed and the texture features from local binary pattern coding, and normalizing them; dividing the image to be processed into several superpixels with a superpixel segmentation method; for each superpixel, dividing [0,1] into 32 equal intervals in each feature channel and counting the pixel values falling into each interval; each superpixel thus obtains a 96-dimensional color feature and a 32-dimensional texture feature;
calculating the edge features and saliency features of the image to be processed, and calculating the similarity of every superpixel pair;
recording the edge distance measure between every pair of superpixels;
determining the superpixel that contains the labeled pixel of the category, calculating the similarity between every other superpixel in the image and that superpixel, and recording all superpixels whose similarity is greater than 0.5;
for the recorded super pixels, calculating the color feature similarity and the texture feature similarity of every two super pixels, and recording the product of the color feature similarity and the texture feature similarity as the apparent feature similarity of the super pixel pair; dividing the superpixels into G groups by taking 0.5 as a threshold value, and taking the average characteristic of the superpixels in each group as the class center of the group;
alternately updating the grouping of class centers and superpixels by an E-M method until the grouping is stable;
and recording the G finally obtained class centers as the coding feature group of the scene class.
Further, calculating the similarity of each superpixel of the training image to each category includes:
calculating semantic features and apparent features of the image, dividing the image into a plurality of super pixel regions, and generating the semantic features and the apparent features of each super pixel for each super pixel region;
respectively calculating the similarity of each super pixel and each category; for each category belonging to the object, calculating the similarity between the semantic features of the superpixel and the semantic features of the category;
and for each category belonging to the scene, calculating the similarity between the apparent feature of the superpixel and each feature vector in the category coding feature group, and recording the maximum value of the similarity as the similarity between the superpixel and the category.
Further, the initial generation process of the supervision seeds comprises:
calculating edge features and saliency features of the image; wherein the saliency features comprise global saliency features and local saliency features of two partition modes, and the edge features and the three saliency features are used for coding the context similarity measurement of each pair of super pixels, so as to form image context information;
generating a saliency feature of each super pixel and an edge distance metric of each pair of super pixels for each super pixel region based on the super pixel division result obtained in the step two;
recording the similarity vectors of all superpixels in the image to the object categories in matrix form as M_obj, and simultaneously recording the similarity vectors of all superpixels in the image to the scene categories in matrix form as M_sce;
Dividing a driving scene image into four equal areas from top to bottom, and specifying an appearance position range of each category; for each super pixel, only retaining the characteristic dimensions corresponding to the categories in the two similarity vectors according to the object and scene categories defined and contained in the region where the super pixel is located;
recording the maximum value in the two similarity vectors and the category corresponding to the maximum value; for two vectors corresponding to each super pixel, if the maximum value corresponding to the object type is greater than 0.05, recording the super pixel type as the object type; if the maximum value corresponding to the object type is not more than 0.05 and the maximum value corresponding to the scene type is more than 0.05, recording the superpixel type as the scene type; if the two conditions are not met, the super pixel area is not used in training;
recording the category of the superpixel, corresponding the superpixel to the original image, setting all pixels belonging to the position of the superpixel to the same category as the superpixel, obtaining label information of the whole image, and recording the label information as an initial supervision seed.
Further, when training the semantic segmentation network in the fourth step, counting the semantic segmentation result of the pixels in each super-pixel region, and taking the ratio belonging to each category as the similarity of the super-pixel and each category;
and in the iteration process of the step five, replacing the previous similarity with the new similarity obtained in the step four, and alternately iterating the step three to the step four until the semantic segmentation performance of the semantic segmentation network is converged.
Accordingly, in order to solve the above technical problems, the present invention further provides the following technical solutions:
a single-pixel annotation based image semantic segmentation system, the system comprising:
the category coding module is used for respectively coding each category by utilizing the apparent characteristics and the semantic characteristics based on the label of each category single pixel and establishing the characteristic expression of each category;
the similarity calculation module is used for performing superpixel division on the training images and calculating the similarity between each superpixel of the training images and each category based on the feature expression of each category;
the initial supervision seed generation module is used for updating a similarity calculation result by using the similarity of each super pixel and each category as an initial condition and using image context information and driving scene position prior to generate an initial supervision seed;
the semantic segmentation network training module is used for training a semantic segmentation network with the initial supervision seeds, learning shared same-class features across different instances, providing image semantic segmentation results, and updating the similarity of each superpixel to each category;
an iteration module for iteratively executing the initial supervision seed generation module and the semantic segmentation network training module until the semantic segmentation performance of the semantic segmentation network converges;
and the semantic segmentation network storage module is used for storing the semantic segmentation network obtained from the final training iteration, for performing semantic segmentation on new images.
The technical scheme of the invention has the following beneficial effects:
the image semantic segmentation method based on single pixel labeling provides a lightweight labeling condition, and only one pixel point is labeled for each category; the image semantic segmentation performance is iteratively optimized by alternately realizing the region category similarity calculation based on the image context and a large number of example similarity feature learning processes based on a semantic segmentation network, so that the high-precision segmentation of objects in the image in a driving scene is realized under the condition that each category is only labeled by a single pixel; a feasible strategy is provided for the weak supervision semantic segmentation task in the driving scene, and the application of the feasible strategy in the scenes such as automatic driving and the like has wide prospect.
Drawings
FIG. 1 is a schematic flow chart of a single-pixel labeling-based image semantic segmentation method according to the present invention;
FIG. 2 is a sample diagram for obtaining semantic features of a driving scenario according to the present invention;
FIG. 3 is a schematic flow chart of the present invention for generating the encoding feature or feature group of each category from the single pixel label of the category;
FIG. 4 is a sample diagram of the present invention for extracting edge features, local and global saliency features of an image;
FIG. 5 is a schematic diagram of the area division of the driving scene and the occurrence position range of each category;
FIG. 6 is a graph showing how the segmentation performance of the image semantic segmentation method based on single-pixel labeling grows with the number of iterations.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
First embodiment
Referring to fig. 1 to fig. 6, this embodiment provides an image semantic segmentation method based on single-pixel labeling, which includes the following steps:
Step one, coding each category with apparent features and semantic features respectively, based on the label of a single pixel for each category, and establishing a feature expression for each category.
The category labels are provided as follows: for a complex driving scene, each category is labeled with a single pixel; only one training image containing the category is selected from the training image set and only one pixel belonging to the category is labeled in it, which constitutes the labeling condition of the task.
Specifically, for an image semantic segmentation task in a driving scene, suppose the labels to be provided cover C categories. For a training set containing N images, this embodiment selects a subset of the images such that every category is covered by at least one sample in the selection; then, for each category, one pixel in the selected samples is given a category label. That is, for an image semantic segmentation task with C categories and a training set of N images, this embodiment selects only one pixel per category for labeling, so that in the entire training set only C pixels, belonging to K images (K ≤ C, K ≤ N), carry category labels.
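A hypothetical illustration of how light this supervision is: the entire annotation for the training set can be held in a small mapping from category name to one labeled pixel. All file names and coordinates below are made up for illustration and do not come from the patent.

```python
# Hypothetical annotation container: for C categories spread over K images,
# the whole weak supervision is just C labeled pixels.
single_pixel_annotations = {
    "car":    {"image": "train_0001.png", "pixel": (412, 655)},   # (row, col)
    "person": {"image": "train_0017.png", "pixel": (300, 120)},
    "road":   {"image": "train_0001.png", "pixel": (680, 512)},
    "sky":    {"image": "train_0093.png", "pixel": (40, 900)},
    # ... one entry per category, C entries in total
}
```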
Based on this labeling condition, each category is assigned one of two attributes, object or scene, and the optimal feature expression of each category is coded. For a driving scene containing C categories, this embodiment first defines, based on the characteristics of each category, instances such as vehicles and pedestrians as objects, and non-instance regions such as roads and sky as scenes. By analyzing the feature expression of the single labeled pixel of each category, the two attribute types are coded with different, suitable features, and the feature expressions of the C categories are established. Object categories are represented by semantic features; scene categories are represented by apparent features. The semantic features are obtained by segmenting the image to be processed into a preset number of fragments, extracting features from each fragment with a pre-trained category activation mapping network model, and finally assembling a semantic feature map of preset dimensions covering the full image, so that each object is represented as one semantic feature vector. The apparent features encode color features and texture features into 96 and 32 dimensions respectively, and represent each scene as several groups of color and texture features.
Specifically, the coding flow adopted in this embodiment is shown in fig. 3. First, for each image containing labeled pixels, the attribute of the category of each labeled pixel it contains is determined. If the category belongs to an object, the following steps are performed to calculate the coding feature representing the category:
S1, a category activation mapping (CAM) network pre-trained on ImageNet is used with a multi-overlapping-fragment fusion strategy to obtain a heat map encoding the semantic features of the image; an example is shown in FIG. 2. First, this embodiment divides the image into 15 regions of equal size. For an image of length L and width W, each region has length L/2 and width W/3; the top-left pixels of the 15 regions have longitudinal coordinates (0, L/4, L/2) and transverse coordinates (0, W/6, W/3, W/2, 2W/3). Each region is encoded by the CAM into a 16 × 16 × 1000 feature map, which this embodiment normalizes along the 1000-dimensional feature axis. Then, for each pixel in the image, the distance between the pixel coordinate and the center coordinates of the 15 region fragments is calculated, and the 1000-dimensional feature at the corresponding position in the nearest fragment is used as the semantic heat-map response of that pixel (a sketch follows after these steps);
S2, the image is divided into K superpixels with a superpixel segmentation method. For each superpixel sp_i, the average of the semantic heat-map responses of all pixels it contains is used as its semantic feature F(sp_i);
S3, for the labeled pixel, the corresponding 1000-dimensional feature vector is taken as the initial class center of the category, denoted F_c. Using (formula 1), with F(sp_i) and F_c in the roles of X_i and X_j, the similarity between them is calculated;
(formula 1: similarity measure between two feature vectors X_i and X_j; the original expression appears only as an image and is not reproduced here)
S4, the superpixels with the highest similarity to the class center (the top 1%) are selected as the set Ω_g;
S5, using (formula 2), the class center F_c and the selected set Ω_g are alternately updated with an E-M method until they are stable;
(formula 2: E-M update of the class center from the selected superpixels; the original expression appears only as an image and is not reproduced here)
S6, the finally obtained class center F_c is recorded as the coding feature of the object category.
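The two sketches below illustrate, in Python, the fragment tiling and per-pixel heat-map assignment of S1 and the E-M estimation of the object class center of S3 to S6. They are minimal sketches rather than the patent's implementation: the CAM network is passed in as an opaque cam_forward function, and cosine similarity stands in for (formula 1), whose exact form is not reproduced in this text.

```python
import numpy as np

def fragment_origins(L, W):
    """Top-left corners of the 15 overlapping fragments (3 vertical x 5 horizontal starts),
    each of size (L/2) x (W/3), as described in S1."""
    ys = [0, L // 4, L // 2]
    xs = [0, W // 6, W // 3, W // 2, 2 * W // 3]
    return [(y, x) for y in ys for x in xs]

def per_pixel_semantic_feature(image, cam_forward):
    """For every pixel, use the 1000-d CAM feature of the nearest fragment
    (nearest by distance between the pixel and the fragment center)."""
    L, W = image.shape[:2]
    fh, fw = L // 2, W // 3
    origins = fragment_origins(L, W)
    centers = np.array([(y + fh / 2.0, x + fw / 2.0) for y, x in origins])

    frag_feats = []
    for y, x in origins:
        f = cam_forward(image[y:y + fh, x:x + fw])              # assumed to return (16, 16, 1000)
        f = f / (np.abs(f).sum(axis=-1, keepdims=True) + 1e-8)  # normalise the 1000-d axis
        frag_feats.append(f)

    heat = np.zeros((L, W, 1000), dtype=np.float32)
    ys, xs = np.mgrid[0:L, 0:W]
    d2 = (ys[..., None] - centers[:, 0]) ** 2 + (xs[..., None] - centers[:, 1]) ** 2
    nearest = d2.argmin(axis=-1)                                 # closest fragment index per pixel
    for k, (y, x) in enumerate(origins):
        py, px = np.nonzero(nearest == k)
        gy = np.clip((py - y) * 16 // fh, 0, 15)                 # map into the 16 x 16 feature grid
        gx = np.clip((px - x) * 16 // fw, 0, 15)
        heat[py, px] = frag_feats[k][gy, gx]
    return heat
```

```python
import numpy as np

def cosine_sim(a, b):
    """Stand-in for (formula 1); the patent's exact similarity measure is not reproduced."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def em_object_class_center(sp_feats, f_init, top_ratio=0.01, max_iter=50):
    """S3-S6: alternately update the class center F_c and the top-1% set Omega_g until stable.
    sp_feats: (K, 1000) superpixel semantic features; f_init: feature of the labeled pixel."""
    f_c, omega = f_init.copy(), None
    k_sel = max(1, int(round(top_ratio * len(sp_feats))))
    for _ in range(max_iter):
        sims = np.array([cosine_sim(f, f_c) for f in sp_feats])
        new_omega = set(np.argsort(-sims)[:k_sel].tolist())      # E-step: pick the top-1% superpixels
        if new_omega == omega:                                   # stable: the selection stops changing
            break
        omega = new_omega
        f_c = sp_feats[sorted(omega)].mean(axis=0)               # M-step: re-estimate the class center
    return f_c, omega
```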
If the category belongs to a scene, the following steps are performed to calculate the coding feature group representing the category:
S1, the three-channel color features of the image and the texture features from local binary pattern (LBP) coding are calculated and normalized. The image is divided into K superpixels with a superpixel segmentation method. For each superpixel sp_i, in each feature channel, this embodiment divides [0,1] into 32 equal intervals and counts the pixel values falling into each interval. Thus, each superpixel obtains a 96-dimensional color feature and a 32-dimensional texture feature;
S2, the edge features and saliency features of the image are calculated. Specifically, when calculating the saliency features, this embodiment on the one hand extracts saliency features directly from the original image and, on the other hand, adopts two different modes to segment the image into several fragments, extracts saliency features for each fragment, and stitches the fragment feature maps back into a complete map, as shown in the example of fig. 4. In the first mode, the image is divided into 5 regions: the upper 1/4 region, the lower 1/4 region, and the remaining middle region trisected from left to right. In the second mode, the image is divided into two equal parts along its length and four equal parts along its width, giving 8 equal-sized fragment regions. Further, this embodiment uses (formula 1) to calculate the similarity of every superpixel pair sp_i and sp_j, recorded as Sim_g(i, j) for the global saliency feature and as Sim_l1(i, j) and Sim_l2(i, j) for the two local saliency features;
S3, this embodiment records an edge distance measure between every two superpixels, defined as E(i, j) = exp(-3.5 · s), where s is the sum of the values, on the edge feature map, of the pixels lying on the line connecting the centers of the two superpixels. Note that on the edge feature map a larger pixel value indicates a more prominent edge; therefore, under this definition, the larger the edge distance measure of two superpixels, the weaker the edge between them;
S4, the superpixel containing the labeled pixel is denoted sp_anno. The similarity between every other superpixel in the image and sp_anno is calculated with (formula 3), and all superpixels with similarity greater than 0.5 are recorded;
(formula 3: similarity between sp_anno and another superpixel; the original expression appears only as an image and is not reproduced here)
S5, for the recorded superpixels, the color feature similarity and texture feature similarity of every pair are calculated with (formula 1), and their product is recorded as the apparent feature similarity of the superpixel pair. Using 0.5 as a threshold, the superpixels are divided into G groups, and the average feature of the superpixels in each group is taken as the initial class center F_g of the group;
S6, using (formula 2), the class centers F_g and the grouping of the superpixels are alternately updated with an E-M method until they are stable;
S7, the finally obtained G class centers {F_1, ..., F_G} are recorded as the coding feature group of the scene category.
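The sketch below shows, in Python, the per-superpixel color and texture histograms of S1 and the edge distance measure of S3. It is a minimal sketch under stated assumptions: the color image and LBP map are assumed to be normalized to [0, 1], and the line between superpixel centers is rasterized by uniform sampling, which the patent does not specify.

```python
import numpy as np

def superpixel_histograms(color, lbp, sp_labels, n_sp, bins=32):
    """S1: a 96-d color histogram (32 bins x 3 channels) and a 32-d LBP texture histogram
    per superpixel; `color` (H, W, 3) and `lbp` (H, W) are assumed normalised to [0, 1]."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    color_feat = np.zeros((n_sp, 3 * bins))
    tex_feat = np.zeros((n_sp, bins))
    for i in range(n_sp):
        mask = sp_labels == i
        for c in range(3):
            color_feat[i, c * bins:(c + 1) * bins], _ = np.histogram(color[..., c][mask], bins=edges)
        tex_feat[i], _ = np.histogram(lbp[mask], bins=edges)
    return color_feat, tex_feat

def edge_distance(edge_map, center_i, center_j, scale=3.5):
    """S3: E(i, j) = exp(-3.5 * sum of edge-map values along the line between the two centers)."""
    (yi, xi), (yj, xj) = center_i, center_j
    n = int(max(abs(yj - yi), abs(xj - xi))) + 1
    ys = np.linspace(yi, yj, n).round().astype(int)
    xs = np.linspace(xi, xj, n).round().astype(int)
    return float(np.exp(-scale * edge_map[ys, xs].sum()))
```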
Step two, performing superpixel division on the training image, and calculating the similarity of each superpixel of the training image to each category based on the feature expression of each category;
specifically, for each image in the training set, the following steps are taken in the embodiment:
s1, calculating semantic features and apparent features of the image, dividing the image into a plurality of super-pixel regions, and generating the semantic features and the apparent features of each super-pixel in the embodiment by adopting the same method as the step I for each super-pixel region;
s2, calculating each superpixel sp respectivelyiSimilarity to objects and scene categories. For each class belonging to the object, the similarity between the semantic feature of the super pixel and the semantic feature of each class is calculated by using (formula 1), and the similarity is compared with all the semantic features of all the classesThe similarity of object classes is recorded as a vector
Figure BDA0002361522860000091
S3, for each category belonging to the scene, calculating the similarity between the apparent feature of the super pixel and each feature vector in the feature group of the category by using (formula 1), recording the maximum value as the similarity between the super pixel and the category, and finally recording the similarity between the super pixel and all the scene categories as vectors
Figure BDA0002361522860000092
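A minimal sketch of S2 and S3, assuming the class centers produced in step one are available as plain dictionaries; the (formula 1) similarity is passed in as a function because its exact form is not reproduced in this text.

```python
def category_similarity_vectors(sp_sem, sp_app, obj_centers, scene_groups, sim):
    """Similarity of one superpixel to every object and scene category (S2-S3).
    obj_centers: {name: 1000-d semantic center}; scene_groups: {name: list of apparent centers};
    sim(a, b): the (formula 1) similarity between two feature vectors."""
    sim_obj = {c: sim(sp_sem, f_c) for c, f_c in obj_centers.items()}
    # for a scene category, keep the maximum similarity over its G class centers
    sim_sce = {c: max(sim(sp_app, f_g) for f_g in group) for c, group in scene_groups.items()}
    return sim_obj, sim_sce
```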
Step three, taking the similarity of each super pixel and each category as an initial condition, updating a similarity calculation result by using image context information and driving scene position prior, and generating an initial supervision seed;
specifically, for each image in the training set, the following steps are taken in the embodiment:
and S1, calculating the edge characteristics and the salient characteristics of the image. Here, the way of calculating the edge, salient feature and step one are exactly the same.
S2, based on the super-pixel division result in the step two, for each super-pixel area, generating the significance characteristic of each super-pixel and the edge distance measurement of each pair of super-pixels by adopting the same method in the step one;
s3, recording similarity vectors of all K superpixels in the image to the object as a matrix form MobjEach column of the matrix represents the similarity of one superpixel to each object class, and each row represents the similarity of all superpixels to one object class. Processing the scene similarity in the same way to obtain a matrix Msce. And (3) calculating to obtain a context similarity matrix between every two super pixels.
S4, using (formula 4), updating matrix MobjAnd MsceTo obtain a matrix
Figure BDA0002361522860000093
And
Figure BDA0002361522860000094
Figure BDA0002361522860000095
s5, for each super pixel spiThe embodiment records the updated object similarity vector and scene similarity vector;
s6, in the driving scene, each category has its area in which it appears intensively, and for one image, the present embodiment divides it into four equal areas from top to bottom, and specifies the range of the appearance position of each category, as shown in the schematic diagram 5. For each super-pixel, the present embodiment only retains the feature dimensions corresponding to the object and scene categories in the two similarity vectors according to the object and scene categories defined and contained in the region where the super-pixel is located.
S7, for the two processed vectors, the present embodiment records the maximum value and the object (or scene) class corresponding to the maximum value, and defines the object class of the super pixel as Clsobj(spi) The scene category is Clssce(spi). Here, for two vectors corresponding to each superpixel, if the maximum value corresponding to the object class is greater than 0.05, recording the superpixel class as the object class; if the maximum value corresponding to the object type is not more than 0.05 and the maximum value corresponding to the scene type is more than 0.05, recording the superpixel type as the scene type; if the two conditions are not met, the super pixel area is not used in training; that is, if the maximum value corresponding to the object (or scene) category is less than 0.05, the present embodiment records the object (or scene) category as 255, that is, it is not used during training.
S8, the present embodiment records the final classification of the superpixel using (equation 5). Further, the superpixels are corresponded to the positions of the superpixels in the image, all pixels belonging to the superpixel positions are set as class labels the same as the superpixels, label information of the whole image is obtained and recorded as initial supervision seeds, and the initial supervision seeds are used for training of the semantic segmentation network.
Figure BDA0002361522860000101
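A minimal Python sketch of the decision rule of S6 to S8, assuming the per-superpixel similarities and the per-region allowed category sets are given as plain dictionaries; names such as allowed_obj are illustrative and not from the patent.

```python
IGNORE = 255  # superpixels that satisfy neither condition are ignored during training (S7)

def assign_superpixel_label(sim_obj, sim_sce, allowed_obj, allowed_sce, thr=0.05):
    """S6-S8: apply the position prior, then prefer an object category over a scene
    category using the 0.05 threshold; otherwise mark the superpixel as ignored.
    sim_obj / sim_sce: {category: similarity}; allowed_*: categories allowed in this region."""
    obj = {c: s for c, s in sim_obj.items() if c in allowed_obj}
    sce = {c: s for c, s in sim_sce.items() if c in allowed_sce}
    best_obj = max(obj, key=obj.get) if obj else None
    best_sce = max(sce, key=sce.get) if sce else None
    if best_obj is not None and obj[best_obj] > thr:
        return best_obj
    if best_sce is not None and sce[best_sce] > thr:
        return best_sce
    return IGNORE
```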
Step four, training a semantic segmentation network with the initial supervision seeds, learning shared same-class features across different instances, and providing image semantic segmentation results for updating the similarity of each superpixel to each category;
when the semantic segmentation network is trained, the process is realized by utilizing all training images and labels generated by corresponding images in a consistent manner by adopting a semantic segmentation method under the condition of full supervision learning.
The specific training process is as follows:
given N training images IiAnd label G correspondingly generated by the N pieces obtained in the step threeiTraining a segmentation network f (theta, I) with theta as a parameter, the output of the network representing the probability that the label y of the pixel pix belongs to the class c, i.e. fpix(θ,I)=Ppix(y ═ c | I). The loss function adopted in the training is a cross entropy function.
Step five, iteratively executing step three to step four until the segmentation performance of the semantic segmentation network converges;
and in the iteration process of the step five, replacing the previous similarity with the new similarity obtained in the step four, and alternately iterating the step three to the step four until the semantic segmentation performance of the semantic segmentation network is converged.
The specific method is as follows: for each superpixel region, the semantic segmentation results of the pixels inside the superpixel are counted, and the ratio belonging to each category is taken as the similarity between the superpixel and that category. It should be noted that, thanks to the context-information correction in step three and the same-class feature learning of the semantic segmentation network in step four, the accuracy of the class similarity provided here is significantly higher than that of the initial class similarity provided in step two.
Steps three and four are executed repeatedly and alternately, fully fusing the information of the same object based on the image context and across different instances between images, so that increasingly accurate semantic segmentation results are obtained. Fig. 6 shows the growth curve of the segmentation performance of this embodiment on the Cityscapes semantic segmentation data set as the number of iterations increases.
Step six, storing the segmentation network obtained by the final training iteration for semantic segmentation of new images.
It should be noted that, after the training process of steps one to five, only the semantic segmentation network obtained at the end of step five is needed for semantic segmentation inference on new images in practical applications, so the algorithm is highly efficient in practice.
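A minimal inference sketch, under the same PyTorch-style assumptions as the training sketch above.

```python
import torch

def segment_image(model, image_tensor, device="cuda"):
    """Step six: per-pixel argmax over the stored network's class scores."""
    model.eval().to(device)
    with torch.no_grad():
        logits = model(image_tensor.unsqueeze(0).to(device))   # (1, C, H, W)
    return logits.argmax(dim=1).squeeze(0).cpu()               # (H, W) predicted class map
```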
The image semantic segmentation method based on single-pixel labeling provides a lightweight labeling condition: only one pixel is labeled for each category. By alternately performing region-category similarity calculation based on image context and large-scale instance feature learning based on a semantic segmentation network, the image semantic segmentation performance is iteratively optimized, so that high-precision segmentation of objects in driving-scene images is achieved even though each category is labeled with only a single pixel. This provides a feasible strategy for the weakly supervised semantic segmentation task in driving scenes and has broad application prospects in automatic driving and related scenarios.
Second embodiment
The embodiment provides an image semantic segmentation system based on single pixel annotation, which includes:
the category coding module is used for respectively coding each category by utilizing the apparent characteristics and the semantic characteristics based on the label of each category single pixel and establishing the characteristic expression of each category;
the similarity calculation module is used for performing superpixel division on the training images and calculating the similarity between each superpixel of the training images and each category based on the feature expression of each category;
the initial supervision seed generation module is used for updating a similarity calculation result by using the similarity of each super pixel and each category as an initial condition and using image context information and driving scene position prior to generate an initial supervision seed;
the semantic segmentation network training module is used for training a semantic segmentation network with the initial supervision seeds, learning shared same-class features across different instances, providing image semantic segmentation results, and updating the similarity of each superpixel to each category;
an iteration module for iteratively executing the initial supervision seed generation module and the semantic segmentation network training module until the semantic segmentation performance of the semantic segmentation network converges;
and the semantic segmentation network storage module is used for storing the semantic segmentation network obtained from the final training iteration, for performing semantic segmentation on new images.
The image semantic segmentation system based on single-pixel labeling of this embodiment corresponds to the image semantic segmentation method based on single-pixel labeling of the first embodiment; the functions realized by the functional modules of the system correspond one to one to the steps of the method; therefore, they are not described again here.
Furthermore, it should be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied in the medium.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
Finally, it should be noted that while the above describes a preferred embodiment of the invention, it will be appreciated by those skilled in the art that, once the basic inventive concepts have been learned, numerous changes and modifications may be made without departing from the principles of the invention, which shall be deemed to be within the scope of the invention. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Claims (6)

1. A semantic segmentation method for an image based on single pixel labeling is characterized by comprising the following steps:
step one, respectively coding each category with apparent features and semantic features, based on the label of a single pixel for each category, and establishing a feature expression for each category;
step two, performing superpixel division on the training image, and calculating the similarity of each superpixel of the training image to each category based on the feature expression of each category;
step three, taking the similarity of each super pixel and each category as an initial condition, updating a similarity calculation result by using image context information and driving scene position prior, and generating an initial supervision seed;
step four, training a semantic segmentation network by using the initial supervision seeds, learning shared same-class features across different instances, and providing an image semantic segmentation result for updating the similarity of each superpixel to each category;
step five, iteratively executing the step three to the step four until the semantic segmentation performance of the semantic segmentation network is converged; storing the semantic segmentation network obtained by the final training for carrying out semantic segmentation on the new image;
attributes of the categories include object and scene; when the categories are coded, object categories are represented by semantic features and scene categories are represented by apparent features; the semantic features are obtained by segmenting the image to be processed into a preset number of fragments, extracting features from each fragment with a pre-trained category activation mapping network model, and finally assembling a semantic feature map of preset dimensions covering the full image, so that each object is represented as one semantic feature vector; the apparent features encode color features and texture features into 96 and 32 dimensions respectively, and represent each scene as several groups of color features and texture features;
when the attribute of the category is the object, the encoding process of the category comprises the following steps:
dividing the image to be processed into 15 equally sized fragments, encoding each fragment into a 16 × 16 × 1000 feature map with the mapping network model, and normalizing along the 1000-dimensional feature axis; for each pixel in the image, calculating the distance between the pixel coordinate and the center coordinates of the 15 fragments, and using the 1000-dimensional feature at the corresponding position in the nearest fragment as the semantic heat-map response of that pixel;
dividing the image to be processed into several superpixels with a superpixel segmentation method, and using the average semantic heat-map response of all pixels contained in a superpixel as its semantic feature F(sp_i);
for the pixel labeled with the category, taking its 1000-dimensional feature vector as the initial class center F_c of the category;
computing the similarity between F(sp_i) and F_c;
selecting the superpixels whose similarity to F_c is within the top 1% as the set Ω_g;
alternately updating F_c and Ω_g with an E-M method until they are stable;
recording the final F_c as the coding feature of the object category;
when the attribute of the category is a scene, the encoding process of the category comprises the following steps:
calculating the three-channel color features of the image to be processed and the texture features from local binary pattern coding, and normalizing them; dividing the image to be processed into several superpixels with a superpixel segmentation method; for each superpixel, dividing [0,1] into 32 equal intervals in each feature channel and counting the pixel values falling into each interval; each superpixel thus obtains a 96-dimensional color feature and a 32-dimensional texture feature;
calculating the edge features and saliency features of the image to be processed, and calculating the similarity of every superpixel pair;
recording the edge distance measure between every pair of superpixels;
determining the superpixel that contains the labeled pixel of the category, calculating the similarity between every other superpixel in the image and that superpixel, and recording all superpixels whose similarity is greater than 0.5;
for the recorded super pixels, calculating the color feature similarity and the texture feature similarity of every two super pixels, and recording the product of the color feature similarity and the texture feature similarity as the apparent feature similarity of the super pixel pair; dividing the superpixels into G groups by taking 0.5 as a threshold value, and taking the average characteristic of the superpixels in each group as the class center of the group;
alternately updating the grouping of class centers and superpixels by an E-M method until the grouping is stable;
and recording the G finally obtained class centers as the coding feature group of the scene class.
2. The method for semantic segmentation of images based on single-pixel labeling according to claim 1, wherein the labels of single pixels for each category are provided as follows: for each category, only one training image containing that category is selected from the training image set, and only one pixel belonging to the category is labeled.
3. The method for semantic segmentation of images based on single-pixel labeling according to claim 1, wherein calculating the similarity of each superpixel of the training images to each category comprises:
calculating semantic features and apparent features of the image, dividing the image into a plurality of super pixel regions, and generating the semantic features and the apparent features of each super pixel for each super pixel region;
respectively calculating the similarity of each super pixel and each category; for each category belonging to the object, calculating the similarity between the semantic features of the superpixel and the semantic features of the category;
and for each category belonging to the scene, calculating the similarity between the apparent feature of the superpixel and each feature vector in the category coding feature group, and recording the maximum value of the similarity as the similarity between the superpixel and the category.
4. The method for semantic segmentation of images based on single-pixel labeling according to claim 1, wherein the generation process of the initial supervision seed comprises:
calculating edge features and saliency features of the image; wherein the saliency features comprise global saliency features and local saliency features of two partition modes, and the edge features and the three saliency features are used for coding the context similarity measurement of each pair of super pixels, so as to form image context information;
generating a saliency feature of each super pixel and an edge distance metric of each pair of super pixels for each super pixel region based on the super pixel division result obtained in the step two;
recording the similarity vectors of all superpixels in the image to the object categories in matrix form as M_obj, and simultaneously recording the similarity vectors of all superpixels in the image to the scene categories in matrix form as M_sce;
Dividing a driving scene image into four equal areas from top to bottom, and specifying an appearance position range of each category; for each super pixel, only retaining the characteristic dimensions corresponding to the categories in the two similarity vectors according to the object and scene categories defined and contained in the region where the super pixel is located;
recording the maximum value in the two similarity vectors and the category corresponding to the maximum value; for two vectors corresponding to each super pixel, if the maximum value corresponding to the object type is greater than 0.05, recording the super pixel type as the object type; if the maximum value corresponding to the object type is not more than 0.05 and the maximum value corresponding to the scene type is more than 0.05, recording the superpixel type as the scene type; if the two conditions are not met, the super pixel area is not used in training;
recording the category of the superpixel, corresponding the superpixel to the original image, setting all pixels belonging to the position of the superpixel to the same category as the superpixel, obtaining label information of the whole image, and recording the label information as an initial supervision seed.
5. The image semantic segmentation method based on single pixel labeling according to claim 1, wherein when training the semantic segmentation network in the fourth step, the semantic segmentation result of the pixels in each super-pixel region is counted, and the ratio belonging to each category is used as the similarity between the super-pixel and each category;
and in the iteration process of the step five, replacing the previous similarity with the new similarity obtained in the step four, and alternately iterating the step three to the step four until the semantic segmentation performance of the semantic segmentation network is converged.
6. An image semantic segmentation system based on single pixel labeling, comprising:
the category coding module is used for respectively coding each category by utilizing the apparent characteristics and the semantic characteristics based on the label of each category single pixel and establishing the characteristic expression of each category;
the similarity calculation module is used for performing superpixel division on the training images and calculating the similarity between each superpixel of the training images and each category based on the feature expression of each category;
the initial supervision seed generation module is used for updating a similarity calculation result by using the similarity of each super pixel and each category as an initial condition and using image context information and driving scene position prior to generate an initial supervision seed;
the semantic segmentation network training module is used for training a semantic segmentation network with the initial supervision seeds, learning shared same-class features across different instances, providing image semantic segmentation results, and updating the similarity of each superpixel to each category;
an iteration module for iteratively executing the initial supervision seed generation module and the semantic segmentation network training module until the semantic segmentation performance of the semantic segmentation network converges;
the semantic segmentation network storage module is used for storing a semantic segmentation network obtained by the final training and is used for carrying out semantic segmentation on a new image;
attributes of the categories include object and scene; when the categories are coded, object categories are represented by semantic features and scene categories are represented by apparent features; the semantic features are obtained by segmenting the image to be processed into a preset number of fragments, extracting features from each fragment with a pre-trained category activation mapping network model, and finally assembling a semantic feature map of preset dimensions covering the full image, so that each object is represented as one semantic feature vector; the apparent features encode color features and texture features into 96 and 32 dimensions respectively, and represent each scene as several groups of color features and texture features;
when the attribute of the category is the object, the encoding process of the category comprises the following steps:
the image to be processed is divided into 15 fragments with equal size, each fragment is coded into a characteristic diagram with dimensions of 16 multiplied by 1000 through a mapping network model, and the characteristic dimension with dimensions of 1000 is normalized; for each pixel in the image, calculating the distance between the coordinate of the pixel and the central coordinates of 15 fragments, and using the 1000-dimensional feature at the position corresponding to the pixel in the fragment closest to the pixel as the semantic heat map response of the pixel;
dividing the image to be processed into a plurality of superpixels with a superpixel segmentation method, and taking the mean of the semantic heat map responses of all pixels contained in each superpixel as the semantic feature of that superpixel;
for the pixel labeled with the category, taking its corresponding 1000-dimensional feature vector as the initial class center of the category;
computing the similarity between the semantic feature of each superpixel and the class center;
selecting the superpixels whose similarity to the class center ranks in the top 1% as the set Ω_g;
alternately updating the class center and Ω_g by an E-M method until they are stable;
recording the final class center as the coded feature of the object category;
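For illustration only, a minimal sketch of the E-M style alternation described above, assuming cosine similarity on L2-normalized 1000-dimensional CAM features; `sp_features` (one semantic feature per superpixel) and `seed_feature` (the feature at the single labeled pixel) are hypothetical inputs, and the similarity measure itself is an assumption, since the claim does not fix it.

```python
# Minimal sketch of refining an object class center by alternating between
# the class center and the set of most similar superpixels (Omega_g).
import numpy as np

def refine_object_class_center(sp_features: np.ndarray,
                               seed_feature: np.ndarray,
                               top_ratio: float = 0.01,
                               max_iters: int = 20) -> np.ndarray:
    feats = sp_features / (np.linalg.norm(sp_features, axis=1, keepdims=True) + 1e-12)
    center = seed_feature / (np.linalg.norm(seed_feature) + 1e-12)
    top_k = max(1, int(round(top_ratio * len(feats))))
    members = None
    for _ in range(max_iters):
        sims = feats @ center                    # E-step: similarity of every superpixel to the center
        new_members = np.argsort(-sims)[:top_k]  # keep the top 1% most similar superpixels (Omega_g)
        if members is not None and set(new_members) == set(members):
            break                                # membership is stable -> stop
        members = new_members
        center = feats[members].mean(axis=0)     # M-step: recompute the class center
        center /= (np.linalg.norm(center) + 1e-12)
    return center
```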
when the attribute of the category is a scene, the encoding process of the category comprises the following steps:
computing the three-channel color features of the image to be processed and the texture features obtained by local binary pattern (LBP) coding, and normalizing them; dividing the image to be processed into a plurality of superpixels with a superpixel segmentation method; for each superpixel, dividing [0,1] into 32 equal intervals in each feature channel and counting the pixel values falling into each interval; thus, each superpixel obtains a 96-dimensional color feature and a 32-dimensional texture feature;
computing the edge features and the saliency features of the image to be processed, and computing the similarity of every pair of superpixels;
recording the edge distance measure between every two superpixels;
determining the superpixels that contain the labeled pixels of the category, computing the similarity between every other superpixel in the image and these superpixels, and recording all superpixels whose similarity is greater than 0.5;
for the recorded superpixels, computing the color feature similarity and the texture feature similarity of every two superpixels, and recording their product as the apparent feature similarity of the superpixel pair; dividing the superpixels into G groups with 0.5 as the threshold, and taking the average feature of the superpixels in each group as the class center of that group;
alternately updating the class centers and the grouping of the superpixels by an E-M method until the grouping is stable;
and recording the G class centers finally obtained as the coded feature group of the scene category.
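For illustration only, a minimal sketch of the scene appearance encoding above: per-superpixel 32-bin histograms over three normalized color channels (96-dimensional color feature) and over a normalized local binary pattern map (32-dimensional texture feature), with the apparent feature similarity of a superpixel pair taken as the product of its color and texture similarities. The histogram-intersection similarity used for each cue is an assumption (the claim only fixes the product), and `color`, `lbp`, and `sp_labels` are hypothetical inputs.

```python
# Minimal sketch of per-superpixel appearance features and pairwise similarity
# (not the claimed implementation).
import numpy as np

def superpixel_appearance_features(color, lbp, sp_labels, bins=32):
    """color: (H, W, 3) image with channels normalized to [0, 1].
    lbp: (H, W) local binary pattern map normalized to [0, 1].
    sp_labels: (H, W) superpixel index per pixel."""
    num_sp = int(sp_labels.max()) + 1
    color_feat = np.zeros((num_sp, 3 * bins))
    tex_feat = np.zeros((num_sp, bins))
    edges = np.linspace(0.0, 1.0, bins + 1)       # 32 equal intervals on [0, 1]
    for sp in range(num_sp):
        mask = sp_labels == sp
        n = max(int(mask.sum()), 1)
        for c in range(3):                        # 3 channels x 32 bins = 96-d color feature
            hist, _ = np.histogram(color[..., c][mask], bins=edges)
            color_feat[sp, c * bins:(c + 1) * bins] = hist / n
        hist, _ = np.histogram(lbp[mask], bins=edges)
        tex_feat[sp] = hist / n                   # 32-d texture feature
    return color_feat, tex_feat

def appearance_similarity(color_feat, tex_feat, i, j):
    # Histogram-intersection similarity for each cue, combined by product.
    color_sim = np.minimum(color_feat[i], color_feat[j]).sum()
    tex_sim = np.minimum(tex_feat[i], tex_feat[j]).sum()
    return color_sim * tex_sim
```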
CN202010023166.0A 2020-01-09 2020-01-09 Image semantic segmentation method and system based on single pixel annotation Active CN111259936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010023166.0A CN111259936B (en) 2020-01-09 2020-01-09 Image semantic segmentation method and system based on single pixel annotation

Publications (2)

Publication Number Publication Date
CN111259936A (en) 2020-06-09
CN111259936B (en) 2021-06-01

Family

ID=70953930

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010023166.0A Active CN111259936B (en) 2020-01-09 2020-01-09 Image semantic segmentation method and system based on single pixel annotation

Country Status (1)

Country Link
CN (1) CN111259936B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112085739A (en) * 2020-08-20 2020-12-15 深圳力维智联技术有限公司 Semantic segmentation model training method, device and equipment based on weak supervision
CN112241959A (en) * 2020-09-23 2021-01-19 天津大学 Attention mechanism generation semantic segmentation method based on superpixels
CN112800265B (en) * 2021-02-01 2022-03-08 中国科学院空天信息创新研究院 Image segmentation data annotation method and system based on unsupervised or weakly supervised mode
CN112927244A (en) * 2021-03-31 2021-06-08 清华大学 Three-dimensional scene segmentation method and device under weak supervision
CN113361363B (en) * 2021-05-31 2024-02-06 北京百度网讯科技有限公司 Training method, device, equipment and storage medium for face image recognition model
CN113344012B (en) * 2021-07-14 2022-08-23 马上消费金融股份有限公司 Article identification method, device and equipment
CN113780532B (en) * 2021-09-10 2023-10-27 平安科技(深圳)有限公司 Training method, device, equipment and storage medium of semantic segmentation network
CN114758128B (en) * 2022-04-11 2024-04-16 西安交通大学 Scene panorama segmentation method and system based on controlled pixel embedding characterization explicit interaction
CN116012843B (en) * 2023-03-24 2023-06-30 北京科技大学 Virtual scene data annotation generation method and system
CN116740360A (en) * 2023-08-10 2023-09-12 荣耀终端有限公司 Image processing method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529508A (en) * 2016-12-07 2017-03-22 西安电子科技大学 Local and non-local multi-feature semantics-based hyperspectral image classification method
CN109063723A (en) * 2018-06-11 2018-12-21 清华大学 The Weakly supervised image, semantic dividing method of object common trait is excavated based on iteration
CN110084136A (en) * 2019-04-04 2019-08-02 北京工业大学 Context based on super-pixel CRF model optimizes indoor scene semanteme marking method
CN110163239A (en) * 2019-01-25 2019-08-23 太原理工大学 A kind of Weakly supervised image, semantic dividing method based on super-pixel and condition random field

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9396546B2 (en) * 2014-01-21 2016-07-19 Adobe Systems Incorporated Labeling objects in image scenes
US10635927B2 (en) * 2017-03-06 2020-04-28 Honda Motor Co., Ltd. Systems for performing semantic segmentation and methods thereof
CN107169487B (en) * 2017-04-19 2020-02-07 西安电子科技大学 Salient object detection method based on superpixel segmentation and depth feature positioning
CN110378911B (en) * 2019-07-11 2022-06-21 太原科技大学 Weak supervision image semantic segmentation method based on candidate region and neighborhood classifier
CN110598705B (en) * 2019-09-27 2022-02-22 腾讯科技(深圳)有限公司 Semantic annotation method and device for image

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Weakly-Supervised Image Semantic Segmentation Based on Superpixel Region Merging; Quanchun Jiang et al.; Big Data and Cognitive Computing; 2019-06-10; Vol. 3, No. 31; pp. 1-20 *
A Survey of Content-Based Image Segmentation Methods; Jiang Feng et al.; Journal of Software; 2017-01-31; Vol. 28, No. 1; pp. 160-183 *

Also Published As

Publication number Publication date
CN111259936A (en) 2020-06-09

Similar Documents

Publication Publication Date Title
CN111259936B (en) Image semantic segmentation method and system based on single pixel annotation
He et al. Svga-net: Sparse voxel-graph attention network for 3d object detection from point clouds
Adarsh et al. YOLO v3-Tiny: Object Detection and Recognition using one stage improved model
Melekhov et al. Dgc-net: Dense geometric correspondence network
CN106875406B (en) Image-guided video semantic object segmentation method and device
Li et al. A deep learning method of water body extraction from high resolution remote sensing images with multisensors
Dornaika et al. Building detection from orthophotos using a machine learning approach: An empirical study on image segmentation and descriptors
CN109325484B (en) Flower image classification method based on background prior significance
CN113449594B (en) Multilayer network combined remote sensing image ground semantic segmentation and area calculation method
Huang et al. GraNet: Global relation-aware attentional network for semantic segmentation of ALS point clouds
CN114255238A (en) Three-dimensional point cloud scene segmentation method and system fusing image features
He et al. Robust road detection from a single image using road shape prior
CN106557579A (en) A kind of vehicle model searching system and method based on convolutional neural networks
Luo et al. Cross-spatiotemporal land-cover classification from VHR remote sensing images with deep learning based domain adaptation
CN105046714A (en) Unsupervised image segmentation method based on super pixels and target discovering mechanism
CN105740915A (en) Cooperation segmentation method fusing perception information
CN113988147B (en) Multi-label classification method and device for remote sensing image scene based on graph network, and multi-label retrieval method and device
Li et al. MVF-CNN: Fusion of multilevel features for large-scale point cloud classification
CN109961437A (en) A kind of conspicuousness fabric defect detection method under the mode based on machine teaching
CN114283162A (en) Real scene image segmentation method based on contrast self-supervision learning
CN113592894A (en) Image segmentation method based on bounding box and co-occurrence feature prediction
CN113657414B (en) Object identification method
CN114529832A (en) Method and device for training preset remote sensing image overlapping shadow segmentation model
CN112446417B (en) Spindle-shaped fruit image segmentation method and system based on multilayer superpixel segmentation
CN104408158A (en) Viewpoint tracking method based on geometrical reconstruction and semantic integration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant