CN113807354A - Image semantic segmentation method, device, equipment and storage medium - Google Patents

Image semantic segmentation method, device, equipment and storage medium

Info

Publication number: CN113807354A (granted as CN113807354B)
Application number: CN202011592663.9A
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: vector, semantic segmentation, image, basic feature, network
Inventors: 刘欢, 朱翔宇, 王军伟, 吴荣彬
Assignee: Jingdong Technology Holding Co Ltd
Legal status: Granted; Active

Classifications

    • G06F18/241 (Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches)
    • G06F18/253 (Pattern recognition; fusion techniques of extracted features)
    • Y02D10/00 (Energy efficient computing, e.g. low power processors, power management or thermal management)

Abstract

The application provides an image semantic segmentation method, apparatus, device, and storage medium, applied in the technical field of data processing. The method comprises: obtaining an image to be segmented; performing feature extraction on the image to be segmented to obtain a plurality of basic feature vectors; performing pixel perception and/or boundary enhancement processing on the plurality of basic feature vectors to generate a plurality of feature vectors to be processed; generating a target feature vector from the plurality of feature vectors to be processed; and classifying the target feature vector to obtain a semantic segmentation result of the image to be segmented. Because the target feature vector generated after pixel perception and/or boundary enhancement processing is classified to obtain the semantic segmentation result, relevance learning and content understanding between pixels in local regions of the image are strengthened, accurate identification of small objects and boundary information is ensured, and the accuracy of the image semantic segmentation result is improved.

Description

Image semantic segmentation method, device, equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for semantic segmentation of an image.
Background
Generally, semantic segmentation is one of the basic tasks of computer vision and can be applied in various fields, such as smart agriculture, automatic driving, and disease diagnosis.
In the related art, semantic segmentation algorithms based on convolutional neural networks have received increasing attention owing to their high prediction accuracy. However, most current semantic segmentation network models are implemented with a pixel-by-pixel classification method that neglects the correlation among features and therefore lacks an understanding of the image content, so accurate identification of small objects and boundary information is not ensured; in addition, these models suffer from an excessive number of parameters and are thus ill-suited to deployment on mobile devices.
Disclosure of Invention
The present application is directed to solving, at least to some extent, one of the technical problems in the related art.
The application provides an image semantic segmentation method, device, equipment, and storage medium. A target feature vector generated after pixel perception and/or boundary enhancement processing is classified to obtain a semantic segmentation result, which strengthens relevance learning and content understanding between pixels in local regions of an image, ensures accurate identification of small objects and boundary information, and improves the accuracy of the image semantic segmentation result, thereby solving the technical problems of inaccurate image semantic segmentation and excessive computation in the prior art.
An embodiment of a first aspect of the present application provides an image semantic segmentation method, including:
acquiring an image to be segmented;
extracting the features of the image to be segmented to obtain a plurality of basic feature vectors;
performing pixel perception and/or boundary enhancement processing on the plurality of basic feature vectors to generate a plurality of feature vectors to be processed;
generating a target feature vector according to the plurality of feature vectors to be processed;
and classifying the target characteristic vector to obtain a semantic segmentation result of the image to be segmented.
According to the image semantic segmentation method, an image to be segmented is obtained; feature extraction is performed on the image to be segmented to obtain a plurality of basic feature vectors; pixel perception and/or boundary enhancement processing is performed on the plurality of basic feature vectors to generate a plurality of feature vectors to be processed; a target feature vector is generated from the plurality of feature vectors to be processed; and the target feature vector is classified to obtain a semantic segmentation result of the image to be segmented. Because the target feature vector generated after pixel perception and/or boundary enhancement processing is classified to obtain the semantic segmentation result, relevance learning and content understanding between pixels in local regions of the image are strengthened, accurate identification of small objects and boundary information is ensured, and the accuracy of the image semantic segmentation result is improved.
The embodiment of the second aspect of the present application provides an image semantic segmentation apparatus, including:
the first acquisition module is used for acquiring an image to be segmented;
the second acquisition module is used for extracting the features of the image to be segmented to acquire a plurality of basic feature vectors;
the processing module is used for carrying out pixel perception and/or boundary enhancement processing on the plurality of basic feature vectors to generate a plurality of feature vectors to be processed;
the generating module is used for generating a target characteristic vector according to the plurality of characteristic vectors to be processed;
and the third acquisition module is used for classifying the target characteristic vector to acquire a semantic segmentation result of the image to be segmented.
The image semantic segmentation device of the embodiment of the application obtains an image to be segmented; performs feature extraction on the image to be segmented to obtain a plurality of basic feature vectors; performs pixel perception and/or boundary enhancement processing on the plurality of basic feature vectors to generate a plurality of feature vectors to be processed; generates a target feature vector from the plurality of feature vectors to be processed; and classifies the target feature vector to obtain a semantic segmentation result of the image to be segmented. Because the target feature vector generated after pixel perception and/or boundary enhancement processing is classified to obtain the semantic segmentation result, relevance learning and content understanding between pixels in local regions of the image are strengthened, accurate identification of small objects and boundary information is ensured, and the accuracy of the image semantic segmentation result is improved.
An embodiment of a third aspect of the present application provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein when the processor executes the program, the image semantic segmentation method as set forth in the embodiment of the first aspect of the present application is implemented.
An embodiment of a fourth aspect of the present application provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the image semantic segmentation method as set forth in the embodiment of the first aspect of the present application.
In a fifth aspect, the present application provides a computer program product; when the instructions in the computer program product are executed by a processor, the image semantic segmentation method set forth in the embodiment of the first aspect of the present application is performed.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of an image semantic segmentation method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of an image semantic segmentation method according to a second embodiment of the present application;
FIG. 3 is an exemplary diagram of a convolutional network according to an embodiment of the present application;
FIG. 4 is an exemplary diagram of pixel-aware network processing according to an embodiment of the present application;
fig. 5 is a schematic flowchart of an image semantic segmentation method according to a third embodiment of the present application;
FIG. 6 is an exemplary diagram of a border enhanced network process according to an embodiment of the present application;
fig. 7 is a schematic flowchart of an image semantic segmentation method according to a fourth embodiment of the present application;
FIG. 8 is an exemplary diagram of semantic segmentation of an image provided by an embodiment of the present application;
FIG. 9 is an exemplary diagram of semantic segmentation of an image provided by an embodiment of the present application;
fig. 10 is a schematic structural diagram of an image semantic segmentation apparatus according to a fourth embodiment of the present application;
fig. 11 is a schematic structural diagram of an image semantic segmentation apparatus according to a fifth embodiment of the present application;
FIG. 12 illustrates a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
In practical applications, images need to be semantically segmented in scenarios such as smart agriculture and automatic driving. In the related art, semantic segmentation algorithms based on convolutional neural networks neglect the correlation among features, therefore lack an understanding of the image content, and do not ensure accurate identification of small objects and boundary information.
Aiming at the above problems, the application provides an image semantic segmentation method: an image to be segmented is obtained; features of the image to be segmented are extracted to obtain a plurality of basic feature vectors; pixel perception and/or boundary enhancement processing is performed on the plurality of basic feature vectors to generate a plurality of feature vectors to be processed; a target feature vector is generated from the plurality of feature vectors to be processed; and the target feature vector is classified using a classification network in the semantic segmentation network to obtain a semantic segmentation result of the image to be segmented.
In this way, the target feature vector generated after pixel perception and/or boundary enhancement processing is classified to obtain the semantic segmentation result, which strengthens relevance learning and content understanding between pixels in local regions of the image, ensures accurate identification of small objects and boundary information, and improves the accuracy of the image semantic segmentation result.
An image semantic segmentation method, an apparatus, a device, and a storage medium according to embodiments of the present application are described below with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of an image semantic segmentation method according to an embodiment of the present application.
The embodiment of the present application is described by taking as an example that the image semantic segmentation method is configured in an image semantic segmentation apparatus, and the apparatus can be applied to any electronic device, so that the electronic device can perform the image semantic segmentation function.
In the embodiment of the application, semantic segmentation refers to decomposing a body of semantics into its smallest contrasting components and describing and analyzing the relationships between those components so as to determine a particular semantic content; pixel perception may be understood as the acquisition, fusion, and processing of context information between pixels.
As shown in fig. 1, the image semantic segmentation method may include the following steps:
step 101, obtaining an image to be segmented.
In the embodiment of the application, the image to be segmented can be selected and set according to different application scenarios. For example, in an automatic driving scenario, a picture taken by a camera within a preset distance while the vehicle is driving is subjected to preprocessing such as pixel extraction and then used as the image to be segmented; in a disease diagnosis scenario, a picture taken by medical equipment is subjected to preprocessing such as pixel extraction and then used as the image to be segmented.
And 102, extracting the features of the image to be segmented to obtain a plurality of basic feature vectors.
In the embodiment of the application, there are various ways to extract features from the image to be segmented and obtain a plurality of basic feature vectors. As one example, a convolutional network in the semantic segmentation network is used to perform feature extraction on the image to be segmented to obtain a plurality of basic feature vectors; as another example, a BP (back propagation) neural network in the semantic segmentation network is used to perform feature extraction on the image to be segmented to obtain a plurality of basic feature vectors. The choice is made according to the application scenario.
In the embodiment of the present application, as an example, the semantic segmentation network includes: a convolutional network, a pixel-aware network and/or a boundary enhancement network, and a classification network. Introducing the pixel-aware network and/or the boundary enhancement network strengthens the semantic segmentation network's relevance learning and content understanding for pixels in local regions of the image, and helps it better detect boundary information and small objects.
In the embodiment of the application, the preprocessed image to be segmented first undergoes feature extraction through the convolutional network, and the resulting semantically rich basic feature vectors are input into subsequent networks for processing.
In the embodiment of the present application, there are many ways to extract features from the image to be segmented using the convolutional network in the semantic segmentation network, and the choice can be made according to the application scenario; examples follow.
In a first example, a plurality of convolution networks with different resolutions in a semantic segmentation network are used to perform feature extraction on an image to be segmented respectively, so as to obtain a plurality of basic feature vectors.
In a second example, a plurality of convolution networks with the same resolution in the semantic segmentation network are used to perform feature extraction on an image to be segmented respectively, so as to obtain a plurality of basic feature vectors.
And 103, carrying out pixel perception and/or boundary enhancement processing on the plurality of basic feature vectors to generate a plurality of feature vectors to be processed.
In the embodiment of the present application, there are many ways to perform pixel sensing and/or boundary enhancement processing on a plurality of basic feature vectors, which may be specifically selected according to an actual application scenario, and the following are exemplified:
in a first example, a plurality of basic feature vectors are subjected to pixel perception and/or boundary enhancement processing by utilizing a pixel perception network and/or a boundary enhancement network in a semantic segmentation network to generate a plurality of feature vectors to be processed.
Specifically, there are many ways to perform pixel perception and/or boundary enhancement processing on the plurality of basic feature vectors using the pixel-aware network and/or boundary enhancement network in the semantic segmentation network to generate the plurality of feature vectors to be processed, as illustrated below:
example one, a first basic feature sub-vector and a second basic feature sub-vector of each basic feature vector are obtained; generating a transposed characteristic component corresponding to the second basic characteristic sub-vector according to the second basic characteristic sub-vector; generating a first to-be-weighted feature vector corresponding to each basic feature vector according to the transposed feature components corresponding to the first basic feature sub-vector and the second basic feature sub-vector; and generating a plurality of feature vectors to be processed according to each basic feature vector and the corresponding first feature vector to be weighted.
Example two, a transposed feature vector corresponding to each basic feature vector is obtained; generating a second feature vector to be weighted corresponding to each basic feature vector according to each basic feature vector and the corresponding transposed feature vector; and generating a plurality of feature vectors to be processed according to each basic feature vector and the corresponding second feature vector to be weighted.
In a second example, a deep learning integrated neural network in a semantic segmentation network is used for simultaneously carrying out pixel perception and boundary enhancement processing on a plurality of basic feature vectors to obtain a plurality of feature vectors to be processed.
And 104, generating a target feature vector according to the plurality of feature vectors to be processed.
And 105, processing the target characteristic vector to obtain a semantic segmentation result of the image to be segmented.
In the embodiment of the present application, there are many ways to generate the target feature vector according to a plurality of feature vectors to be processed, which may be specifically selected according to practical applications, for example, as follows:
in a first example, a plurality of feature vectors to be processed are subjected to weighted summation to obtain a target feature vector.
In a second example, a target function is obtained according to a relationship between a plurality of historical to-be-processed feature vectors and a historical target feature vector, and the target feature vector is obtained by processing the plurality of to-be-processed feature vectors through the target function.
Further, the target feature vector is processed to obtain a semantic segmentation result of the image to be segmented, namely, the classification prediction of the image to be segmented is realized, so that the purpose of semantic segmentation is achieved.
According to the image semantic segmentation method of the embodiment of the application, an image to be segmented is obtained; feature extraction is performed on the image to be segmented to obtain a plurality of basic feature vectors; pixel perception and/or boundary enhancement processing is performed on the plurality of basic feature vectors to generate a plurality of feature vectors to be processed; a target feature vector is generated from the plurality of feature vectors to be processed; and the target feature vector is classified to obtain a semantic segmentation result of the image to be segmented. Because the target feature vector generated after pixel perception and/or boundary enhancement processing is classified to obtain the semantic segmentation result, relevance learning and content understanding between pixels in local regions of the image are strengthened, accurate identification of small objects and boundary information is ensured, and the accuracy of the image semantic segmentation result is improved.
Based on the description of the above embodiments, it can be understood that the pixel-aware network or the boundary enhancement network can be selected according to the requirements of the application scenario, or both can be used simultaneously for image semantic segmentation. How to perform image semantic segmentation based on the pixel-aware network is described in detail below with reference to fig. 2.
Fig. 2 is a schematic flow chart of an image semantic segmentation method according to a second embodiment of the present application.
As shown in fig. 2, the image semantic segmentation method may include the following steps:
step 201, acquiring an image to be segmented.
In the embodiment of the application, the image to be segmented can be selected and set according to different application scenarios. For example, in an automatic driving scenario, a picture taken by a camera within a preset distance while the vehicle is driving is subjected to preprocessing such as pixel extraction and then used as the image to be segmented; in a disease diagnosis scenario, a picture taken by medical equipment is subjected to preprocessing such as pixel extraction and then used as the image to be segmented.
Step 202, respectively extracting features of the image to be segmented by using a plurality of convolution networks with different resolutions in the semantic segmentation network, and acquiring a plurality of basic feature vectors.
In the embodiment of the present application, a plurality of convolutional networks with different resolutions may be used to extract features from the image to be segmented. The semantic segmentation network of the present application is relatively lightweight, which greatly reduces the amount of computation and the complexity; the convolutional network may be chosen as shown in fig. 3.
Specifically, as shown in fig. 3, the network includes six convolutional layers and the maximum number of channels is only 128. Compared with other networks that use dozens of convolutional layers and hundreds of channels, this network is lighter and greatly reduces the amount of computation and the complexity.
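For illustration, a minimal PyTorch-style sketch of such a lightweight backbone follows. The six-layer count and the 128-channel cap come from the description above; the specific channel progression, strides, and use of batch normalization are assumptions made only so the example runs.

```python
import torch.nn as nn

class LightweightBackbone(nn.Module):
    """Six convolutional layers capped at 128 channels (per the text);
    the channel progression and strides are illustrative assumptions."""
    def __init__(self, in_channels=3):
        super().__init__()
        channels = [32, 64, 64, 128, 128, 128]  # maximum channel count is 128
        layers, prev = [], in_channels
        for i, ch in enumerate(channels):
            layers += [
                nn.Conv2d(prev, ch, kernel_size=3,
                          stride=2 if i < 3 else 1,  # assumed early downsampling
                          padding=1, bias=False),
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
            ]
            prev = ch
        self.features = nn.Sequential(*layers)

    def forward(self, x):
        return self.features(x)  # basic feature map A for the later networks
```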
Step 203, obtaining a first basic feature sub-vector and a second basic feature sub-vector of each basic feature vector, and generating a transposed feature component corresponding to the second basic feature sub-vector according to the second basic feature sub-vector.
Step 204, generating a first to-be-weighted feature vector corresponding to each basic feature vector according to the transposed feature component corresponding to the first basic feature sub-vector and the second basic feature sub-vector, and generating a plurality of to-be-processed feature vectors according to each basic feature vector and the corresponding first to-be-weighted feature vector.
In this embodiment, each basic feature vector may undergo one or more convolution operations to obtain the first basic feature sub-vector and the second basic feature sub-vector, as illustrated below.
In a first example, a convolutional network with a first resolution performs the Nth convolution on the basic feature vector to obtain the first basic feature sub-vector, and a convolutional network with a second resolution performs the Mth convolution on the basic feature vector to obtain the second basic feature sub-vector, where N and M are positive integers.
In a second example, a convolutional network with the first resolution performs the Xth convolution on the basic feature vector to obtain the first basic feature sub-vector, and the same convolutional network performs the Yth convolution on the basic feature vector to obtain the second basic feature sub-vector, where X and Y are different positive integers.
In the embodiment of the application, maintaining the context relationship between pixels is very important in the image semantic segmentation task: it allows the network to understand the relationships between pixels inside an object and between the object and its environment, and thus to classify pixels correctly. For this reason the semantic segmentation network includes a pixel-aware network, which lets it achieve the highest performance with the least computation.
In this embodiment, the first basic feature sub-vector or the second basic feature sub-vector may further be transposed. For example, a transposed feature component corresponding to the second basic feature sub-vector is generated from the second basic feature sub-vector; a first to-be-weighted feature vector corresponding to each basic feature vector is then generated from the first basic feature sub-vector and the transposed feature component corresponding to the second basic feature sub-vector; and a plurality of feature vectors to be processed are generated from each basic feature vector and the corresponding first to-be-weighted feature vector.
For example, as shown in fig. 4, one or more 3 × 3 convolution operations are performed on a basic feature vector A to obtain a first basic feature sub-vector B and a second basic feature sub-vector C. A matrix multiplication is then performed between the transpose of the first basic feature sub-vector B and the second basic feature sub-vector C, the resulting feature vector is normalized with a classification function such as softmax to obtain the first to-be-weighted feature vector S, and finally the feature vector E to be processed is generated from the basic feature vector A and the corresponding first to-be-weighted feature vector S.
The calculation formula of the first to-be-weighted feature vector S is as follows:
Sji = exp(Bi · Cj) / Σi=1..N exp(Bi · Cj)   (1)

where the sum runs over all N pixel positions i.
Finally, the calculation formula for generating the feature vector E to be processed from the basic feature vector A and the corresponding first to-be-weighted feature vector S is as follows:
Ej = σ · Sj + Aj   (2)
the sigma initialization is 0, then the weight of the sigma initialization is gradually increased, each pixel in the finally output feature vector E to be processed keeps semantic information of adjacent pixels, the continuity of the features is ensured, and meanwhile, the pixel perception network only has one convolution layer needing to be trained, so that the parameter quantity is extremely small, and the lightweight of the module is ensured.
Step 205, generating a target feature vector according to the plurality of feature vectors to be processed, and processing the target feature vector by using a classification network in the semantic segmentation network to obtain a semantic segmentation result of the image to be segmented.
In the embodiment of the present application, there are many ways to generate the target feature vector according to a plurality of feature vectors to be processed, which may be specifically selected according to practical applications, for example, as follows:
in a first example, a plurality of feature vectors to be processed are subjected to weighted summation to obtain a target feature vector.
In a second example, a target function is obtained according to a relationship between a plurality of historical to-be-processed feature vectors and a historical target feature vector, and the target feature vector is obtained by processing the plurality of to-be-processed feature vectors through the target function.
Furthermore, the classification network in the semantic segmentation network processes the target feature vector to obtain the semantic segmentation result of the image to be segmented; that is, classification prediction of the image to be segmented is realized, achieving the purpose of semantic segmentation.
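As an illustration of this final step, a minimal classification-head sketch follows. The 1 × 1 projection to class scores, bilinear upsampling, and per-pixel argmax are common design choices assumed here; the text does not fix the internal structure of the classification network.

```python
import torch.nn as nn
import torch.nn.functional as F

class ClassificationHead(nn.Module):
    """Sketch of a classification network for semantic segmentation."""
    def __init__(self, channels, num_classes):
        super().__init__()
        self.classifier = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, target_features, out_size):
        logits = self.classifier(target_features)  # per-class scores
        logits = F.interpolate(logits, size=out_size,
                               mode='bilinear', align_corners=False)
        return logits.argmax(dim=1)  # per-pixel labels: the segmentation result
```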
According to the image semantic segmentation method of this embodiment, an image to be segmented is obtained, and a plurality of convolutional networks with different resolutions in the semantic segmentation network extract features from it to obtain a plurality of basic feature vectors. For each basic feature vector, a first basic feature sub-vector and a second basic feature sub-vector are obtained; a transposed feature component corresponding to the second basic feature sub-vector is generated; a first to-be-weighted feature vector is generated from the first basic feature sub-vector and the transposed feature component; and a plurality of feature vectors to be processed are generated from each basic feature vector and its corresponding first to-be-weighted feature vector. A target feature vector is then generated from the plurality of feature vectors to be processed, and the classification network in the semantic segmentation network processes it to obtain the semantic segmentation result of the image to be segmented. Introducing the pixel-aware network thereby strengthens relevance learning and content understanding between pixels in local regions of the image and improves the accuracy of the image semantic segmentation result.
Based on the description of the above embodiments, it can be understood that the pixel-aware network or the boundary enhancement network can be selected according to the requirements of the application scenario, or both can be used simultaneously for image semantic segmentation. How to perform image semantic segmentation based on the boundary enhancement network is described in detail below with reference to fig. 5.
Fig. 5 is a schematic flow chart of an image semantic segmentation method according to a third embodiment of the present application.
As shown in fig. 5, the image semantic segmentation method may include the following steps:
step 301, acquiring an image to be segmented.
In the embodiment of the application, the image to be segmented can be selected and set according to different application scenarios. For example, in an automatic driving scenario, a picture taken by a camera within a preset distance while the vehicle is driving is subjected to preprocessing such as pixel extraction and then used as the image to be segmented; in a disease diagnosis scenario, a picture taken by medical equipment is subjected to preprocessing such as pixel extraction and then used as the image to be segmented.
Step 302, respectively extracting features of the image to be segmented by using a plurality of convolution networks with different resolutions in the semantic segmentation network, and acquiring a plurality of basic feature vectors.
In the embodiment of the present application, a plurality of convolutional networks with different resolutions may be used to extract features from the image to be segmented. The semantic segmentation network of the present application is relatively lightweight, which greatly reduces the amount of computation and the complexity; the convolutional network may be chosen as shown in fig. 3.
Specifically, as shown in fig. 3, the network includes six convolutional layers and the maximum number of channels is only 128. Compared with other networks that use dozens of convolutional layers and hundreds of channels, this network is lighter and greatly reduces the amount of computation and the complexity.
Step 303, obtaining a transposed feature vector corresponding to each basic feature vector, and generating a second feature vector to be weighted corresponding to each basic feature vector according to each basic feature vector and the corresponding transposed feature vector.
And 304, generating a plurality of feature vectors to be processed according to each basic feature vector and the corresponding second feature vector to be weighted.
It can be understood that, in general, after a picture passes through the multiple convolutional layers and pooling layers of a neural network, a certain amount of small-object information and boundary information is lost, which easily causes some internal features of an object, or small objects themselves, to be incorrectly identified as background features and greatly reduces the performance of the algorithm. Therefore, a boundary enhancement network is added to retain the boundary information, so that object segmentation is performed correctly and the loss of boundary information is reduced.
For example, as shown in fig. 6, the boundary enhancement network differs from the pixel-aware network in that it directly multiplies the basic feature vector A by its transpose without any convolution operation, normalizes the resulting feature vector with a classification function such as softmax to obtain the second to-be-weighted feature vector B, and finally adds B to the basic feature vector A to obtain the final output feature vector E to be processed.
The calculation formula of the second to-be-weighted feature vector B is as follows:
Bji = exp(Ai · Aj) / Σi=1..N exp(Ai · Aj)   (3)
Finally, the calculation formula for generating the feature vector E to be processed from the basic feature vector A and the corresponding second to-be-weighted feature vector B is as follows:
Ej = ω · Bj + Aj   (4)
the omega initialization is 0, then the weight of the omega initialization is gradually increased, and the boundary enhancement network does not contain a convolution layer, so that the problem of losing boundary information and small object information does not exist, correct object segmentation is realized, the finally obtained output E ensures the integrity of the small object information and the boundary information, and meanwhile, the boundary enhancement network does not have any convolution layer needing to be learned, and the light weight of the boundary enhancement network is ensured.
And 305, generating a target feature vector according to the plurality of feature vectors to be processed, and processing the target feature vector by using a classification network in the semantic segmentation network to obtain a semantic segmentation result of the image to be segmented.
In the embodiment of the present application, there are many ways to generate the target feature vector according to a plurality of feature vectors to be processed, which may be specifically selected according to practical applications, for example, as follows:
in a first example, a plurality of feature vectors to be processed are subjected to weighted summation to obtain a target feature vector.
In a second example, a target function is obtained according to a relationship between a plurality of historical to-be-processed feature vectors and a historical target feature vector, and the target feature vector is obtained by processing the plurality of to-be-processed feature vectors through the target function.
Furthermore, a classification network in the semantic segmentation network is utilized to process the target feature vector, and a semantic segmentation result of the image to be segmented is obtained, namely, classification prediction of the image to be segmented is realized, so that the purpose of semantic segmentation is achieved.
According to the image semantic segmentation method of this embodiment, an image to be segmented is obtained, and a plurality of convolutional networks with different resolutions in the semantic segmentation network extract features from it to obtain a plurality of basic feature vectors. A transposed feature vector corresponding to each basic feature vector is obtained; a second to-be-weighted feature vector corresponding to each basic feature vector is generated from each basic feature vector and its corresponding transposed feature vector; and a plurality of feature vectors to be processed are generated from each basic feature vector and the corresponding second to-be-weighted feature vector. A target feature vector is then generated from the plurality of feature vectors to be processed, and the classification network in the semantic segmentation network processes it to obtain the semantic segmentation result of the image to be segmented. Introducing the boundary enhancement network thus ensures accurate identification of small objects and boundary information and improves the accuracy of the image semantic segmentation result.
Based on the description of the above embodiments, it can be understood that the pixel-aware network or the boundary enhancement network can be selected according to the requirements of the application scenario, or both can be used simultaneously. How to perform image semantic segmentation with the pixel-aware network and the boundary enhancement network selected simultaneously is described in detail below with reference to fig. 7.
Fig. 7 is a schematic flowchart of an image semantic segmentation method according to a fourth embodiment of the present application.
As shown in fig. 7, the image semantic segmentation method may include the following steps:
step 401, performing convolution processing on the image to be segmented by adopting convolution kernels with different resolutions respectively to obtain a plurality of basic feature vectors.
And step 402, performing content understanding and small object and boundary detection on each basic feature vector by utilizing a pixel perception network and a boundary enhancement network which are connected in parallel in each branch.
And 403, adding and fusing the feature vectors of each branch to calculate loss, acquiring target feature vectors, classifying, and acquiring a semantic segmentation result of the image to be segmented.
In the embodiment of the present application, the convolution kernels with different resolutions are, for example, the 1 × 1, 3 × 3, and 5 × 5 convolution kernels shown in fig. 8. The feature vectors to be processed of each branch are added and fused to obtain the target feature vector; in the fusion process, different weight coefficients, such as 0.2, 0.3, and 0.5, may be assigned to the branches for calculating the loss according to the actual application requirements, where the weight coefficients of all branches sum to 1.
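A minimal sketch of this additive fusion follows, assuming the branch outputs have already been brought to a common shape; the weights 0.2, 0.3, and 0.5 are the illustrative values mentioned above.

```python
def fuse_branches(branch_features, weights=(0.2, 0.3, 0.5)):
    """Weighted additive fusion of per-branch feature maps.

    branch_features: tensors of identical shape (N, C, H, W), one per branch;
    weights: per-branch coefficients that must sum to 1, as stated above.
    """
    assert abs(sum(weights) - 1.0) < 1e-6, "branch weight coefficients must sum to 1"
    return sum(w * f for w, f in zip(weights, branch_features))  # target feature vector
```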
Therefore, the pixel-aware network and the boundary enhancement network connected in parallel in each branch give the network regional content understanding and small-object and boundary detection capabilities, and finally the features of the branches are added and fused to calculate the loss and obtain the semantic segmentation result, which improves the generalization ability and recognition performance of the semantic segmentation network. Meanwhile, each sub-network of the semantic segmentation network has few parameters, which keeps the whole network lightweight.
In the embodiment of the present application, one or more semantic segmentation networks may also be combined for multitask training. As shown in fig. 9, the present application adopts a multitask learning approach: two semantic segmentation networks are unified in one integrated architecture, and the weighted sum of the losses of the two networks is used as the final loss of the model during training, enabling end-to-end training, as shown in the following formula:
Loss = θ · Loss1 + (1 - θ) · Loss2   (5)
In this way, multiple tasks learn simultaneously and jointly optimize the network, which effectively reduces the risk of overfitting, improves generalization, and allows the semantic segmentation network to achieve higher performance.
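Formula (5) itself maps to a one-line helper; in this sketch θ is treated as a hyperparameter in [0, 1] whose value the text does not specify.

```python
def multitask_loss(loss1, loss2, theta):
    """Formula (5): final loss as a weighted sum of the two networks' losses."""
    return theta * loss1 + (1.0 - theta) * loss2
```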
As a scene example, take a large-scale urban landscape dataset that records street scenes of 50 different cities and provides 5000 real images of driving scenes in urban environments, with 2975 training images, 500 validation images, and 1525 test images, annotated with dense pixel labels for 19 classes.
Specifically, based on this dataset, processing through the pixel-aware network and the boundary enhancement network respectively shows that both the accuracy and the detection efficiency of the semantic segmentation result are improved to a certain extent.
Specifically, for example, a poly learning rate strategy (a polynomially decaying learning rate) is adopted with the base learning rate set to 0.001 and the power parameter set to 0.9; an ADAM (adaptive moment estimation) optimizer is used with the batch size set to 12, the betas (the optimizer's exponential decay rates) set to (0.9, 0.999), and the weight decay set to 0.0005. In addition, the training set can be expanded with cropping, random flipping, scale transformation, and mean interpolation, with the image scaling set to {0.5, 0.75, 1, 1.25, 1.5}, and the semantic segmentation network is trained with a cross-entropy loss function, which further improves the accuracy of subsequent recognition.
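This training configuration maps directly onto standard PyTorch components. The sketch below uses the stated hyperparameters (base learning rate 0.001, poly power 0.9, betas (0.9, 0.999), weight decay 0.0005); the model and the total iteration count are placeholders, and the batch size of 12 would be set on the data loader.

```python
import torch

def build_training(model, max_iters):
    """Optimizer, poly learning-rate schedule, and loss per the stated settings."""
    optimizer = torch.optim.Adam(
        model.parameters(), lr=0.001, betas=(0.9, 0.999), weight_decay=0.0005
    )
    # poly schedule: lr = base_lr * (1 - iter / max_iters) ** 0.9
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda it: (1.0 - it / max_iters) ** 0.9
    )
    criterion = torch.nn.CrossEntropyLoss()  # cross-entropy segmentation loss
    return optimizer, scheduler, criterion
```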
To further examine the context understanding of the pixel-aware network and the boundary enhancement network for pixels in object regions, and their detection of object boundaries and small objects, specific image semantic segmentation data are compared for four configurations: using neither network, using only the pixel-aware network, using only the boundary enhancement network, and using both networks simultaneously, as shown in Table 1.
TABLE 1 semantic segmentation result comparison information
Network configuration                              Accuracy
Neither pixel-aware nor boundary enhancement       64.8%
Pixel-aware network only                           66.1% (+1.3%)
Boundary enhancement network only                  66.5% (+1.7%)
Both networks                                      67.3% (+2.5%)
Specifically, as shown in Table 1, without the pixel-aware network and the boundary enhancement network, the accuracy of the semantic segmentation network is 64.8%. When the pixel-aware network is added alone, the accuracy improves by 1.3%; when the boundary enhancement network is added alone, the accuracy improves by 1.7%. Each network therefore plays a role and improves the performance of the semantic segmentation network. Moreover, when the pixel-aware network and the boundary enhancement network are used simultaneously, the accuracy improves by 2.5%, so combining the two networks further enhances the performance of the semantic segmentation network.
The convolutional network, the pixel-aware network and/or boundary enhancement network, and the classification network in the semantic segmentation network of the embodiment of the application all have small parameter counts and computation requirements and high detection speed, so the method is suitable for deployment on mobile devices to complete tasks such as assisted driving and assisted diagnosis and treatment.
In summary, the pixel-aware network and the boundary enhancement network are provided to improve the network's context understanding for pixels in object regions and its detection of object boundaries and small objects, and applying them to the semantic segmentation task achieves good results. All components of the network architecture cooperate and act together, so high performance is achieved with a small number of parameters; the result is a practical network that balances parameter count and accuracy and is well suited to deployment on mobile devices.
In order to implement the above embodiments, the present application further provides an image semantic segmentation apparatus.
Fig. 10 is a schematic structural diagram of an image semantic segmentation apparatus according to a fourth embodiment of the present application.
As shown in fig. 10, the image semantic segmentation apparatus 1000 may include: a first obtaining module 1010, a second obtaining module 1020, a processing module 1030, a generating module 1040, and a third obtaining module 1050.
The first obtaining module 1010 is configured to obtain an image to be segmented.
The second obtaining module 1020 is configured to perform feature extraction on the image to be segmented, and obtain a plurality of basic feature vectors.
A processing module 1030, configured to perform pixel sensing and/or boundary enhancement processing on the multiple basic feature vectors to generate multiple feature vectors to be processed.
The generating module 1040 is configured to generate a target feature vector according to the multiple feature vectors to be processed.
And a third obtaining module 1050, configured to perform classification processing on the target feature vector, and obtain a semantic segmentation result of the image to be segmented.
Further, in a possible implementation manner of the embodiment of the present application, the second obtaining module 1020 is specifically configured to: and respectively extracting the features of the image to be segmented by utilizing a plurality of convolution networks with different resolutions in the semantic segmentation network to obtain a plurality of basic feature vectors.
Further, in a possible implementation manner of the embodiment of the present application, performing pixel perception and/or boundary enhancement processing on a plurality of basic feature vectors to generate a plurality of feature vectors to be processed includes: and carrying out pixel perception and/or boundary enhancement processing on the plurality of basic feature vectors by utilizing a pixel perception network and/or a boundary enhancement network in the semantic segmentation network to generate a plurality of feature vectors to be processed.
Further, in a possible implementation manner of the embodiment of the present application, the classifying the target feature vector to obtain a semantic segmentation result of the image to be segmented includes: and classifying the target characteristic vectors by utilizing a classification network in the semantic segmentation network to obtain a semantic segmentation result of the image to be segmented.
Further, in a possible implementation manner of the embodiment of the present application, referring to fig. 11, on the basis of the embodiment shown in fig. 10, the processing module 1030 includes: an acquisition unit 1031, a first generation unit 1032, a second generation unit 1033, and a third generation unit 1034.
An obtaining unit 1031, configured to obtain the first basic feature sub-vector and the second basic feature sub-vector of each basic feature vector.
A first generating unit 1032 is configured to generate a transposed feature component corresponding to the second primitive feature sub-vector according to the second primitive feature sub-vector.
A second generating unit 1033, configured to generate a first to-be-weighted feature vector corresponding to each basic feature vector according to the transposed feature component corresponding to the first basic feature sub-vector and the second basic feature sub-vector.
A third generating unit 1034, configured to generate a plurality of feature vectors to be processed according to each basic feature vector and the corresponding first feature vector to be weighted.
Further, in a possible implementation manner of the embodiment of the present application, the first generating unit 1032 is specifically configured to: perform the Nth convolution on the basic feature vector using a convolutional network with the first resolution to obtain the first basic feature sub-vector, and perform the Mth convolution on the basic feature vector using a convolutional network with the second resolution to obtain the second basic feature sub-vector, where N and M are positive integers.
Further, in a possible implementation manner of the embodiment of the present application, the processing module 1030 is specifically configured to: acquiring a transposed feature vector corresponding to each basic feature vector; generating a second feature vector to be weighted corresponding to each basic feature vector according to each basic feature vector and the corresponding transposed feature vector; and generating a plurality of feature vectors to be processed according to each basic feature vector and the corresponding second feature vector to be weighted.
It should be noted that the foregoing explanation on the embodiment of the image semantic segmentation method is also applicable to the image semantic segmentation apparatus of the embodiment, and is not repeated here.
The image semantic segmentation apparatus of the embodiment of the application obtains an image to be segmented; performs feature extraction on the image to be segmented to obtain a plurality of basic feature vectors; performs pixel perception and/or boundary enhancement processing on the plurality of basic feature vectors to generate a plurality of feature vectors to be processed; generates a target feature vector from the plurality of feature vectors to be processed; and classifies the target feature vector to obtain a semantic segmentation result of the image to be segmented. Because the target feature vector generated after pixel perception and/or boundary enhancement processing is classified to obtain the semantic segmentation result, relevance learning and content understanding between pixels in local regions of the image are strengthened, accurate identification of small objects and boundary information is ensured, and the accuracy of the image semantic segmentation result is improved.
In order to implement the above embodiments, the present application also provides an electronic device, including: the image semantic segmentation method comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein when the processor executes the program, the image semantic segmentation method is realized according to the embodiment of the application.
In order to achieve the above embodiments, the present application further proposes a non-transitory computer-readable storage medium storing a computer program, which when executed by a processor implements the image semantic segmentation method as proposed by the foregoing embodiments of the present application.
In order to implement the foregoing embodiments, the present application also provides a computer program product; when the instructions in the computer program product are executed by a processor, the image semantic segmentation method as set forth in the foregoing embodiments of the present application is performed.
FIG. 12 illustrates a block diagram of an exemplary server suitable for use in implementing embodiments of the present application. The server 12 shown in fig. 12 is only an example, and should not bring any limitation to the function and the scope of use of the embodiments of the present application.
As shown in fig. 12, the server 12 is in the form of a general purpose computing device. The components of the server 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor bus or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The server 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by server 12 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The server 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 12, and commonly referred to as a "hard drive"). Although not shown in FIG. 12, a disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact Disc Read-Only Memory (CD-ROM), a Digital Versatile Disc Read-Only Memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the embodiments of the application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a networking environment. Program modules 42 generally perform the functions and/or methods of the embodiments described herein.
The server 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the server 12, and/or with any device (e.g., a network card, a modem, etc.) that enables the server 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, the server 12 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via the network adapter 20. As shown in FIG. 12, the network adapter 20 communicates with the other modules of the server 12 via the bus 18. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the server 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
By running programs stored in the system memory 28, the processing unit 16 executes various functional applications and data processing, for example, implementing the image semantic segmentation method mentioned in the foregoing embodiments.
In the description herein, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine different embodiments or examples, and features of different embodiments or examples, described in this specification, provided they do not contradict one another.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, for example, two or three, unless specifically limited otherwise.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a custom logic function or process. The scope of the preferred embodiments of the present application includes alternative implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in the reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art to which the present application pertains.
The logic and/or steps represented in the flowcharts or otherwise described herein, for example, an ordered listing of executable instructions for implementing logical functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch and execute the instructions. For the purposes of this description, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (an electronic device) having one or more wires, a portable computer diskette (a magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). The computer-readable medium may even be paper or another suitable medium upon which the program is printed, as the program can be captured electronically, for instance by optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or a combination of the following techniques known in the art may be used: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and the like.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be implemented by instructing related hardware through a program. The program may be stored in a computer-readable storage medium and, when executed, performs one of, or a combination of, the steps of the method embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

Although embodiments of the present application have been shown and described above, it should be understood that the above embodiments are exemplary and are not to be construed as limiting the present application; those of ordinary skill in the art may make variations, modifications, substitutions, and alterations to the above embodiments within the scope of the present application.

Claims (16)

1. An image semantic segmentation method, comprising:
acquiring an image to be segmented;
extracting the features of the image to be segmented to obtain a plurality of basic feature vectors;
performing pixel perception and/or boundary enhancement processing on the plurality of basic feature vectors to generate a plurality of feature vectors to be processed;
generating a target feature vector according to the plurality of feature vectors to be processed;
and classifying the target characteristic vector to obtain a semantic segmentation result of the image to be segmented.
2. The image semantic segmentation method according to claim 1, wherein the extracting features of the image to be segmented to obtain a plurality of basic feature vectors includes:
and respectively extracting the features of the image to be segmented by utilizing a plurality of convolution networks with different resolutions in the semantic segmentation network to obtain a plurality of basic feature vectors.
3. The image semantic segmentation method according to claim 1, wherein the performing pixel perception and/or boundary enhancement processing on the plurality of basic feature vectors to generate a plurality of feature vectors to be processed includes:
and performing pixel perception and/or boundary enhancement processing on the plurality of basic feature vectors by utilizing a pixel perception network and/or a boundary enhancement network in a semantic segmentation network to generate the plurality of feature vectors to be processed.
4. The image semantic segmentation method according to claim 3, wherein the performing pixel perception processing on the plurality of basic feature vectors by using the pixel perception network in the semantic segmentation network to generate the plurality of feature vectors to be processed includes:
acquiring a first basic feature sub-vector and a second basic feature sub-vector of each basic feature vector;
generating a transposed feature component corresponding to the second basic feature sub-vector according to the second basic feature sub-vector;
generating a first to-be-weighted feature vector corresponding to each basic feature vector according to the first basic feature sub-vector and the transposed feature component corresponding to the second basic feature sub-vector;
and generating the plurality of feature vectors to be processed according to each basic feature vector and the corresponding first feature vector to be weighted.
5. The image semantic segmentation method according to claim 4, wherein the acquiring a first basic feature sub-vector and a second basic feature sub-vector of each basic feature vector comprises:
performing convolution processing on the basic feature vector for the N-th time by using a convolution network with a first resolution to obtain the first basic feature sub-vector;
performing convolution processing on the basic feature vector for the M-th time by using a convolution network with a second resolution to obtain the second basic feature sub-vector; wherein N and M are positive integers.
6. The image semantic segmentation method according to claim 3, wherein performing boundary enhancement processing on the plurality of basic feature vectors by using a boundary enhancement network in the semantic segmentation network to generate a plurality of feature vectors to be processed includes:
acquiring a transposed feature vector corresponding to each basic feature vector;
generating a second feature vector to be weighted corresponding to each basic feature vector according to each basic feature vector and the corresponding transposed feature vector;
and generating the plurality of feature vectors to be processed according to each basic feature vector and the corresponding second feature vector to be weighted.
7. The image semantic segmentation method according to claim 1, wherein the classifying the target feature vector to obtain the semantic segmentation result of the image to be segmented comprises:
and classifying the target feature vector by using a classification network in the semantic segmentation network to obtain the semantic segmentation result of the image to be segmented.
8. An image semantic segmentation apparatus, comprising:
the first acquisition module is used for acquiring an image to be segmented;
the second acquisition module is used for extracting the features of the image to be segmented to acquire a plurality of basic feature vectors;
the processing module is used for carrying out pixel perception and/or boundary enhancement processing on the plurality of basic feature vectors to generate a plurality of feature vectors to be processed;
the generating module is used for generating a target characteristic vector according to the plurality of characteristic vectors to be processed;
and the third acquisition module is used for classifying the target characteristic vector to acquire a semantic segmentation result of the image to be segmented.
9. The image semantic segmentation apparatus according to claim 8, wherein the second obtaining module is specifically configured to:
and respectively extracting the features of the image to be segmented by utilizing a plurality of convolution networks with different resolutions in the semantic segmentation network to obtain a plurality of basic feature vectors.
10. The image semantic segmentation apparatus according to claim 8, wherein the processing module is specifically configured to:
and performing pixel perception and/or boundary enhancement processing on the plurality of basic feature vectors by utilizing a pixel perception network and/or a boundary enhancement network in a semantic segmentation network to generate the plurality of feature vectors to be processed.
11. The image semantic segmentation apparatus according to claim 10, wherein the processing module includes:
an obtaining unit, configured to obtain a first basic feature sub-vector and a second basic feature sub-vector of each basic feature vector;
a first generating unit, configured to generate a transposed feature component corresponding to the second basic feature sub-vector according to the second basic feature sub-vector;
a second generating unit, configured to generate a first to-be-weighted feature vector corresponding to each basic feature vector according to the first basic feature sub-vector and the transposed feature component corresponding to the second basic feature sub-vector;
a third generating unit, configured to generate the multiple feature vectors to be processed according to each basic feature vector and the corresponding first feature vector to be weighted.
12. The image semantic segmentation apparatus according to claim 11, wherein the first generation unit is specifically configured to:
performing convolution processing on the basic feature vector for the N-th time by using a convolution network with a first resolution to obtain the first basic feature sub-vector;
performing convolution processing on the basic feature vector for the M-th time by using a convolution network with a second resolution to obtain the second basic feature sub-vector; wherein N and M are positive integers.
13. The image semantic segmentation apparatus according to claim 10, wherein the processing module is specifically configured to:
acquiring a transposed feature vector corresponding to each basic feature vector;
generating a second feature vector to be weighted corresponding to each basic feature vector according to each basic feature vector and the corresponding transposed feature vector;
and generating the plurality of feature vectors to be processed according to each basic feature vector and the corresponding second feature vector to be weighted.
14. The image semantic segmentation apparatus according to claim 8, wherein the third obtaining module is specifically configured to:
and classifying the target feature vector by using a classification network in the semantic segmentation network to obtain the semantic segmentation result of the image to be segmented.
15. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the image semantic segmentation method according to any one of claims 1 to 7 when executing the program.
16. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method for semantic segmentation of images according to any one of claims 1 to 7.
CN202011592663.9A 2020-12-29 2020-12-29 Image semantic segmentation method, device, equipment and storage medium Active CN113807354B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011592663.9A CN113807354B (en) 2020-12-29 2020-12-29 Image semantic segmentation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011592663.9A CN113807354B (en) 2020-12-29 2020-12-29 Image semantic segmentation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113807354A true CN113807354A (en) 2021-12-17
CN113807354B CN113807354B (en) 2023-11-03

Family

ID=78943582

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011592663.9A Active CN113807354B (en) 2020-12-29 2020-12-29 Image semantic segmentation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113807354B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180129912A1 (en) * 2016-11-07 2018-05-10 Nec Laboratories America, Inc. System and Method for Learning Random-Walk Label Propagation for Weakly-Supervised Semantic Segmentation
CN110245665A (en) * 2019-05-13 2019-09-17 天津大学 Image, semantic dividing method based on attention mechanism
US20200372660A1 (en) * 2019-05-21 2020-11-26 Beihang University Image salient object segmentation method and apparatus based on reciprocal attention between foreground and background
CN111144418A (en) * 2019-12-31 2020-05-12 北京交通大学 Railway track area segmentation and extraction method
CN111612802A (en) * 2020-04-29 2020-09-01 杭州电子科技大学 Re-optimization training method based on existing image semantic segmentation model and application
CN111582175A (en) * 2020-05-09 2020-08-25 中南大学 High-resolution remote sensing image semantic segmentation method sharing multi-scale countermeasure characteristics
CN111738310A (en) * 2020-06-04 2020-10-02 科大讯飞股份有限公司 Material classification method and device, electronic equipment and storage medium
CN111666921A (en) * 2020-06-30 2020-09-15 腾讯科技(深圳)有限公司 Vehicle control method, apparatus, computer device, and computer-readable storage medium
CN111833273A (en) * 2020-07-17 2020-10-27 华东师范大学 Semantic boundary enhancement method based on long-distance dependence

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Jun Fu et al., "Dual Attention Network for Scene Segmentation", arXiv:1809.02983v4
Luo Xuegang, Lü Junrui, Peng Zhenming, "Recent Research Progress in Superpixel Segmentation and Evaluation", Laser & Optoelectronics Progress, no. 09

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115170934A (en) * 2022-09-05 2022-10-11 粤港澳大湾区数字经济研究院(福田) Image segmentation method, system, equipment and storage medium
CN115170934B (en) * 2022-09-05 2022-12-23 粤港澳大湾区数字经济研究院(福田) Image segmentation method, system, equipment and storage medium

Also Published As

Publication number Publication date
CN113807354B (en) 2023-11-03

Similar Documents

Publication Publication Date Title
US11200424B2 (en) Space-time memory network for locating target object in video content
CN109086811B (en) Multi-label image classification method and device and electronic equipment
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN112465828A (en) Image semantic segmentation method and device, electronic equipment and storage medium
CN111210446B (en) Video target segmentation method, device and equipment
CN108961180B (en) Infrared image enhancement method and system
CN113591968A (en) Infrared weak and small target detection method based on asymmetric attention feature fusion
CN112329702B (en) Method and device for rapid face density prediction and face detection, electronic equipment and storage medium
CN111696110B (en) Scene segmentation method and system
CN111523463B (en) Target tracking method and training method based on matching-regression network
CN111126385A (en) Deep learning intelligent identification method for deformable living body small target
CN109977834B (en) Method and device for segmenting human hand and interactive object from depth image
Shu et al. LVC-Net: Medical image segmentation with noisy label based on local visual cues
CN112132164B (en) Target detection method, system, computer device and storage medium
Xie et al. Deepmatcher: a deep transformer-based network for robust and accurate local feature matching
CN111428664A (en) Real-time multi-person posture estimation method based on artificial intelligence deep learning technology for computer vision
GB2579262A (en) Space-time memory network for locating target object in video content
CN115205150A (en) Image deblurring method, device, equipment, medium and computer program product
CN114596440A (en) Semantic segmentation model generation method and device, electronic equipment and storage medium
CN111914949B (en) Zero sample learning model training method and device based on reinforcement learning
CN113807354B (en) Image semantic segmentation method, device, equipment and storage medium
CN114943834B (en) Full-field Jing Yuyi segmentation method based on prototype queue learning under few labeling samples
CN113762231B (en) End-to-end multi-pedestrian posture tracking method and device and electronic equipment
Pei et al. FGO-Net: Feature and Gaussian Optimization Network for visual saliency prediction
CN116670687A (en) Method and system for adapting trained object detection models to domain offsets

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant