Image semantic segmentation and intelligent driving control method and device

Publication number
CN111523548A
Authority
CN
China
Prior art keywords
feature map
image
edge
original
segmented
Prior art date
Legal status
Granted
Application number
CN202010331448.7A
Other languages
Chinese (zh)
Other versions
CN111523548B (en)
Inventor
李祥泰
程光亮
李夏
石建萍
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority to CN202010331448.7A
Publication of CN111523548A
Application granted
Publication of CN111523548B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle


Abstract

The present disclosure provides an image semantic segmentation method and apparatus, including: performing first feature extraction on an image to be segmented to obtain an original feature map; generating an offset feature map corresponding to the original feature map based on the original feature map, wherein the value of each feature point in the offset feature map represents the offset to be applied to the feature point at the corresponding position in the original feature map; generating an edge feature map and a main body feature map corresponding to the image to be segmented based on the offset feature map and the original feature map, wherein the edge feature map comprises object edge features in the image to be segmented and the main body feature map comprises object main body features in the image to be segmented; and generating a semantic segmentation image corresponding to the image to be segmented based on the edge feature map and the main body feature map.

Description

Image semantic segmentation and intelligent driving control method and device
Technical Field
The present disclosure relates to the field of computer technology, and in particular to an image semantic segmentation method and apparatus and an intelligent driving control method and apparatus.
Background
Image semantic segmentation is an important component of image processing and machine vision technology. Semantic segmentation classifies each pixel point in an image, determining the category of each point (for example, background, person, or vehicle), and thereby divides the image into regions. At present, semantic segmentation is widely applied in scenarios such as automatic driving and unmanned aerial vehicle landing-point determination.
In the related art, when performing semantic segmentation of an image, the image to be segmented is generally segmented directly by a neural network; however, since the neural network has a limited receptive field, two parts belonging to the same object may be segmented into different categories, thereby affecting the segmentation result.
Disclosure of Invention
The embodiment of the disclosure at least provides an image semantic segmentation and intelligent driving control method and device.
In a first aspect, an embodiment of the present disclosure provides an image semantic segmentation method, including:
performing first feature extraction on an image to be segmented to obtain an original feature map;
generating an offset feature map corresponding to the original feature map based on the original feature map, wherein the value of each feature point in the offset feature map represents a value which needs to be offset for the feature point at a position corresponding to the position of the feature point in the original feature map;
generating an edge feature map and a main feature map corresponding to the image to be segmented based on the offset feature map and the original feature map; the edge feature map comprises object edge features in the image to be segmented, and the main feature map comprises object main features in the image to be segmented;
and generating a semantic segmentation image corresponding to the image to be segmented based on the edge feature map and the main body feature map.
Here, the features of regions belonging to the same object are similar. The value of each feature point in the offset feature map determined from the original feature map indicates the offset that should be applied to the feature point at the corresponding position in the original feature map. After the feature points in the original feature map are offset according to the offset feature map, the regional features belonging to the same object in the original feature map become more aggregated, so that the feature region belonging to the same object contains a more comprehensive set of feature points; the feature part belonging to edges (i.e., the edge feature map) and the feature part belonging to the main body (i.e., the main body feature map) in the original feature map can therefore be distinguished according to the offset feature map and the original feature map. In addition, because offsetting the feature points in the original feature map makes the regional features belonging to the same object more concentrated, which is equivalent to enlarging the receptive field of the neural network in the process of generating the main body feature map, the semantic segmentation image generated based on the edge feature map and the main body feature map has higher precision.
In a possible implementation manner, the generating, based on the original feature map, an offset feature map corresponding to the original feature map includes:
performing feature extraction on the original feature map to generate a depth feature map corresponding to the original feature map;
and generating an offset feature map corresponding to the original feature map according to the original feature map and the depth feature map.
Here, since the features of regions belonging to the same target object in the image to be segmented should be similar, the depth feature map generated by feature extraction on the original feature map contains the high-level features of the original feature map, that is, the high-level features belonging to the same target object. The offset feature map generated from the original feature map and the depth feature map therefore contains both the features of the original feature map and the high-level features belonging to the same target object in the original feature map. Consequently, offsetting the feature points in the original feature map under the control of the offset feature map allows feature points belonging to the same target object to be aggregated, so that the main body feature part can be extracted from the original feature map.
In a possible implementation manner, the generating a depth feature map corresponding to the original feature map based on the original feature map includes:
and performing down-sampling processing on the original feature map, and performing up-sampling processing on the feature map subjected to the down-sampling processing to obtain a depth feature map corresponding to the original feature map.
In a possible implementation, the generating an offset feature map corresponding to the original feature map according to the original feature map and the depth feature map includes:
and cascading the original feature map and the depth feature map, and extracting features of the cascaded feature map to obtain the offset feature map.
In a possible implementation manner, the generating an edge feature map and a main feature map corresponding to the image to be segmented based on the offset feature map and the original feature map includes:
generating a main feature map corresponding to the image to be segmented based on the offset feature map and the original feature map;
and generating an edge feature map corresponding to the image to be segmented based on the original feature map and the main body feature map.
In a possible implementation manner, the generating a main feature map corresponding to the image to be segmented based on the offset feature map and the original feature map includes:
offsetting each feature point in the original feature map according to the offset value corresponding to that feature point in the offset feature map, to obtain an intermediate feature map corresponding to the original feature map;
and performing bilinear interpolation on the value of each feature point in the intermediate feature map according to the weight at the corresponding position in the offset feature map, to obtain a main feature map corresponding to the original feature map.
In a possible implementation manner, the generating an edge feature map corresponding to the image to be segmented based on the original feature map and the main feature map includes:
and subtracting the values of the feature points at the corresponding positions of the original feature map and the main feature map, and generating an edge feature map corresponding to the image to be segmented according to the value obtained after subtraction.
In a possible embodiment, the method further comprises:
performing second feature extraction on the image to be segmented to obtain a low-level feature map; the number of convolutions corresponding to the low-level feature map is smaller than the number of convolutions corresponding to the original feature map;
the step of subtracting the values of the feature points at the corresponding positions of the original feature map and the main feature map, and generating an edge feature map corresponding to the image to be segmented according to the value obtained after subtraction includes:
subtracting the values of the feature points of the original feature map and the main feature map at the corresponding positions to obtain an initial edge feature map;
and cascading the low-level feature map and the initial edge feature map, and performing feature extraction on the cascaded feature map to obtain an edge feature map corresponding to the image to be segmented.
In the low-level feature map, the edge features of the target object in the original image are more obvious, and the low-level feature map and the initial edge feature map are cascaded, so that the edge features in the initial edge feature map can be supplemented, and the edge identification precision of the edge feature map is improved.
In a possible implementation manner, the generating a semantic segmentation image corresponding to the image to be segmented based on the edge feature map and the main body feature map includes:
adding the values of the feature points of the edge feature map and the main body feature map at the corresponding positions to obtain a semantic feature map corresponding to the image to be segmented;
and performing convolution operation on the semantic feature map to obtain a semantic segmentation image corresponding to the image to be segmented.
In a possible implementation manner, the semantic segmentation image is obtained by processing the image to be segmented through a neural network;
the neural network is obtained by training by adopting the following method:
acquiring a sample image with first labeling information and second labeling information, wherein the first labeling information is a label added to a pixel region of a target object in the sample image, and the second labeling information is a label added to an edge of the target object in the sample image;
inputting the sample image into the neural network to obtain an edge feature map, a main feature map and a semantic feature map corresponding to the sample image;
determining a predicted edge image corresponding to the sample image based on the edge feature map; determining a prediction subject image corresponding to the sample image based on the subject feature map; determining a predicted semantic segmentation image corresponding to the sample image based on the semantic feature map;
and determining a loss value in the training process based on the predicted edge image, the predicted main body image, the predicted semantic segmentation image and the first annotation information and the second annotation information of the sample image, and training the neural network based on the loss value.
In the training process, the edge feature part and the main body feature part of the original feature map are supervised separately, which allows more targeted training than supervised training that simply adds all loss values together; a neural network trained by this method therefore has higher segmentation precision.
In a possible embodiment, the determining a loss value in the present training process based on the predicted edge image, the predicted subject image, the predicted semantic segmentation image, and the first annotation information and the second annotation information of the sample image includes:
determining a first loss value based on first annotation information of the prediction subject image and the sample image; and the number of the first and second groups,
determining a second loss value based on the predicted edge image, the first annotation information and the second annotation information of the sample image; and the number of the first and second groups,
determining a third loss value based on the first annotation information of the prediction semantic segmentation image and the sample image;
and determining a loss value in the training process based on the first loss value, the second loss value and the third loss value.
In one possible embodiment, the determining a second loss value based on the predicted edge image, the first annotation information of the sample image, and the second annotation information includes:
determining a first edge prediction loss in the training process based on the predicted edge image and second labeling information of the sample image; determining a second edge prediction loss in the training process based on the predicted edge image and the first marking information of the sample image;
and carrying out weighted summation on the first edge prediction loss and the second edge prediction loss to obtain a second loss value.
Here, the second loss value includes both a loss for edge prediction accuracy and a loss for semantic prediction at pixels predicted to be edges; training the neural network with this loss allows edge prediction and edge semantic prediction to be optimized separately.
In one possible embodiment, the determining the loss value in the current training process based on the first loss value, the second loss value, and the third loss value includes:
and carrying out weighted summation on the first loss value, the second loss value and the third loss value to obtain a loss value in the training process.
In a second aspect, an embodiment of the present disclosure further provides an intelligent driving control method, including:
acquiring an image acquired by a driving device in the driving process;
performing semantic segmentation on the image using the image semantic segmentation method described in the first aspect or any one of the possible implementations of the first aspect;
and controlling the driving device based on the semantic segmentation result.
In a third aspect, an embodiment of the present disclosure further provides an image semantic segmentation apparatus, including:
the characteristic extraction module is used for carrying out first characteristic extraction on the image to be segmented to obtain an original characteristic image;
a first generating module, configured to generate, based on the original feature map, an offset feature map corresponding to the original feature map, where a value of each feature point in the offset feature map indicates a value that a feature point at a position corresponding to a position of the feature point in the original feature map needs to be offset;
a second generation module, configured to generate an edge feature map and a main feature map corresponding to the image to be segmented based on the offset feature map and the original feature map; the edge feature map comprises object edge features in the image to be segmented, and the main feature map comprises object main features in the image to be segmented;
and the image segmentation module is used for generating a semantic segmentation image corresponding to the image to be segmented based on the edge feature map and the main body feature map.
In one possible implementation manner, when generating the offset feature map corresponding to the original feature map based on the original feature map, the first generating module is configured to:
performing feature extraction on the original feature map to generate a depth feature map corresponding to the original feature map;
and generating an offset feature map corresponding to the original feature map according to the original feature map and the depth feature map.
In a possible implementation manner, when performing feature extraction on the original feature map to generate a depth feature map corresponding to the original feature map, the first generating module is configured to:
and performing down-sampling processing on the original feature map, and performing up-sampling processing on the feature map subjected to the down-sampling processing to obtain a depth feature map corresponding to the original feature map.
In a possible implementation manner, when generating the offset feature map corresponding to the original feature map according to the original feature map and the depth feature map, the first generating module is configured to:
and cascading the original feature map and the depth feature map, and extracting features of the cascaded feature map to obtain the offset feature map.
In a possible implementation manner, when generating the edge feature map and the main feature map corresponding to the image to be segmented based on the offset feature map and the original feature map, the second generating module is configured to:
generating a main feature map corresponding to the image to be segmented based on the offset feature map and the original feature map;
and generating an edge feature map corresponding to the image to be segmented based on the original feature map and the main body feature map.
In a possible implementation manner, when generating the main feature map corresponding to the image to be segmented based on the offset feature map and the original feature map, the second generating module is configured to:
offsetting each feature point in the original feature map according to the offset value corresponding to that feature point in the offset feature map, to obtain an intermediate feature map corresponding to the original feature map;
and performing bilinear interpolation on the value of each feature point in the intermediate feature map according to the weight at the corresponding position in the offset feature map, to obtain a main feature map corresponding to the original feature map.
In a possible implementation manner, when generating an edge feature map corresponding to the image to be segmented based on the original feature map and the main feature map, the second generating module is configured to:
and subtracting the values of the feature points at the corresponding positions of the original feature map and the main feature map, and generating an edge feature map corresponding to the image to be segmented according to the value obtained after subtraction.
In a possible implementation, the feature extraction module is further configured to:
performing second feature extraction on the image to be segmented to obtain a low-level feature map; the number of convolutions corresponding to the low-level feature map is smaller than the number of convolutions corresponding to the original feature map;
the image segmentation module is configured to, when subtracting values of feature points at corresponding positions of the original feature map and the main feature map, and generating an edge feature map corresponding to the image to be segmented according to the value obtained by the subtraction, be configured to:
subtracting the values of the feature points of the original feature map and the main feature map at the corresponding positions to obtain an initial edge feature map;
and cascading the low-level feature map and the initial edge feature map, and performing feature extraction on the cascaded feature map to obtain an edge feature map corresponding to the image to be segmented.
In a possible implementation manner, when generating a semantic segmentation image corresponding to the image to be segmented based on the edge feature map and the main body feature map, the image segmentation module is configured to:
adding the values of the feature points of the edge feature map and the main body feature map at the corresponding positions to obtain a semantic feature map corresponding to the image to be segmented;
and performing convolution operation on the semantic feature map to obtain a semantic segmentation image corresponding to the image to be segmented.
In a possible implementation manner, the semantic segmentation image is obtained by processing the image to be segmented through a neural network;
the apparatus further comprises a training module for training the neural network according to the following method:
acquiring a sample image with first labeling information and second labeling information, wherein the first labeling information is a label added to a pixel region of a target object in the sample image, and the second labeling information is a label added to an edge of the target object in the sample image;
inputting the sample image into the neural network to obtain an edge feature map, a main feature map and a semantic feature map corresponding to the sample image;
determining a predicted edge image corresponding to the sample image based on the edge feature map; determining a prediction subject image corresponding to the sample image based on the subject feature map; determining a predicted semantic segmentation image corresponding to the sample image based on the semantic feature map;
and determining a loss value in the training process based on the predicted edge image, the predicted main body image, the predicted semantic segmentation image and the first annotation information and the second annotation information of the sample image, and training the neural network based on the loss value.
In one possible embodiment, the training module, when determining the loss value in the training process based on the predicted edge image, the predicted subject image, the predicted semantic segmentation image, and the first annotation information and the second annotation information of the sample image, is configured to:
determining a first loss value based on first annotation information of the prediction subject image and the sample image; and the number of the first and second groups,
determining a second loss value based on the predicted edge image, the first annotation information and the second annotation information of the sample image; and the number of the first and second groups,
determining a third loss value based on the first annotation information of the prediction semantic segmentation image and the sample image;
and determining a loss value in the training process based on the first loss value, the second loss value and the third loss value.
In one possible embodiment, the training module, when determining the second loss value based on the predicted edge image, the first label information of the sample image, and the second label information, is configured to:
determining a first edge prediction loss in the training process based on the predicted edge image and second labeling information of the sample image; determining a second edge prediction loss in the training process based on the predicted edge image and the first marking information of the sample image;
and carrying out weighted summation on the first edge prediction loss and the second edge prediction loss to obtain a second loss value.
In one possible embodiment, when determining the loss value in the current training process based on the first loss value, the second loss value, and the third loss value, the training module is configured to:
and carrying out weighted summation on the first loss value, the second loss value and the third loss value to obtain a loss value in the training process.
In a fourth aspect, an embodiment of the present disclosure further provides an intelligent driving control device, including:
the acquisition module is used for acquiring images acquired by the driving device in the driving process;
an image segmentation module, configured to perform semantic segmentation on the image by using an image semantic segmentation method according to the first aspect or any one of the possible embodiments of the first aspect;
and the control module is used for controlling the running device based on the semantic segmentation result.
In a fifth aspect, an embodiment of the present disclosure further provides a computer device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the computer device is running, the machine-readable instructions when executed by the processor performing the steps of the first aspect described above, or any one of the possible implementations of the first aspect, or performing the steps of the second aspect described above.
In a sixth aspect, an embodiment of the present disclosure further provides a computer-readable storage medium, on which a computer program is stored; when executed by a processor, the computer program performs the steps in the first aspect, or any one of the possible implementations of the first aspect, or performs the steps in the second aspect.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required in the embodiments are briefly described below. The drawings herein are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It should be appreciated that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 shows a flowchart of an image semantic segmentation method provided by an embodiment of the present disclosure;
FIG. 2 illustrates a flow chart for generating an offset feature map corresponding to an original feature map provided by an embodiment of the present disclosure;
FIG. 3 is a flow chart illustrating a method for generating an edge feature map and a body feature map according to an embodiment of the present disclosure;
FIG. 4 is a diagram illustrating an image semantic segmentation method provided by an embodiment of the present disclosure;
fig. 5 is a schematic flow chart illustrating a training method of a neural network provided in an embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating a neural network training process provided by an embodiment of the present disclosure;
fig. 7 is a flowchart illustrating an intelligent driving control method according to an embodiment of the disclosure;
fig. 8 is a schematic diagram illustrating an architecture of an image semantic segmentation apparatus provided in an embodiment of the present disclosure;
fig. 9 is a schematic diagram illustrating an architecture of an intelligent driving control device provided in an embodiment of the present disclosure;
fig. 10 shows a schematic structural diagram of a computer device 1000 provided by an embodiment of the present disclosure;
fig. 11 shows a schematic structural diagram of a computer device 1100 provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
In the related art, when performing semantic segmentation of an image, the image to be segmented is generally segmented directly by a neural network; however, since the neural network has a limited receptive field, two portions belonging to the same object may be segmented into different categories, for example, a wheel and the body of a vehicle may be segmented into two different categories.
In addition, when performing semantic segmentation on an image, down-sampling the image to be segmented is required to perform feature extraction on the image to be segmented, and in the process of down-sampling, edge information of an object in the image to be segmented may be lost, so that a segmentation result of an object edge in a final semantic segmentation result is affected.
The above-mentioned drawbacks were identified by the inventors after practical and careful study; therefore, the discovery of the above problems and the solutions proposed in the present disclosure for these problems should be regarded as the inventors' contribution to the present disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
To facilitate understanding of the present embodiment, the image semantic segmentation method disclosed in the embodiments of the present disclosure is first described in detail. The execution subject of the image semantic segmentation method provided in the embodiments of the present disclosure is generally an electronic device with certain computing capability, for example: a terminal device, which may be User Equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device or a wearable device, or a server or other processing device. In some possible implementations, the image semantic segmentation method may be implemented by a processor calling computer-readable instructions stored in a memory.
Referring to fig. 1, a flowchart of an image semantic segmentation method provided in the embodiment of the present disclosure is shown, where the method includes steps 101 to 104, where:
step 101, performing first feature extraction on an image to be segmented to obtain an original feature map.
Step 102, generating an offset feature map corresponding to the original feature map based on the original feature map, wherein the value of each feature point in the offset feature map represents the offset to be applied to the feature point at the corresponding position in the original feature map.
Step 103, generating an edge feature map and a main body feature map corresponding to the image to be segmented based on the offset feature map and the original feature map; the edge feature map comprises object edge features in the image to be segmented, and the main body feature map comprises object main body features in the image to be segmented.
Step 104, generating a semantic segmentation image corresponding to the image to be segmented based on the edge feature map and the main body feature map.
The method comprises the steps of generating an offset feature map based on an original feature map of an image to be segmented, generating an edge feature map and a main feature map corresponding to the image to be segmented according to the offset feature map and the original feature map, and generating a semantic segmentation image corresponding to the image to be segmented according to the edge feature map and the main feature map.
The value of each feature point in the offset feature map determined from the original feature map indicates the offset that should be applied to the feature point at the corresponding position in the original feature map. After the feature points in the original feature map are offset according to the offset feature map, the regional features belonging to the same object in the original feature map become more aggregated, so that the feature region belonging to the same object contains a more comprehensive set of feature points; the feature part belonging to edges (i.e., the edge feature map) and the feature part belonging to the main body (i.e., the main body feature map) in the original feature map can therefore be distinguished according to the offset feature map and the original feature map. In addition, because offsetting the feature points in the original feature map makes the regional features belonging to the same object more concentrated, which is equivalent to enlarging the receptive field of the neural network in the process of generating the main body feature map, the semantic segmentation image generated based on the edge feature map and the main body feature map has higher precision.
The following is a detailed description of the above steps 101 to 104, and it should be noted that the methods in the above steps 101 to 104 are all executed by a neural network, and the semantic segmentation image is obtained by processing the image to be segmented by the neural network.
For step 101,
The neural network comprises a convolutional sub-network. To perform the first feature extraction on the image to be segmented, the image to be segmented may be input into the convolutional sub-network of the neural network and subjected to a plurality of convolution operations. In one possible implementation, the convolutional sub-network may be a deep residual network (ResNet).
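As a non-authoritative illustration, the following is a minimal sketch of this first feature extraction, assuming a PyTorch ResNet backbone from torchvision; the layer split and channel sizes are illustrative choices, not specified by the disclosure.

```python
import torch.nn as nn
from torchvision.models import resnet50

class Backbone(nn.Module):
    """Illustrative backbone: early stage -> low-level map, deep stages -> original map F."""
    def __init__(self):
        super().__init__()
        r = resnet50()
        # Early stage produces the low-level feature map used later for edge details.
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool, r.layer1)
        # Deeper stages produce the "original feature map" F.
        self.deep = nn.Sequential(r.layer2, r.layer3, r.layer4)

    def forward(self, image):
        low_level = self.stem(image)      # F_fine: fewer convolutions, edge-rich
        original = self.deep(low_level)   # F: the original feature map
        return original, low_level
```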
With respect to step 102,
When generating the offset feature map corresponding to the original feature map based on the original feature map, reference may be made to the method shown in fig. 2, which includes the following steps:
step 201, performing feature extraction on the original feature map to generate a depth feature map corresponding to the original feature map.
In specific implementation, when performing feature extraction on the original feature map to generate a depth feature map corresponding to the original feature map, the original feature map may first be down-sampled to extract the high-level features corresponding to the original feature map, and the down-sampled feature map may then be up-sampled to obtain the depth feature map corresponding to the original feature map.
Because the original feature map is down-sampled, the size of the down-sampled feature map is smaller than that of the original feature map; the down-sampled feature map is therefore up-sampled. The features contained in the up-sampled feature map are the same as those of the down-sampled feature map, but its size matches that of the original feature map, and this up-sampled feature map is the depth feature map corresponding to the original feature map.
Step 202, generating an offset feature map corresponding to the original feature map according to the original feature map and the depth feature map.
In specific implementation, when the offset feature map corresponding to the original feature map is generated according to the original feature map and the depth feature map, the original feature map and the depth feature map may be cascaded first, and then feature extraction may be performed on the cascaded feature maps to obtain the offset feature map.
When the original feature map and the depth feature map are cascaded, for example, if the size of the original feature map is H × W × C and the size of the depth feature map is the same (also H × W × C), the size of the cascaded feature map is H × W × 2C; feature extraction is then performed on the cascaded feature map to obtain the offset feature map, whose size is again the same as that of the original feature map, H × W × C. Here, H × W denotes the length and width, and C denotes the number of channels.
The feature extraction of the concatenated feature map may be a convolution operation of the concatenated feature map, where the size of a convolution kernel is preset, and the size of the convolution kernel may be adjusted according to actual conditions in specific implementation.
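A hedged sketch of steps 201 and 202 follows, assuming NCHW tensors. The patent describes an H × W × C offset feature map; for simplicity this sketch predicts only per-pixel (dx, dy) offsets plus a weight channel, and the pooling factor and 3 × 3 convolution are illustrative assumptions consistent with the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

class OffsetHead(nn.Module):
    """Illustrative offset branch: original map -> depth map -> concatenation -> offsets."""
    def __init__(self, channels):
        super().__init__()
        # Predicts (dx, dy) offsets plus a weight map from the concatenated maps.
        self.conv = nn.Conv2d(2 * channels, 3, kernel_size=3, padding=1)

    def forward(self, original):
        n, c, h, w = original.shape
        # Down-sample to capture high-level (body) context, then up-sample back
        # so the depth feature map matches the original spatial size.
        down = F_nn.adaptive_avg_pool2d(original, (h // 4, w // 4))
        depth = F_nn.interpolate(down, size=(h, w), mode='bilinear', align_corners=False)
        # Concatenate along channels (H x W x 2C) and extract the offset feature map.
        offset_map = self.conv(torch.cat([original, depth], dim=1))
        return offset_map  # channels: (dx, dy, weight), an illustrative layout
```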
For step 103,
In an example of the present disclosure, when generating an edge feature map and a main feature map corresponding to an image to be segmented based on an offset feature map and an original feature map, reference may be made to the method shown in fig. 3, which includes the following steps:
step 301, generating a main feature map corresponding to the image to be segmented based on the offset feature map and the original feature map.
Specifically, when the main body feature map corresponding to the image to be segmented is generated based on the offset feature map and the original feature map, each feature point in the original feature map may be offset according to the offset value corresponding to that feature point in the offset feature map, to obtain an intermediate feature map corresponding to the original feature map; bilinear interpolation is then performed on the value of each feature point in the intermediate feature map according to the weight at the corresponding position in the offset feature map, to obtain the main body feature map corresponding to the original feature map.
Because the offset feature map and the original feature map have the same size, each position in the original feature map corresponds one-to-one with a position in the offset feature map. The value at each position in the offset feature map includes the offset value for the feature point at the corresponding position in the original feature map, so each feature point in the original feature map can be offset according to the offset value corresponding to its position.
Illustratively, if the coordinates of position A in the original feature map are (x1, y1), the position corresponding to position A in the offset feature map is position B, and the offset value at position B is (x2, y2), then the value at position A is shifted to the position with coordinates (x1 + x2, y1 + y2).
After the characteristic points are shifted, the characteristic points corresponding to the same object area are gathered together, so that the characteristic points belonging to the main body part in the original characteristic diagram can be further gathered by shifting the characteristic diagram.
In addition to the offset value, the value at each position in the offset feature map also includes a weight, which is the weight of each position point used in bilinear interpolation. Specifically, when performing bilinear interpolation on the value of each feature point in the intermediate feature map according to the weight at the corresponding position in the offset feature map, reference may be made to the following formula:
$$F_{body}(p_x) = \sum_{p \in \mathcal{N}(p_l)} \omega_p \, F(p)$$
wherein F_body denotes the main body feature map, F_body(p_x) denotes the value of feature point p_x in the main body feature map, ω_p denotes the weight of the p-th feature point in the offset feature map, N(p_l) denotes the feature points (generally four feature points) adjacent to feature point p_l in the intermediate feature map obtained by offsetting the original feature map, and F(p) denotes the value of the p-th feature point in the intermediate feature map.
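The warp described by this formula can be sketched with standard tensor operations. In the following hedged example, grid_sample's built-in bilinear kernel plays the role of the weighted interpolation over the neighbouring points N(p_l) (the per-point weights ω_p from the offset feature map are approximated by the standard bilinear weights), and the (dx, dy) channel layout of the offsets is an assumption.

```python
import torch
import torch.nn.functional as F_nn

def warp_to_body(original, offsets):
    """original: (N, C, H, W); offsets: (N, 2, H, W) in pixel units (dx, dy)."""
    n, c, h, w = original.shape
    # Base sampling grid covering every feature point of the original map.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    base = torch.stack([xs, ys], dim=0).float().to(original.device)  # (2, H, W)
    grid = base.unsqueeze(0) + offsets                               # shifted positions
    # Normalise to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)                     # (N, H, W, 2)
    # Bilinear interpolation over the neighbouring feature points yields F_body.
    return F_nn.grid_sample(original, grid, mode='bilinear', align_corners=True)
```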
And 302, generating an edge feature map corresponding to the image to be segmented based on the original feature map and the main body feature map.
The original feature map includes a main feature and an edge feature, and after the main feature is obtained based on step 301, other features except the main feature in the original feature map may be used as the edge feature.
Specifically, the values of the feature points at the corresponding positions of the original feature map and the main feature map may be subtracted, and an edge feature map corresponding to the image to be segmented is generated according to the value obtained by the subtraction.
In a possible implementation, second feature extraction may be performed on the image to be segmented to obtain a low-level feature map, where the number of convolutions corresponding to the low-level feature map is smaller than that corresponding to the original feature map; the edge feature map corresponding to the image to be segmented is then generated based on the low-level feature map and the feature map obtained after the subtraction.
The second feature extraction on the image to be segmented may use the neural network used for the first feature extraction in step 101. The more convolutions are applied to the image to be segmented, the more prominent the main body features become and the weaker the edge features become; conversely, the fewer convolutions are applied, the more prominent the edge features and the weaker the main body features. Since the low-level feature map is used to supplement edge details, the number of convolutions corresponding to the low-level feature map is smaller than that corresponding to the original feature map.
In specific implementation, when generating the edge feature map corresponding to the image to be segmented, the values of the feature points at corresponding positions of the original feature map and the main body feature map may be subtracted to obtain an initial edge feature map; the low-level feature map and the initial edge feature map are then cascaded, and feature extraction is performed on the cascaded feature map to obtain the edge feature map corresponding to the image to be segmented.
The feature extraction of the cascaded feature map may be a convolution operation of the cascaded feature map to obtain an edge feature map corresponding to the image to be segmented.
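A minimal sketch of this edge branch is given below, assuming the low-level feature map is resized to the spatial size of the original feature map before concatenation; the 1 × 1 fusion convolution and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

class EdgeHead(nn.Module):
    """Illustrative edge branch: (original - body) fused with the low-level map."""
    def __init__(self, channels, low_channels):
        super().__init__()
        self.fuse = nn.Conv2d(channels + low_channels, channels, kernel_size=1)

    def forward(self, original, body, low_level):
        # Subtract the body features to obtain the initial edge feature map.
        initial_edge = original - body
        # Resize the low-level map so the two maps can be concatenated.
        low_level = F_nn.interpolate(low_level, size=initial_edge.shape[2:],
                                     mode='bilinear', align_corners=False)
        # Concatenate with the low-level map to restore edge detail, then fuse.
        return self.fuse(torch.cat([initial_edge, low_level], dim=1))
```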
With respect to step 104,
The edge feature map and the main feature map have the same size and are consistent with the size of the original feature map, when a semantic segmentation image corresponding to the image to be segmented is generated based on the edge feature map and the main feature map, the values of feature points of the edge feature map and the main feature map at corresponding positions can be added to obtain the semantic feature map corresponding to the image to be segmented, and then the semantic feature map is subjected to convolution operation to obtain the semantic segmentation image corresponding to the image to be segmented.
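A minimal sketch of this fusion step follows; the number of classes and the up-sampling to the input resolution are assumptions, not details specified by the disclosure.

```python
import torch.nn as nn
import torch.nn.functional as F_nn

class FusionHead(nn.Module):
    """Illustrative fusion: body + edge -> semantic feature map -> class logits."""
    def __init__(self, channels, num_classes):
        super().__init__()
        self.classifier = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, body, edge, out_size):
        semantic = body + edge                   # element-wise addition of the two maps
        logits = self.classifier(semantic)       # per-pixel class scores
        return F_nn.interpolate(logits, size=out_size, mode='bilinear',
                                align_corners=False)
```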
The above image semantic segmentation method will be described with reference to the detailed drawings.
Referring to fig. 4, which shows a schematic diagram of the image semantic segmentation method provided in the embodiment of the present disclosure, the method includes two parts: one is the generation process of the main body feature map, and the other is the generation process of the edge feature map.
First, the generation process of the main body feature map will be described. F in the figure represents the original feature map corresponding to the image to be segmented. The original feature map F is first down-sampled to obtain the feature map F_low, and F_low is then up-sampled to obtain the depth feature map F_α corresponding to the original feature map F. Next, the original feature map F and the depth feature map F_α are cascaded (concatenated), a convolution operation with a 3 × 3 convolution kernel is performed on the result, and feature extraction yields the offset feature map. Then, a warping (warp) operation is performed according to the original feature map F and the offset feature map (that is, the feature points are first offset according to the offset values, and bilinear interpolation is then performed according to the weights), finally obtaining the main body feature map F_body.
In the process of generating the edge feature map, the main body feature map F_body is first subtracted from the original feature map F to obtain the feature map F_β; the feature map F_β and the low-level feature map F_fine are then cascaded (concatenated), and a convolution operation with a 1 × 1 convolution kernel is performed on the result to obtain the edge feature map F_edge.
After the main body feature map F_body and the edge feature map F_edge are obtained, the values of the feature points at corresponding positions of F_body and F_edge are added to obtain the semantic feature map F_final; a convolution operation with a 1 × 1 convolution kernel is then performed on F_final to obtain the semantic segmentation image corresponding to the image to be segmented.
The training process of the neural network used in the above process will be described below. Referring to fig. 5, a schematic flow chart of a training method for a neural network provided in the embodiment of the present disclosure includes the following steps:
step 501, obtaining a sample image with first labeling information and second labeling information, where the first labeling information is a label added to a pixel region of a target object in the sample image, and the second labeling information is a label added to an edge of the target object in the sample image.
The labeling information of the sample image may be to add a label to each pixel point in the sample image, or to add a label to a pixel region of a target object of the sample image, where the target object includes an object in the sample image and a background in the sample image.
In one possible implementation, the same label may be added to pixel regions belonging to the same target object in the sample image, and labels are different between different target objects.
When adding the second annotation information to the sample image, an annotation can be added to each pixel point in the sample image, where the annotation is used to indicate whether the pixel point is an edge pixel point. Illustratively, a 0-1 label may be added to each pixel, where 0 indicates that the pixel is not an edge pixel, and 1 indicates that the pixel is an edge pixel.
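The disclosure does not prescribe how the 0-1 edge labels are produced; one common way, shown here purely as an assumption, is to derive them from the dense per-pixel class labels of the first annotation information.

```python
import numpy as np

def edge_labels_from_mask(label_map, thickness=2):
    """label_map: (H, W) integer class ids. Returns an (H, W) array of 0/1 edge labels."""
    h, w = label_map.shape
    edge = np.zeros((h, w), dtype=np.uint8)
    # A pixel is an edge pixel if any 4-neighbour carries a different class id.
    edge[:-1, :] |= (label_map[:-1, :] != label_map[1:, :]).astype(np.uint8)
    edge[1:, :]  |= (label_map[1:, :]  != label_map[:-1, :]).astype(np.uint8)
    edge[:, :-1] |= (label_map[:, :-1] != label_map[:, 1:]).astype(np.uint8)
    edge[:, 1:]  |= (label_map[:, 1:]  != label_map[:, :-1]).astype(np.uint8)
    # Optionally thicken the edge band with a simple dilation (illustrative choice).
    for _ in range(thickness - 1):
        padded = np.pad(edge, 1)
        edge = (padded[:-2, 1:-1] | padded[2:, 1:-1] |
                padded[1:-1, :-2] | padded[1:-1, 2:] | edge).astype(np.uint8)
    return edge
```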
Step 502, inputting the sample image into the neural network, and obtaining an edge feature map, a main feature map and a semantic feature map corresponding to the sample image.
Corresponding to the graph shown in fig. 4, after the sample image is input into the neural network, an edge feature graph F corresponding to the sample image can be obtainededgeMain body feature diagram FbodyAnd semantic feature map Ffinal
Step 503, determining a predicted edge image corresponding to the sample image based on the edge feature map; determining a prediction subject image corresponding to the sample image based on the subject feature map; and determining a predicted semantic segmentation image corresponding to the sample image based on the semantic feature map.
When the predicted edge image corresponding to the sample image is determined based on the edge feature map, performing convolution operation on the edge feature map through a 1-by-1 convolution kernel to obtain the predicted edge image; when the prediction main body image corresponding to the sample image is determined based on the main body feature map, performing convolution operation on the main body feature map through a 1-by-1 convolution kernel to obtain the prediction main body image; and when the predicted semantic segmentation image corresponding to the sample image is determined based on the semantic feature map, performing convolution operation on the semantic feature map through a 1-by-1 convolution kernel to obtain the predicted semantic segmentation image.
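A minimal sketch of these 1 × 1 convolution prediction heads is shown below; splitting the edge prediction into a binary edge channel and per-class edge semantics is an assumption made to serve the two edge losses described later, and num_classes is illustrative.

```python
import torch.nn as nn

class PredictionHeads(nn.Module):
    """Illustrative 1x1 heads for the body, edge and final semantic predictions."""
    def __init__(self, channels, num_classes):
        super().__init__()
        self.body_head = nn.Conv2d(channels, num_classes, kernel_size=1)
        self.edge_head = nn.Conv2d(channels, 1, kernel_size=1)            # edge / not edge
        self.edge_sem_head = nn.Conv2d(channels, num_classes, kernel_size=1)
        self.final_head = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, body_feat, edge_feat, semantic_feat):
        return (self.body_head(body_feat),        # predicted body image
                self.edge_head(edge_feat),        # predicted edge map (binary)
                self.edge_sem_head(edge_feat),    # edge semantics for the second edge loss
                self.final_head(semantic_feat))   # predicted semantic segmentation image
```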
Step 504, determining a loss value in the training process based on the predicted edge image, the predicted main body image, the predicted semantic segmentation image and the first annotation information and the second annotation information of the sample image, and training the neural network based on the loss value.
In a specific implementation, when determining a loss value in the training process based on the predicted edge image, the predicted main image, the predicted semantic segmentation image, and the first annotation information and the second annotation information of the sample image, the method may include the following three steps:
determining a first loss value based on first annotation information of the prediction subject image and the sample image.
The first loss value can be used for representing the accuracy of the identification of the main body part in the training process.
(II) determining a second loss value based on the predicted edge image, the first marking information and the second marking information of the sample image.
The second loss value comprises two parts of loss values, wherein one part is the loss value for identifying whether the pixel point is an edge pixel point, and the other part is the loss value for identifying the semantic identification of the edge pixel point.
Specifically, when determining the second loss value, the first edge prediction loss (i.e., the loss for identifying whether a pixel point is an edge pixel point) in the training process may be determined based on the predicted edge image and the second labeling information of the sample image; the second edge prediction loss (i.e., the loss for semantic recognition of edge pixel points) in the training process is determined based on the predicted edge image and the first labeling information of the sample image; the first edge prediction loss and the second edge prediction loss are then weighted and summed, and the summation result is taken as the second loss value.
When determining the second edge prediction loss in the training process, in order to improve calculation efficiency, the cross-entropy loss value corresponding to each pixel point in the predicted edge image may first be calculated; the pixel points are then sorted by cross-entropy loss in descending order and the top K pixel points are selected for optimization; among these K pixel points, those whose confidence is greater than a preset confidence threshold are selected as target pixel points, the cross-entropy loss of the target pixel points is calculated, and the result is taken as the second edge prediction loss.
Here, when semantic prediction is performed on edge pixel points, the edge pixel points may lie between two objects. For example, if an image contains a "person" and a "car" and the person is leaning against the outside of the car, the pixel points at the position where the person and the car meet are difficult for the model to predict.
Based on this, the method provided by the present disclosure calculates the second edge prediction loss by means of hard sample mining. Specifically, the second edge prediction loss is calculated from the loss values of the K pixel points whose cross entropy loss is ranked highest: the higher the cross entropy loss value corresponding to a pixel point, the lower the accuracy of the model prediction for that pixel point may be. Pixel points with higher cross entropy loss values (i.e., hard samples) are therefore selected from the pixel points of the sample image, and the second edge prediction loss value is calculated based on these selected pixel points. When the network parameters are adjusted based on the second edge prediction loss value, the recognition accuracy for hard samples can be enhanced, thereby improving the accuracy of the neural network for edge prediction.
For example, the second edge prediction loss may be calculated according to the following formula:
second edge prediction loss = -(1/K) · Σ_{i=1}^{N} w_i · 1[ℓ_i ∈ topK ∧ σ(b_i) > t_b] · log p_{i,ỹ_i}
wherein K represents the number of pixel points to be optimized and is a preset value; N represents the number of pixel points in the sample image; w_i represents the weight of the i-th pixel point and is a preset value; 1[·] is an indicator function whose value is 1 when the condition in [·] is satisfied and 0 otherwise; ℓ_i ∈ topK expresses that the cross entropy loss of the i-th pixel point is ranked among the first K, namely that the i-th pixel point belongs to the pixel points to be optimized; σ(b_i) > t_b expresses that the confidence corresponding to the i-th pixel point is greater than the preset confidence threshold t_b; the indicator is 1 only when both conditions are satisfied, and 0 otherwise; p_{i,ỹ_i} expresses the probability that the prediction result of the i-th pixel point equals the annotation result of the i-th pixel point.
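A minimal PyTorch sketch of this hard-sample-mined loss is given below. It assumes the per-pixel weight w_i is 1 and that σ(b_i) is the sigmoid of the edge branch output; the values of K and t_b are placeholders rather than values taken from the disclosure:

import torch
import torch.nn.functional as F

def edge_semantic_loss(sem_logits, edge_logits, labels, k=10000, t_b=0.8, ignore_index=255):
    # sem_logits:  (N, C, H, W) class scores used for the edge-region semantic prediction
    # edge_logits: (N, 1, H, W) edge scores b_i before the sigmoid
    # labels:      (N, H, W) semantic labels (first annotation information)
    pixel_loss = F.cross_entropy(sem_logits, labels, ignore_index=ignore_index,
                                 reduction='none').view(-1)
    edge_conf = torch.sigmoid(edge_logits).view(-1)      # sigma(b_i)
    k = min(k, pixel_loss.numel())
    # Keep the K pixel points with the largest cross entropy loss (the hard samples) ...
    topk_loss, topk_idx = torch.topk(pixel_loss, k)
    # ... and among them only those whose edge confidence exceeds the threshold t_b
    keep = edge_conf[topk_idx] > t_b
    if keep.sum() == 0:
        return pixel_loss.sum() * 0.0                    # keeps the computation graph, value 0
    return topk_loss[keep].sum() / k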
(III) determining a third loss value based on the prediction semantic segmentation image and the first annotation information of the sample image.
After the first loss value, the second loss value, and the third loss value are calculated by the above method, the first loss value, the second loss value, and the third loss value may be weighted and summed according to their respective weights, so as to obtain the loss value in the current training process.
The training process of the neural network will be described in detail below with reference to specific embodiments.
Referring to fig. 6, which is a schematic diagram of a neural network training process provided by an embodiment of the present disclosure: an image to be segmented (which may be an RGB image) is first input into a deep residual network, which outputs the original feature map F and the low-level feature map F_fine corresponding to the image to be segmented. The original feature map F is then semantically segmented by a semantic segmentation algorithm (shown in fig. 6 as an ASPP module), and the main body feature part and the edge feature part in the original feature map are separated to obtain a main body feature map. The low-level feature map F_fine is then added to the separated edge feature part to obtain an edge feature map. Next, a first loss value L_body of the main body feature map and a second loss value L_edge of the edge feature map are determined respectively, a semantic feature map F_final is determined from the edge feature map and the main body feature map, and a third loss value L_final of the semantic feature map F_final is determined. Finally, the loss value in the training process is calculated from the first loss value L_body, the second loss value L_edge and the third loss value L_final, and the neural network is trained according to the loss value in the training process.
For example, the loss value in the training process may be determined by the following formula:
L = λ1·L_body + λ2·L_edge + λ3·L_final
wherein λ1, λ2 and λ3 respectively represent the weights corresponding to the first loss value, the second loss value and the third loss value; L_body represents the first loss value, L_edge represents the second loss value, L_final represents the third loss value, and L represents the loss value in the training process.
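For illustration only, one supervision step combining the three loss terms might look as follows in PyTorch. The model interface, the use of binary cross entropy for the first edge prediction loss and the λ values are assumptions, and edge_semantic_loss refers to the sketch given earlier:

import torch.nn.functional as F

def training_step(model, image, sem_labels, edge_labels, lambdas=(1.0, 1.0, 1.0)):
    # model is assumed to return (body prediction, edge prediction, final semantic prediction)
    body_pred, edge_pred, final_pred = model(image)
    loss_body = F.cross_entropy(body_pred, sem_labels)                        # first loss value L_body
    loss_edge_bce = F.binary_cross_entropy_with_logits(edge_pred.squeeze(1),
                                                       edge_labels.float())   # first edge prediction loss
    loss_edge_sem = edge_semantic_loss(final_pred, edge_pred, sem_labels)      # second edge prediction loss (sketch above)
    loss_edge = loss_edge_bce + loss_edge_sem                                  # second loss value L_edge (weights omitted)
    loss_final = F.cross_entropy(final_pred, sem_labels)                       # third loss value L_final
    l1, l2, l3 = lambdas
    return l1 * loss_body + l2 * loss_edge + l3 * loss_final                   # L = λ1·L_body + λ2·L_edge + λ3·L_final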
In the training process, the edge feature part and the main body feature part of the original feature map are supervised separately. Compared with adding all loss values together for supervised training, the method provided by the embodiments of the present disclosure can therefore perform targeted training, and the neural network trained in this way has higher segmentation precision.
In addition, an embodiment of the present disclosure further provides an intelligent driving control method, and as shown in fig. 7, a flow diagram of the intelligent driving control method provided in the embodiment of the present disclosure is shown, which includes the following steps:
and 701, acquiring an image acquired by the driving device in the driving process.
The driving device includes, but is not limited to, an autonomous vehicle, a vehicle equipped with an Advanced Driving Assistance System (ADAS), a robot, and the like.
And step 702, performing semantic segmentation on the image.
In a specific implementation, the semantic segmentation may be performed on the image by using the image semantic segmentation method shown in fig. 1.
And step 703, controlling the driving device based on the semantic segmentation result.
When the driving device is controlled, the driving device can be controlled to accelerate, decelerate, turn, brake and the like, or voice prompt information can be played to prompt a driver to control the driving device to accelerate, decelerate, turn, brake and the like.
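As a purely hypothetical illustration of how a segmentation result might drive such a control decision (the class ids, region of interest and threshold below are assumptions made for the example, not values from this disclosure):

import numpy as np

def control_from_segmentation(seg_mask, obstacle_ids=(11, 13), rows_ahead=100, threshold=0.05):
    # seg_mask: (H, W) array of per-pixel class ids produced by the segmentation network
    region = seg_mask[-rows_ahead:, :]                      # image rows closest to the vehicle
    obstacle_ratio = np.isin(region, obstacle_ids).mean()   # fraction of obstacle pixels ahead
    if obstacle_ratio > threshold:
        return "decelerate_or_brake"                        # or play a voice prompt to the driver
    return "maintain_speed"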
It will be understood by those skilled in the art that, in the method of the present disclosure, the order in which the steps are written does not imply a strict order of execution or constitute any limitation on the implementation process; the specific order of execution of the steps should be determined by their functions and possible inherent logic.
Based on the same inventive concept, the embodiment of the present disclosure further provides an image semantic segmentation apparatus corresponding to the image semantic segmentation method, and because the principle of the apparatus in the embodiment of the present disclosure for solving the problem is similar to the image semantic segmentation method in the embodiment of the present disclosure, the implementation of the apparatus may refer to the implementation of the method, and repeated details are not repeated.
Referring to fig. 8, which is a schematic structural diagram of an image semantic segmentation apparatus provided in an embodiment of the present disclosure, the apparatus includes: a feature extraction module 801, a first generation module 802, a second generation module 803, an image segmentation module 804, and a training module 805; wherein:
the feature extraction module 801 is configured to perform first feature extraction on an image to be segmented to obtain an original feature map;
a first generating module 802, configured to generate, based on the original feature map, an offset feature map corresponding to the original feature map, where a value of each feature point in the offset feature map indicates a value that a feature point at a position corresponding to a position of the feature point in the original feature map needs to be offset;
a second generating module 803, configured to generate an edge feature map and a main feature map corresponding to the image to be segmented based on the offset feature map and the original feature map; the edge feature map comprises object edge features in the image to be segmented, and the main feature map comprises object main features in the image to be segmented;
an image segmentation module 804, configured to generate a semantic segmentation image corresponding to the image to be segmented based on the edge feature map and the main body feature map.
In one possible implementation, the first generating module 802, when generating the offset feature map corresponding to the original feature map based on the original feature map, is configured to:
performing feature extraction on the original feature map to generate a depth feature map corresponding to the original feature map;
and generating an offset feature map corresponding to the original feature map according to the original feature map and the depth feature map.
In one possible implementation manner, when performing feature extraction on the original feature map to generate a depth feature map corresponding to the original feature map, the first generating module 802 is configured to:
and performing down-sampling processing on the original feature map, and performing up-sampling processing on the feature map subjected to the down-sampling processing to obtain a depth feature map corresponding to the original feature map.
In one possible implementation, the first generating module 802, when generating an offset feature map corresponding to the original feature map according to the original feature map and the depth feature map, is configured to:
and cascading the original feature map and the depth feature map, and extracting features of the cascaded feature map to obtain the offset feature map.
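A possible PyTorch sketch of this offset-map generation is shown below; the channel sizes and the specific down-sampling and up-sampling operators are assumptions made for the example:

import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetGenerator(nn.Module):
    # Sketch of the offset feature map generation described above; sizes are illustrative.
    def __init__(self, channels: int = 256):
        super().__init__()
        self.down = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)
        # Cascaded (original + depth) features -> a 2-channel offset field (dx, dy)
        self.to_offset = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)

    def forward(self, feat):
        # Down-sample, then up-sample back to the original resolution: the "depth feature map"
        depth = self.down(feat)
        depth = F.interpolate(depth, size=feat.shape[-2:], mode='bilinear', align_corners=False)
        # Cascade the original and depth feature maps and extract features to obtain the offset feature map
        return self.to_offset(torch.cat([feat, depth], dim=1))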
In a possible implementation manner, when generating the edge feature map and the main feature map corresponding to the image to be segmented based on the offset feature map and the original feature map, the second generating module 803 is configured to:
generating a main feature map corresponding to the image to be segmented based on the offset feature map and the original feature map;
and generating an edge feature map corresponding to the image to be segmented based on the original feature map and the main body feature map.
In a possible implementation manner, when generating the main feature map corresponding to the image to be segmented based on the offset feature map and the original feature map, the second generating module 803 is configured to:
shifting each feature point in the original feature map according to the value by which it needs to be offset, as indicated at the corresponding position in the offset feature map, to obtain an intermediate feature map corresponding to the original feature map;
and performing bilinear interpolation calculation on the value of each feature point in the intermediate feature map according to the weight of the corresponding position in the offset feature map to obtain a main feature map corresponding to the original feature map.
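The shifting and bilinear interpolation described above can be sketched with grid_sample; how the warping is actually realised is an assumption of this example, since the disclosure only specifies per-point offsets followed by bilinear interpolation:

import torch
import torch.nn.functional as F

def warp_with_offset(feat, offset):
    # feat:   (N, C, H, W) original feature map
    # offset: (N, 2, H, W) offset feature map, channel 0 = dx, channel 1 = dy (assumed layout)
    n, _, h, w = feat.shape
    # Base sampling grid: each output location starts at its own coordinates
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device), indexing='ij')
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0).expand(n, -1, -1, -1)
    # Shift every feature point by the predicted offset
    coords = base + offset
    # Normalise to [-1, 1] as required by grid_sample, then sample bilinearly
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)        # (N, H, W, 2)
    return F.grid_sample(feat, grid, mode='bilinear', align_corners=True)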
In a possible implementation manner, when generating an edge feature map corresponding to the image to be segmented based on the original feature map and the main body feature map, the second generating module 803 is configured to:
and subtracting the values of the feature points at the corresponding positions of the original feature map and the main feature map, and generating an edge feature map corresponding to the image to be segmented according to the value obtained after subtraction.
In a possible implementation, the feature extraction module 801 is further configured to:
performing second feature extraction on the image to be segmented to obtain a low-level feature map; the number of convolutions corresponding to the low-level feature map is smaller than the number of convolutions corresponding to the original feature map;
the image segmentation module 804 is configured to, when the values of the feature points at the corresponding positions of the original feature map and the main feature map are subtracted, and an edge feature map corresponding to the image to be segmented is generated according to the value obtained by the subtraction, be configured to:
subtracting the values of the feature points of the original feature map and the main feature map at the corresponding positions to obtain an initial edge feature map;
and cascading the low-level feature map and the initial edge feature map, and performing feature extraction on the cascaded feature map to obtain an edge feature map corresponding to the image to be segmented.
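For illustration, this subtraction-and-cascade step might look as follows; the channel sizes are assumptions, and the low-level feature map is assumed to have already been resized to the resolution of the original feature map:

import torch
import torch.nn as nn

class EdgeFeatureHead(nn.Module):
    # Sketch of the edge-branch fusion described above; sizes are illustrative.
    def __init__(self, channels: int = 256, low_channels: int = 48):
        super().__init__()
        self.fuse = nn.Conv2d(channels + low_channels, channels, kernel_size=3, padding=1)

    def forward(self, original_feat, body_feat, low_level_feat):
        # Subtract the main body part from the original features: what remains is edge information
        initial_edge = original_feat - body_feat
        # Cascade with the low-level feature map and extract features from the cascaded result
        return self.fuse(torch.cat([initial_edge, low_level_feat], dim=1))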
In a possible implementation manner, the image segmentation module 804, when generating a semantic segmentation image corresponding to the image to be segmented based on the edge feature map and the main body feature map, is configured to:
adding the values of the feature points of the edge feature map and the main body feature map at the corresponding positions to obtain a semantic feature map corresponding to the image to be segmented;
and performing convolution operation on the semantic feature map to obtain a semantic segmentation image corresponding to the image to be segmented.
In a possible implementation manner, the semantic segmentation image is obtained by processing the image to be segmented through a neural network;
the apparatus further comprises a training module 805, the training module 805 is configured to train the neural network according to the following method:
acquiring a sample image with first labeling information and second labeling information, wherein the first labeling information is a label added to a pixel region of a target object in the sample image, and the second labeling information is a label added to an edge of the target object in the sample image;
inputting the sample image into the neural network to obtain an edge feature map, a main feature map and a semantic feature map corresponding to the sample image;
determining a predicted edge image corresponding to the sample image based on the edge feature map; determining a prediction subject image corresponding to the sample image based on the subject feature map; determining a predicted semantic segmentation image corresponding to the sample image based on the semantic feature map;
and determining a loss value in the training process based on the predicted edge image, the predicted main body image, the predicted semantic segmentation image and the first annotation information and the second annotation information of the sample image, and training the neural network based on the loss value.
In one possible embodiment, the training module 805, when determining the loss value in the training process based on the predicted edge image, the predicted subject image, the predicted semantic segmentation image, and the first annotation information and the second annotation information of the sample image, is configured to:
determining a first loss value based on the prediction subject image and the first annotation information of the sample image; and
determining a second loss value based on the predicted edge image, the first annotation information and the second annotation information of the sample image; and
determining a third loss value based on the first annotation information of the prediction semantic segmentation image and the sample image;
and determining a loss value in the training process based on the first loss value, the second loss value and the third loss value.
In one possible embodiment, the training module 805, when determining the second loss value based on the predicted edge image, the first label information of the sample image and the second label information, is configured to:
determining a first edge prediction loss in the training process based on the predicted edge image and second labeling information of the sample image; determining a second edge prediction loss in the training process based on the predicted edge image and the first marking information of the sample image;
and carrying out weighted summation on the first edge prediction loss and the second edge prediction loss to obtain a second loss value.
In one possible embodiment, when determining the loss value in the current training process based on the first loss value, the second loss value, and the third loss value, the training module 805 is configured to:
and carrying out weighted summation on the first loss value, the second loss value and the third loss value to obtain a loss value in the training process.
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
Based on the same inventive concept, an intelligent driving control device corresponding to the intelligent driving control method is also provided in the embodiments of the present disclosure, and because the principle of solving the problem of the device in the embodiments of the present disclosure is similar to that of the intelligent driving control method in the embodiments of the present disclosure, the implementation of the device can refer to the implementation of the method, and repeated details are not repeated.
Referring to fig. 9, which is a schematic diagram of an architecture of an intelligent driving control apparatus provided in an embodiment of the present disclosure, the apparatus includes an acquisition module 901, an image segmentation module 902, and a control module 903; specifically:
the acquisition module is used for acquiring images acquired by the driving device in the driving process;
the image segmentation module is used for performing semantic segmentation on the image by the image semantic segmentation method provided by the embodiment of the disclosure;
and the control module is used for controlling the running device based on the semantic segmentation result.
Based on the same technical concept, the embodiment of the present application also provides a computer device. Referring to fig. 10, a schematic structural diagram of a computer device 1000 provided in the embodiment of the present application includes a processor 1001, a memory 1002, and a bus 1003. The memory 1002 is used for storing execution instructions and includes a memory 10021 and an external memory 10022; the memory 10021, also referred to as an internal memory, is used for temporarily storing operation data in the processor 1001 and data exchanged with the external memory 10022 such as a hard disk, and the processor 1001 exchanges data with the external memory 10022 through the memory 10021. When the computer device 1000 operates, the processor 1001 and the memory 1002 communicate through the bus 1003, so that the processor 1001 executes the following instructions:
performing first feature extraction on an image to be segmented to obtain an original feature map;
generating an offset feature map corresponding to the original feature map based on the original feature map, wherein the value of each feature point in the offset feature map represents a value which needs to be offset for the feature point at a position corresponding to the position of the feature point in the original feature map;
generating an edge feature map and a main feature map corresponding to the image to be segmented based on the offset feature map and the original feature map; the edge feature map comprises object edge features in the image to be segmented, and the main feature map comprises object main features in the image to be segmented;
and generating a semantic segmentation image corresponding to the image to be segmented based on the edge feature map and the main body feature map.
Based on the same technical concept, the embodiment of the present application also provides a computer device. Referring to fig. 11, a schematic structural diagram of a computer device 1100 provided in the embodiment of the present application includes a processor 1101, a memory 1102, and a bus 1103. The memory 1102 is used for storing execution instructions and includes a memory 11021 and an external memory 11022; the memory 11021, also referred to as an internal memory, is used for temporarily storing operation data in the processor 1101 and data exchanged with the external memory 11022 such as a hard disk, and the processor 1101 exchanges data with the external memory 11022 through the memory 11021. When the computer device 1100 operates, the processor 1101 and the memory 1102 communicate through the bus 1103, so that the processor 1101 executes the following instructions:
acquiring an image acquired by a driving device in the driving process;
performing semantic segmentation on the image by using the image semantic segmentation method provided in the above embodiments;
and controlling the driving device based on the semantic segmentation result.
The disclosed embodiment also provides a computer readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program executes the steps of the image semantic segmentation and intelligent driving control method in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The computer program product of the image semantic segmentation method provided by the embodiment of the present disclosure includes a computer-readable storage medium storing a program code, where instructions included in the program code may be used to execute the steps of the image semantic segmentation and intelligent driving control method in the above method embodiment, which may be specifically referred to in the above method embodiment and are not described herein again.
The embodiments of the present disclosure also provide a computer program, which when executed by a processor implements any one of the methods of the foregoing embodiments. The computer program product may be embodied in hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that: the above-mentioned embodiments are merely specific embodiments of the present disclosure, which are used for illustrating the technical solutions of the present disclosure and not for limiting the same, and the scope of the present disclosure is not limited thereto, and although the present disclosure is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive of the technical solutions described in the foregoing embodiments or equivalent technical features thereof within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure, and should be construed as being included therein. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (17)

1. An image semantic segmentation method, comprising:
performing first feature extraction on an image to be segmented to obtain an original feature map;
generating an offset feature map corresponding to the original feature map based on the original feature map, wherein the value of each feature point in the offset feature map represents a value which needs to be offset for the feature point at a position corresponding to the position of the feature point in the original feature map;
generating an edge feature map and a main feature map corresponding to the image to be segmented based on the offset feature map and the original feature map; the edge feature map comprises object edge features in the image to be segmented, and the main feature map comprises object main features in the image to be segmented;
and generating a semantic segmentation image corresponding to the image to be segmented based on the edge feature map and the main body feature map.
2. The method according to claim 1, wherein the generating an offset feature map corresponding to the original feature map based on the original feature map comprises:
performing feature extraction on the original feature map to generate a depth feature map corresponding to the original feature map;
and generating an offset feature map corresponding to the original feature map according to the original feature map and the depth feature map.
3. The method according to claim 2, wherein the performing feature extraction on the original feature map to generate a depth feature map corresponding to the original feature map comprises:
and performing down-sampling processing on the original feature map, and performing up-sampling processing on the feature map subjected to the down-sampling processing to obtain a depth feature map corresponding to the original feature map.
4. The method according to claim 2, wherein the generating an offset feature map corresponding to the original feature map according to the original feature map and the depth feature map comprises:
and cascading the original feature map and the depth feature map, and extracting features of the cascaded feature map to obtain the offset feature map.
5. The method according to claim 1, wherein the generating an edge feature map and a main feature map corresponding to the image to be segmented based on the offset feature map and the original feature map comprises:
generating a main feature map corresponding to the image to be segmented based on the offset feature map and the original feature map;
and generating an edge feature map corresponding to the image to be segmented based on the original feature map and the main body feature map.
6. The method according to claim 5, wherein the generating a main feature map corresponding to the image to be segmented based on the offset feature map and the original feature map comprises:
shifting each feature point in the original feature map according to the value by which it needs to be offset, as indicated at the corresponding position in the offset feature map, to obtain an intermediate feature map corresponding to the original feature map;
and performing bilinear interpolation calculation on the value of each feature point in the intermediate feature map according to the weight of the corresponding position in the offset feature map to obtain a main feature map corresponding to the original feature map.
7. The method according to claim 5, wherein the generating an edge feature map corresponding to the image to be segmented based on the original feature map and the main body feature map comprises:
and subtracting the values of the feature points at the corresponding positions of the original feature map and the main feature map, and generating an edge feature map corresponding to the image to be segmented according to the value obtained after subtraction.
8. The method of claim 7, further comprising:
performing second feature extraction on the image to be segmented to obtain a low-level feature map; the number of convolutions corresponding to the low-level feature map is smaller than the number of convolutions corresponding to the original feature map;
the step of subtracting the values of the feature points at the corresponding positions of the original feature map and the main feature map, and generating an edge feature map corresponding to the image to be segmented according to the value obtained after subtraction includes:
subtracting the values of the feature points of the original feature map and the main feature map at the corresponding positions to obtain an initial edge feature map;
and cascading the low-level feature map and the initial edge feature map, and performing feature extraction on the cascaded feature map to obtain an edge feature map corresponding to the image to be segmented.
9. The method according to claim 1, wherein the generating a semantic segmentation image corresponding to the image to be segmented based on the edge feature map and the main body feature map comprises:
adding the values of the feature points of the edge feature map and the main body feature map at the corresponding positions to obtain a semantic feature map corresponding to the image to be segmented;
and performing convolution operation on the semantic feature map to obtain a semantic segmentation image corresponding to the image to be segmented.
10. The method according to claim 1, wherein the semantic segmentation image is obtained by processing the image to be segmented through a neural network;
the neural network is obtained by training by adopting the following method:
acquiring a sample image with first labeling information and second labeling information, wherein the first labeling information is a label added to a pixel region of a target object in the sample image, and the second labeling information is a label added to an edge of the target object in the sample image;
inputting the sample image into the neural network to obtain an edge feature map, a main feature map and a semantic feature map corresponding to the sample image;
determining a predicted edge image corresponding to the sample image based on the edge feature map; determining a prediction subject image corresponding to the sample image based on the subject feature map; determining a predicted semantic segmentation image corresponding to the sample image based on the semantic feature map;
and determining a loss value in the training process based on the predicted edge image, the predicted main body image, the predicted semantic segmentation image and the first annotation information and the second annotation information of the sample image, and training the neural network based on the loss value.
11. The method of claim 10, wherein the determining the loss value in the training process based on the predicted edge image, the predicted subject image, the predicted semantic segmentation image, and the first annotation information and the second annotation information of the sample image comprises:
determining a first loss value based on the prediction subject image and the first annotation information of the sample image; and
determining a second loss value based on the predicted edge image, the first annotation information and the second annotation information of the sample image; and
determining a third loss value based on the first annotation information of the prediction semantic segmentation image and the sample image;
and determining a loss value in the training process based on the first loss value, the second loss value and the third loss value.
12. The method of claim 11, wherein determining a second loss value based on the predicted edge image, the first label information and the second label information of the sample image comprises:
determining a first edge prediction loss in the training process based on the predicted edge image and second labeling information of the sample image; determining a second edge prediction loss in the training process based on the predicted edge image and the first marking information of the sample image;
and carrying out weighted summation on the first edge prediction loss and the second edge prediction loss to obtain a second loss value.
13. An intelligent travel control method, characterized by comprising:
acquiring an image acquired by a driving device in the driving process;
semantically segmenting the image by an image semantic segmentation method according to any one of claims 1 to 12;
and controlling the driving device based on the semantic segmentation result.
14. An image semantic segmentation apparatus, comprising:
the characteristic extraction module is used for carrying out first characteristic extraction on the image to be segmented to obtain an original characteristic image;
a first generating module, configured to generate, based on the original feature map, an offset feature map corresponding to the original feature map, where a value of each feature point in the offset feature map indicates a value that a feature point at a position corresponding to a position of the feature point in the original feature map needs to be offset;
a second generation module, configured to generate an edge feature map and a main feature map corresponding to the image to be segmented based on the offset feature map and the original feature map; the edge feature map comprises object edge features in the image to be segmented, and the main feature map comprises object main features in the image to be segmented;
and the image segmentation module is used for generating a semantic segmentation image corresponding to the image to be segmented based on the edge feature map and the main body feature map.
15. An intelligent travel control device, comprising:
the acquisition module is used for acquiring images acquired by the driving device in the driving process;
an image segmentation module for semantically segmenting the image by an image semantic segmentation method according to any one of claims 1 to 12;
and the control module is used for controlling the running device based on the semantic segmentation result.
16. A computer device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when a computer device is running, the machine-readable instructions, when executed by the processor, performing the steps of the image semantic segmentation method according to any one of claims 1 to 12 or performing the steps of the intelligent driving control method according to claim 13.
17. A computer-readable storage medium, characterized in that a computer program is stored thereon, which, when being executed by a processor, performs the steps of the image semantic segmentation method according to any one of claims 1 to 12 or performs the steps of the intelligent driving control method according to claim 13.
CN202010331448.7A 2020-04-24 2020-04-24 Image semantic segmentation and intelligent driving control method and device Active CN111523548B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010331448.7A CN111523548B (en) 2020-04-24 2020-04-24 Image semantic segmentation and intelligent driving control method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010331448.7A CN111523548B (en) 2020-04-24 2020-04-24 Image semantic segmentation and intelligent driving control method and device

Publications (2)

Publication Number Publication Date
CN111523548A true CN111523548A (en) 2020-08-11
CN111523548B CN111523548B (en) 2023-11-28

Family

ID=71904468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010331448.7A Active CN111523548B (en) 2020-04-24 2020-04-24 Image semantic segmentation and intelligent driving control method and device

Country Status (1)

Country Link
CN (1) CN111523548B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019100946A1 (en) * 2017-11-22 2019-05-31 北京市商汤科技开发有限公司 Object detection method, device, and apparatus
CN108537292A (en) * 2018-04-10 2018-09-14 上海白泽网络科技有限公司 Semantic segmentation network training method, image, semantic dividing method and device
CN108876793A (en) * 2018-04-13 2018-11-23 北京迈格威科技有限公司 Semantic segmentation methods, devices and systems and storage medium
CN109784424A (en) * 2019-03-26 2019-05-21 腾讯科技(深圳)有限公司 A kind of method of image classification model training, the method and device of image procossing
CN110490203A (en) * 2019-07-05 2019-11-22 平安科技(深圳)有限公司 Image partition method and device, electronic equipment and computer readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI SHUAI; GUO YANYAN; WEI XIA: "Feature-fusion semantic segmentation of remote sensing images based on downsampling", Journal of Test and Measurement Technology *
ZHAN QILIANG; CHEN SHENGYONG; HU HAIGEN; LI XIAOXIN; ZHOU QIANWEI: "An instance segmentation scheme combining multiple image segmentation algorithms", Journal of Chinese Computer Systems *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112465834A (en) * 2020-11-26 2021-03-09 中科麦迪人工智能研究院(苏州)有限公司 Blood vessel segmentation method and device
CN112465834B (en) * 2020-11-26 2024-05-24 中科麦迪人工智能研究院(苏州)有限公司 Blood vessel segmentation method and device
CN112669338A (en) * 2021-01-08 2021-04-16 北京市商汤科技开发有限公司 Image segmentation method and device, electronic equipment and storage medium
CN112669338B (en) * 2021-01-08 2023-04-07 北京市商汤科技开发有限公司 Image segmentation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111523548B (en) 2023-11-28

Similar Documents

Publication Publication Date Title
CN110232394B (en) Multi-scale image semantic segmentation method
CN105701508B (en) Global local optimum model and conspicuousness detection algorithm based on multistage convolutional neural networks
CN108345827B (en) Method, system and neural network for identifying document direction
CN112116599B (en) Sputum smear tubercle bacillus semantic segmentation method and system based on weak supervised learning
CN111615702B (en) Method, device and equipment for extracting structured data from image
CN111488985B (en) Deep neural network model compression training method, device, equipment and medium
CN112200889A (en) Sample image generation method, sample image processing method, intelligent driving control method and device
CN111523548A (en) Image semantic segmentation and intelligent driving control method and device
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN112651364B (en) Image processing method, device, electronic equipment and storage medium
CN115631344B (en) Target detection method based on feature self-adaptive aggregation
CN112381227B (en) Neural network generation method and device, electronic equipment and storage medium
CN112906794A (en) Target detection method, device, storage medium and terminal
CN111382759A (en) Pixel level classification method, device, equipment and storage medium
CN113947764A (en) Image processing method, device, equipment and storage medium
CN112241736A (en) Text detection method and device
CN111401360B (en) Method and system for optimizing license plate detection model, license plate detection method and system
CN112132867B (en) Remote sensing image change detection method and device
CN112529897A (en) Image detection method and device, computer equipment and storage medium
CN112380978A (en) Multi-face detection method, system and storage medium based on key point positioning
CN115393868B (en) Text detection method, device, electronic equipment and storage medium
CN111144361A (en) Road lane detection method based on binaryzation CGAN network
CN112801045B (en) Text region detection method, electronic equipment and computer storage medium
CN112801960B (en) Image processing method and device, storage medium and electronic equipment
CN112966600B (en) Self-adaptive multi-scale context aggregation method for crowded population counting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant