CN110276765B - Image panorama segmentation method based on multitask learning deep neural network - Google Patents

Image panorama segmentation method based on multitask learning deep neural network

Info

Publication number
CN110276765B
CN110276765B (application CN201910544228.XA)
Authority
CN
China
Prior art keywords
segmentation
candidate
graph
network
semantic
Prior art date
Legal status
Active
Application number
CN201910544228.XA
Other languages
Chinese (zh)
Other versions
CN110276765A (en)
Inventor
白双
王聪聪
李沛安
Current Assignee
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN201910544228.XA
Publication of CN110276765A
Application granted
Publication of CN110276765B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/10: Segmentation; Edge detection
    • G06T 7/11: Region-based segmentation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/10: Segmentation; Edge detection
    • G06T 7/136: Segmentation; Edge detection involving thresholding
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10004: Still image; Photographic image
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20081: Training; Learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20092: Interactive image processing based on input by user
    • G06T 2207/20104: Interactive definition of region of interest [ROI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an image panorama segmentation method based on a multitask learning deep neural network, comprising the following steps: inputting an image into a backbone convolutional neural network for feature extraction to obtain a corresponding feature map; inputting the feature map into a semantic segmentation network head and a region proposal network head, respectively, to obtain a semantic segmentation map and a plurality of candidate regions of the image; screening the candidate regions according to the semantic segmentation map; inputting the screened candidate regions to an object recognition network head and a bounding box offset prediction network head, respectively, for classification and bounding box correction; inputting the classified and bounding-box-corrected candidate regions into an instance segmentation network head to obtain an instance segmentation map; fusing the semantic segmentation map and the instance segmentation map to obtain a panoramic segmentation map; training and optimizing the panoramic segmentation network through a training optimization mechanism to obtain an optimized image panorama segmentation model; and performing panoramic segmentation on images with the optimized model. The method completes the semantic and instance segmentation tasks of panoramic segmentation simultaneously while reducing computation.

Description

Image panorama segmentation method based on multitask learning deep neural network
Technical Field
The invention relates to the technical field of computer vision recognition, in particular to an image panorama segmentation method based on a multitask learning deep neural network.
Background
As computer vision research and deep learning methods have advanced, deep-learning-based technologies such as image classification, semantic segmentation, and instance segmentation have improved greatly. Semantic segmentation assigns a semantic class label to each pixel in an image, but cannot distinguish different object instances of the same semantic class within an image. Instance segmentation performs pixel-level segmentation of object instances in an image, but does not cover the many uncountable objects that lack an explicit shape. The panoramic segmentation task unifies the semantic segmentation and instance segmentation tasks and is very important for applications such as autonomous driving and intelligent robots that depend on visual perception of image scenes.
Traditional panorama segmentation techniques generally execute the semantic segmentation and instance segmentation tasks independently and then fuse the two results to obtain a panoramic segmentation result. Such methods rely on two independent networks and are computationally expensive. A multitask network segmentation method that can complete the semantic and instance segmentation tasks of panoramic segmentation simultaneously while reducing computation is therefore needed.
Disclosure of Invention
The invention provides an image panorama segmentation method based on a multitask learning deep neural network to solve the above problems.
To achieve this purpose, the invention adopts the following technical scheme.
The invention provides an image panorama segmentation method based on a multitask learning deep neural network, comprising:
inputting an image into a backbone convolutional neural network for feature extraction to obtain a corresponding feature map;
inputting the feature map into a semantic segmentation network head and a region proposal network head, respectively, to obtain a semantic segmentation map and a plurality of candidate regions of the image;
screening the candidate regions according to the semantic segmentation map;
inputting the screened candidate regions to an object recognition network head and a bounding box offset prediction network head, respectively, for classification and bounding box correction;
inputting the classified and bounding-box-corrected candidate regions into an instance segmentation network head to obtain an instance segmentation map;
fusing the semantic segmentation map and the instance segmentation map to obtain an image panorama segmentation map;
training and optimizing the panoramic segmentation network through a training optimization mechanism according to the image panorama segmentation map to obtain an optimized image panorama segmentation model;
and performing panoramic segmentation on images with the optimized image panorama segmentation model.
Preferably, inputting the feature map into a semantic segmentation network head and a region proposal network head, respectively, to obtain a semantic segmentation map and candidate regions of the image comprises:
inputting the feature map into the semantic segmentation network head and generating pixel-level class predictions through full convolution operations, thereby obtaining the semantic segmentation map of the image;
and inputting the feature map into the region proposal network head, generating candidate regions of different sizes and aspect ratios through multiple convolution operations, and obtaining the category of each candidate region and the coordinates of its bounding box.
Preferably, screening the candidate regions according to the semantic segmentation map comprises:
determining the region of the semantic segmentation map corresponding to the position of each candidate region according to its bounding box coordinates;
for each candidate region, computing the area of the pixels belonging to countable objects within the corresponding semantic segmentation map region, and then computing the ratio of that area to the area of the candidate region;
and judging whether the area ratio of the candidate region falls within a certain threshold range, and deleting the candidate region if it does not.
Preferably, the threshold ranges from 0.5 to 0.7.
Preferably, the method further comprises preliminarily screening the candidate regions before screening them according to the semantic segmentation map, removing candidate regions that do not meet the rules.
Preferably, inputting the screened candidate regions to the object recognition network head and the bounding box offset prediction network head, respectively, for classification and bounding box correction comprises:
extracting the feature maps corresponding to the screened candidate regions from the feature map;
performing a region-of-interest pooling operation on the screened candidate region feature maps to obtain pooled candidate regions of a fixed size;
inputting the pooled candidate regions to the object recognition network head and the bounding box offset prediction network head, respectively, to obtain the category and the bounding box coordinate offset of each pooled candidate region;
and correcting the bounding boxes of the pooled candidate regions according to their categories and bounding box coordinate offsets.
Preferably, inputting the classified and bounding-box-corrected candidate regions into an instance segmentation network head to obtain an instance segmentation map comprises:
inputting the feature map and the instance regions into the instance segmentation network head and executing the same operations as the semantic segmentation network head to obtain instance segmentation binary distribution features;
and acquiring the target instance mask corresponding to each instance region, thereby generating the instance segmentation map.
Preferably, fusing the semantic segmentation map and the instance segmentation map to obtain an image panorama segmentation map comprises:
performing convolution operations on the feature maps generated by the backbone network to generate two groups of feature maps, which are concatenated with the semantic segmentation map and the instance segmentation map, respectively;
applying convolution operations and a sigmoid activation function to the concatenated semantic segmentation map and instance segmentation map, respectively, to obtain an instance segmentation soft-threshold distribution feature map and a semantic segmentation soft-threshold distribution feature map;
taking the element-wise product of the instance segmentation soft-threshold distribution feature map and the instance segmentation map, and likewise of the semantic segmentation soft-threshold distribution feature map and the semantic segmentation map;
concatenating the semantic segmentation map and instance segmentation map after the element-wise products, preliminarily fusing them with a convolution operation, then extracting features with dilated convolutions of different dilation rates, and concatenating the extracted results;
further fusing the concatenated results with a convolution operation and thresholding the fused result to obtain a gating value distribution map of 0-1 values;
and, according to the gating value distribution map, selecting the semantic segmentation result or the instance segmentation result for each pixel based on its 0-1 value to obtain the panoramic segmentation map.
Preferably, the training optimization mechanism comprises:
1) training the semantic segmentation network head and the region proposal network head with the objective function $L_{step\text{-}1} = L_{seg} + L_{rpn}$;
2) training the object recognition network head, the bounding box offset prediction network head, and the instance segmentation network head with the objective function $L_{step\text{-}2} = L_{cls\text{-}m} + L_{reg} + L_{ins}$;
3) training the back-end fusion network that generates the panoramic segmentation map with the binary cross-entropy loss function as the objective function;
and summing the objective functions of the three steps to obtain a unified objective function, and optimizing the model based on the unified objective function to obtain the optimized panoramic segmentation model.
Preferably, the backbone convolutional neural network uses a dilated (hole) convolution structure or an encoding-decoding structure.
According to the technical scheme provided by the image panorama segmentation method based on the multitask learning deep neural network, a unified multitask network is built that realizes image semantic segmentation and instance segmentation simultaneously and then performs panoramic segmentation. The semantic segmentation result assists the execution of the instance segmentation task, which further improves instance segmentation accuracy, so high-quality semantic segmentation and instance segmentation results can be obtained; the final panoramic segmentation result is then obtained through fusion at the back end.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flowchart of an image panorama segmentation method based on a multitask learning deep neural network according to an embodiment;
fig. 2 is a schematic structural diagram of an image panorama segmentation method based on a multitask learning deep neural network according to an embodiment;
FIG. 3 is a schematic diagram of an implementation of an image panorama segmentation method based on a multitask learning deep neural network according to an embodiment;
fig. 4 is an implementation schematic diagram of the fusion of the semantic segmentation map and the instance segmentation map provided in the embodiment.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
To facilitate understanding of the embodiments of the present invention, the following description will be further explained by taking specific embodiments as examples with reference to the accompanying drawings.
Examples
Meaning of the panorama segmentation task: panoramic segmentation performs semantic classification and instance ID labeling for each pixel in an image. For semantic categories corresponding to uncountable objects, all pixels belonging to a given semantic category have the same semantic category label and the same instance ID; for semantic categories corresponding to countable objects, pixels belonging to a given object category have the same semantic category label and are assigned different instance IDs according to the different object instances to which they belong.
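For illustration only, the following minimal sketch shows the labeling convention just described as a pair of per-pixel maps: a semantic class map and an instance ID map. The class names and IDs here are made up and are not from the patent.

```python
# A tiny illustration of the panoptic labeling convention: each pixel carries
# a semantic class and an instance ID. Uncountable "stuff" classes (here:
# road) share one instance ID; countable "thing" classes (here: car) get a
# distinct ID per object. All class names and IDs are assumptions.
import numpy as np

semantic = np.full((4, 6), fill_value=0)   # class 0 = road (uncountable)
instance = np.zeros((4, 6), dtype=int)     # stuff pixels share instance ID 0
semantic[1:3, 0:2] = 1                     # class 1 = car (countable), object A
instance[1:3, 0:2] = 1                     # ... receives instance ID 1
semantic[1:3, 4:6] = 1                     # another car, same semantic class
instance[1:3, 4:6] = 2                     # ... but a different instance ID
print(semantic, instance, sep="\n\n")
```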
Fig. 1 is a flowchart of an image panorama segmentation method based on a multitask learning deep neural network provided in this embodiment, fig. 2 is a schematic structural diagram of the image panorama segmentation method based on the multitask learning deep neural network provided in this embodiment, and fig. 3 is an implementation schematic diagram of the image panorama segmentation method based on the multitask learning deep neural network provided in this embodiment, referring to fig. 1, fig. 2, and fig. 3, the method includes the following steps:
and S1, inputting the image into the backbone convolutional neural network for feature extraction to obtain a corresponding feature map.
Preferably, the backbone convolutional neural network uses a dilated (hole) convolution structure or an encoding-decoding structure. Such a structure produces feature maps with richer semantic information and higher resolution, enhancing the robustness of identifying both larger and smaller objects.
Illustratively, an encoding-decoding network architecture is adopted for the backbone convolutional neural network, in which the encoder consists of the first four modules of ResNeXt-101 and the decoder consists of two stages of decoding modules based on bilinear upsampling and convolution operations. The backbone first uses the encoder to extract semantically rich feature maps from the image and then gradually restores their spatial information through the decoder.
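A minimal PyTorch sketch of such a backbone follows. It is an illustrative reconstruction rather than the patented implementation: the encoder reuses the stem and four residual stages of torchvision's ResNeXt-101, and the decoder channel widths are assumptions.

```python
# A minimal sketch of the encoder-decoder backbone, assuming torchvision's
# ResNeXt-101 as encoder and two bilinear-upsample + conv decoding stages.
import torch
import torch.nn as nn
import torchvision


class EncoderDecoderBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        resnext = torchvision.models.resnext101_32x8d(weights=None)
        # Encoder: stem plus the four residual stages of ResNeXt-101.
        self.encoder = nn.Sequential(
            resnext.conv1, resnext.bn1, resnext.relu, resnext.maxpool,
            resnext.layer1, resnext.layer2, resnext.layer3, resnext.layer4,
        )

        # Decoder: two stages of bilinear upsampling followed by convolution,
        # gradually restoring spatial resolution of the semantic features.
        def decode_stage(cin, cout):
            return nn.Sequential(
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(cin, cout, kernel_size=3, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )

        self.decoder = nn.Sequential(decode_stage(2048, 512), decode_stage(512, 256))

    def forward(self, image):
        return self.decoder(self.encoder(image))


features = EncoderDecoderBackbone()(torch.randn(1, 3, 512, 512))
print(features.shape)  # torch.Size([1, 256, 64, 64])
```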
S2, the feature map is input into the semantic segmentation network head and the region proposal network head, respectively, to obtain the semantic segmentation map and a plurality of candidate regions of the image.
The feature map is input to the semantic segmentation network head, and pixel-level class predictions are generated through full convolution operations to obtain the semantic segmentation map of the image. The semantic segmentation network head has a fully convolutional structure consisting of two convolutional layers, two deconvolution layers, a 1x1 convolutional layer, and a softmax layer; after passing through this structure, the feature map yields pixel-level class probability predictions, from which the semantic segmentation map of the input image is obtained.
The feature map is also input into the region proposal network head, which generates candidate regions of different sizes and aspect ratios, together with their bounding box coordinates, through multiple convolution operations. The region proposal network head consists of a Region Proposal Network (RPN), to which the feature map is input. A sketch of both heads is given after the following remarks.
Of course, the fully convolutional structure may take other forms, which are not limited herein.
The configurations of the semantic segmentation network head and the region proposal network head are likewise not limited to the above; any other configuration usable as a semantic segmentation network head or a region proposal network head falls within the scope of the present invention.
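For illustration, a hedged PyTorch sketch of the two heads follows: a fully convolutional semantic segmentation head with the layer sequence described above, and a minimal RPN-style proposal head. Layer widths, the class count, and the anchor count are assumptions, not values from the patent.

```python
# Sketches of the semantic segmentation head (two convs, two deconvs, a 1x1
# conv, softmax) and a minimal RPN-style head with per-anchor objectness and
# box outputs. All channel sizes and the anchor count are assumptions.
import torch
import torch.nn as nn


class SemanticHead(nn.Module):
    def __init__(self, in_ch=256, num_classes=19):
        super().__init__()
        self.fcn = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 128, 2, stride=2), nn.ReLU(inplace=True),
            nn.Conv2d(128, num_classes, 1),           # 1x1 conv to class scores
        )

    def forward(self, feats):
        return torch.softmax(self.fcn(feats), dim=1)  # pixel-level class probs


class RegionProposalHead(nn.Module):
    def __init__(self, in_ch=256, num_anchors=9):
        super().__init__()
        self.shared = nn.Conv2d(in_ch, 256, 3, padding=1)
        self.objectness = nn.Conv2d(256, num_anchors, 1)      # countable-object score
        self.box_deltas = nn.Conv2d(256, num_anchors * 4, 1)  # bounding box offsets

    def forward(self, feats):
        h = torch.relu(self.shared(feats))
        return torch.sigmoid(self.objectness(h)), self.box_deltas(h)


feats = torch.randn(1, 256, 64, 64)
seg = SemanticHead()(feats)
scores, deltas = RegionProposalHead()(feats)
print(seg.shape, scores.shape, deltas.shape)
```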
S3, screening the candidate regions according to the semantic segmentation map.
Preferably, before this step the method further includes preliminarily screening the plurality of candidate regions to remove those that do not meet the rules. Specifically: first, candidate regions that are too small or that extend beyond the image boundary are removed; second, the regions are sorted in descending order of the category confidence scores obtained from the RPN, and a fixed number of candidate regions are kept; then the Non-Maximum Suppression (NMS) algorithm is used to eliminate overlapping candidate regions; finally, some of the high-scoring candidate regions are retained according to their category confidence scores.
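A minimal sketch of this preliminary screening follows, using torchvision's NMS; the size limit, kept counts, and IoU threshold are illustrative assumptions.

```python
# Preliminary candidate screening, assuming boxes in (x1, y1, x2, y2) format
# with RPN confidence scores. The numeric limits are illustrative values.
import torch
from torchvision.ops import nms


def preliminary_screen(boxes, scores, img_w, img_h,
                       min_size=8.0, pre_nms_keep=2000,
                       iou_thresh=0.7, post_nms_keep=300):
    # 1) Remove boxes that are too small or cross the image boundary.
    w, h = boxes[:, 2] - boxes[:, 0], boxes[:, 3] - boxes[:, 1]
    inside = (boxes[:, 0] >= 0) & (boxes[:, 1] >= 0) & \
             (boxes[:, 2] <= img_w) & (boxes[:, 3] <= img_h)
    keep = inside & (w >= min_size) & (h >= min_size)
    boxes, scores = boxes[keep], scores[keep]
    # 2) Sort by descending confidence and keep a fixed number.
    order = scores.argsort(descending=True)[:pre_nms_keep]
    boxes, scores = boxes[order], scores[order]
    # 3) Non-Maximum Suppression eliminates overlapping candidates.
    keep = nms(boxes, scores, iou_thresh)
    # 4) Retain the high-scoring survivors.
    keep = keep[:post_nms_keep]
    return boxes[keep], scores[keep]


boxes = torch.tensor([[10., 10., 60., 60.], [12., 12., 62., 62.], [100., 100., 180., 160.]])
scores = torch.tensor([0.9, 0.8, 0.7])
print(preliminary_screen(boxes, scores, img_w=512, img_h=512))
```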
Screening the candidate regions according to the semantic segmentation map comprises the following steps. First, the region corresponding to each candidate region's position in the semantic segmentation map is determined from its bounding box coordinates. Then the area of the pixels belonging to countable objects in that semantic segmentation region is computed: within the region, a pixel position is set to '1' if the pixel's category belongs to a countable object and to '0' otherwise, and the total area of all pixels with value '1' in the region is counted. Finally, the ratio of this area to the area of the corresponding candidate region is computed, and if the ratio is smaller than a certain threshold T1, the candidate region is discarded.
Preferably, the threshold T1 ranges from 0.5 to 0.7.
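The following sketch illustrates the semantic-guided screening under stated assumptions: boxes in (x1, y1, x2, y2) pixel coordinates, a per-pixel semantic label map, and a made-up set of countable ("thing") class IDs.

```python
# Semantic-guided screening: for each candidate box, compute the fraction of
# pixels in the matching semantic segmentation region that belong to
# countable classes, and discard boxes below threshold T1. The set of
# countable class ids here is an assumption.
import torch


def screen_by_semantics(boxes, sem_labels, thing_ids, t1=0.6):
    """boxes: (N, 4) int tensor (x1, y1, x2, y2); sem_labels: (H, W) class map."""
    # Binary map: 1 where the pixel's semantic class is a countable object.
    thing_mask = torch.zeros_like(sem_labels, dtype=torch.float32)
    for cid in thing_ids:
        thing_mask[sem_labels == cid] = 1.0
    kept = []
    for k, (x1, y1, x2, y2) in enumerate(boxes.tolist()):
        region = thing_mask[y1:y2, x1:x2]             # region matching the box
        if region.numel() == 0:
            continue
        ratio = region.sum().item() / region.numel()  # countable-pixel area ratio
        if ratio >= t1:                               # keep well-supported boxes
            kept.append(k)
    return boxes[kept]


sem = torch.zeros(128, 128, dtype=torch.long)
sem[20:60, 20:60] = 5                                 # a countable object of class 5
boxes = torch.tensor([[20, 20, 60, 60], [80, 80, 120, 120]])
print(screen_by_semantics(boxes, sem, thing_ids={5}, t1=0.6))
```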
S4, the screened candidate regions are input to the object recognition network head and the bounding box offset prediction network head, respectively, for classification and bounding box correction.
The feature maps corresponding to the screened candidate regions are first extracted from the feature map.
A region-of-interest (RoI) pooling operation is then performed on the screened candidate region feature maps to obtain pooled candidate regions of a fixed size; the purpose of this step is to allow each candidate region to be input to the fully connected layers for classification and other processing.
The pooled candidate regions are input to the object recognition network head and the bounding box offset prediction network head, respectively, to obtain the category and the bounding box coordinate offset of each pooled candidate region.
The bounding boxes of the pooled candidate regions are then corrected according to these categories and coordinate offsets: countable-object candidate bounding boxes judged by the object recognition network head to be background are discarded, and the positions of the remaining candidate bounding boxes are corrected based on the predicted coordinate offsets.
S5, the classified and coordinate-corrected candidate regions are input to the instance segmentation network head to obtain the instance segmentation map.
The feature map and the instance regions are input into the instance segmentation network head, which performs the same operations as the semantic segmentation network head to obtain instance segmentation binary distribution features. The instance segmentation network head uses the same structure as the semantic segmentation network head and shares its parameters; the difference is that the semantic segmentation network head generates probability distribution maps for all semantic classes when producing semantic segmentation predictions, whereas for instance segmentation predictions the outputs corresponding to non-instance objects are ignored and only the probability distribution maps corresponding to instance objects are retained. The target instance mask corresponding to each instance region is then acquired, from which the instance segmentation map is generated.
Further, when an overlap occurs between different instances, the prediction with the higher confidence score in the instance segmentation binary distribution features is selected for the instance segmentation map.
S6, fusing the semantic segmentation map and the instance segmentation map to obtain the image panorama segmentation map. Fig. 4 is an implementation schematic diagram of this fusion provided in this embodiment.
There may be conflicts between the instance segmentation output and the semantic segmentation output. To obtain a unified panoramic segmentation result, the semantic segmentation map and the instance segmentation map need to be fused, specifically through the following steps:
S61, performing convolution operations on the feature maps generated by the backbone network to generate two groups of feature maps, which are concatenated with the semantic segmentation map and the instance segmentation map, respectively;
S62, applying convolution operations and a sigmoid activation function to the concatenated semantic segmentation map and instance segmentation map, respectively, to obtain an instance segmentation soft-threshold distribution feature map and a semantic segmentation soft-threshold distribution feature map;
S63, taking the element-wise product of the instance segmentation soft-threshold distribution feature map and the instance segmentation map, and likewise of the semantic segmentation soft-threshold distribution feature map and the semantic segmentation map;
S64, concatenating the semantic segmentation map and instance segmentation map after the element-wise products, preliminarily fusing them with a convolution operation, extracting features with dilated convolutions of different dilation rates, and concatenating the extracted results;
S65, further fusing the concatenated results with a convolution operation and thresholding the fused result to obtain a gating value distribution map of 0-1 values;
S66, according to the gating value distribution map, selecting the semantic segmentation or instance segmentation result for each pixel based on its 0-1 value to obtain the panorama segmentation map.
Preferably, the threshold in this step is 0.5.
S7, training and optimizing the panoramic segmentation model through the training optimization mechanism according to the image panorama segmentation map to obtain the optimized image panorama segmentation model.
Because panoramic segmentation involves semantic segmentation and instance segmentation simultaneously, it covers several basic tasks such as detection, recognition, and segmentation, and the panoramic segmentation network architecture is complex. To obtain the best optimization result, the training optimization mechanism divides the training process of the whole panoramic segmentation model into the following four steps.
The training optimization mechanism comprises:

1) Training the semantic segmentation network head and the region proposal network head with the objective function $L_{step\text{-}1} = L_{seg} + L_{rpn}$, so as to minimize it.

The multitask loss function $L_{step\text{-}1}$, representing the losses of training the semantic segmentation network head and the region proposal network head, is defined by formula (1):

$$L_{step\text{-}1} = L_{seg} + L_{rpn} \tag{1}$$

where

$$L_{seg} = -\frac{1}{N_{IP}} \sum_{i=1}^{N_{IP}} \sum_{m=1}^{M} y_i^m \log p_i^m$$

is the cross-entropy loss used as the semantic segmentation loss; $N_{IP}$ is the number of pixels in the image, $M$ is the number of semantic categories, $m$ denotes a semantic category, $y_i^m$ is the one-hot label of pixel $i$, and $p_i^m$ is the model's predicted output for pixel $i$.

$$L_{rpn} = \sum_i L_{cls\text{-}b}(a_i, a_i^*) + \lambda \sum_i a_i^* L_{reg}(t_i, t_i^*)$$

is the region proposal loss, where $L_{cls\text{-}b}$ is a binary cross-entropy classification loss,

$$L_{cls\text{-}b}(a_i, a_i^*) = -\left[a_i^* \log a_i + (1 - a_i^*)\log(1 - a_i)\right],$$

$i$ is the index of a candidate proposal region in the image, $a_i$ is the predicted probability that proposal region $i$ is a countable object, and $a_i^* \in \{0, 1\}$ indicates whether proposal region $i$ is a countable object, taking 1 if it is and 0 otherwise. $L_{reg}$ is the bounding box offset prediction loss function; in the second term the coefficient $a_i^*$ ensures that the bounding box coordinate offset loss is computed only for candidate proposal regions corresponding to countable objects, and $\lambda$ is a weighting coefficient used to balance the offset loss against the classification loss. $t_i$ denotes the predicted parameterized 4-dimensional bounding box coordinate offset vector, and $t_i^*$ is the 4-dimensional coordinate offset of the ground-truth bounding box associated with proposal region $i$. Since bounding box coordinate offset prediction is a regression problem, $L_{reg}$ is defined as

$$L_{reg}(t_i, t_i^*) = \sum_{j \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\!\left(t_i^j - t_i^{*j}\right),$$

where $j$ indexes the coordinate representation of the candidate region bounding box, $x, y$ are the top-left coordinates of the bounding box, $w, h$ are its width and height measured from the top-left coordinates, and

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise.} \end{cases}$$
2) Training the object recognition network head, the bounding box offset prediction network head, and the instance segmentation network head with the objective function $L_{step\text{-}2} = L_{cls\text{-}m} + L_{reg} + L_{ins}$.

The invention extracts bounding box features from the feature map using the candidate bounding boxes passed from the preceding stage, and defines a multitask loss function over each bounding box feature as shown in formula (2):

$$L_{step\text{-}2} = L_{cls\text{-}m} + L_{reg} + L_{ins} \tag{2}$$

where

$$L_{cls\text{-}m} = -\frac{1}{N_R} \sum_{i=1}^{N_R} \sum_{m=1}^{M_{ins}} y_i^m \log p_i^m$$

is the multi-class cross-entropy loss for classifying countable objects and the background; $N_R$ is the number of bounding box features, and $M_{ins}$ is the number of countable object categories plus one, the added one indicating that all background categories are treated as a single category. $L_{reg}$ is defined as in step 1) and measures the loss between the predicted and ground-truth bounding box coordinate offsets of countable object instances.

$$L_{ins} = -\frac{1}{N_{RP}} \sum_{i=1}^{N_{RP}} \sum_{m=1}^{M_{ins}} y_i^m \log p_i^m$$

is the segmentation loss of a candidate region; $N_{RP}$ is the number of pixels in the candidate region, $m$ is an instance-level semantic category, $y_i^m$ is the one-hot label of pixel $i$, and $p_i^m$ is the model's predicted output for pixel $i$. When computing the $L_{ins}$ loss value, only the countable object categories and the background are considered.
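The following sketch expresses the staged losses of formulas (1) and (2) with standard PyTorch primitives; the tensor shapes, the weighting coefficient λ, and the exact reduction scheme are assumptions.

```python
# Hedged sketch of the staged losses: L_seg and L_cls-m as cross-entropy,
# L_cls-b as binary cross-entropy over proposal objectness, L_reg as
# smooth-L1 over box offsets of countable objects only, and L_ins as
# per-region cross-entropy. Shapes and the lambda weight are assumptions.
import torch
import torch.nn.functional as F


def step1_loss(sem_logits, sem_labels, obj_probs, obj_targets,
               pred_deltas, gt_deltas, lam=1.0):
    l_seg = F.cross_entropy(sem_logits, sem_labels)           # L_seg over all pixels
    l_cls_b = F.binary_cross_entropy(obj_probs, obj_targets)  # L_cls-b on proposals
    pos = obj_targets > 0                                     # countable objects only
    l_reg = F.smooth_l1_loss(pred_deltas[pos], gt_deltas[pos]) if pos.any() else 0.0
    return l_seg + l_cls_b + lam * l_reg                      # L_step-1 = L_seg + L_rpn


def step2_loss(cls_logits, cls_labels, pred_deltas, gt_deltas,
               ins_logits, ins_labels):
    l_cls_m = F.cross_entropy(cls_logits, cls_labels)         # L_cls-m, bg as one class
    l_reg = F.smooth_l1_loss(pred_deltas, gt_deltas)          # box offset regression
    l_ins = F.cross_entropy(ins_logits, ins_labels)           # L_ins per candidate region
    return l_cls_m + l_reg + l_ins                            # formula (2)


sem_logits, sem_labels = torch.randn(1, 19, 64, 64), torch.randint(0, 19, (1, 64, 64))
obj_probs, obj_targets = torch.rand(10), torch.randint(0, 2, (10,)).float()
deltas, gt = torch.randn(10, 4), torch.randn(10, 4)
print(step1_loss(sem_logits, sem_labels, obj_probs, obj_targets, deltas, gt))
```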
3) Training the back-end fusion network that generates the panoramic segmentation map, with the binary cross-entropy loss function as the objective function.
For training the fusion network of the semantic segmentation output and the instance segmentation output, the fusion network outputs a single-channel gating value distribution map containing only the two values 0 and 1, so the semantic-instance segmentation gating problem is expressed as a binary classification problem. Taking the predicted gating value distribution map and the binarized image as the ground-truth label, the back-end fusion network generating the panoramic segmentation map is trained by computing a binary cross-entropy loss function.
4) The objective functions of the above three steps are summed to obtain a unified objective function, and the model is optimized based on this unified objective function to obtain the optimized panoramic segmentation model.
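A minimal, runnable sketch of this four-step schedule follows; the tiny modules and losses are placeholders standing in for the actual heads and objectives.

```python
# Four-step training schedule: stages 1-3 each optimize their own objective,
# then stage 4 fine-tunes everything on the unified sum. The modules and
# losses below are placeholders, not the patent's networks.
import torch
import torch.nn as nn

heads = nn.ModuleDict({
    "stage1": nn.Linear(8, 8),   # stands in for semantic head + RPN head
    "stage2": nn.Linear(8, 8),   # stands in for recognition/offset/instance heads
    "stage3": nn.Linear(8, 1),   # stands in for the back-end fusion network
})
x, y = torch.randn(4, 8), torch.rand(4, 1)
losses = {
    "stage1": lambda: heads["stage1"](x).pow(2).mean(),                  # ~ L_step-1
    "stage2": lambda: heads["stage2"](x).pow(2).mean(),                  # ~ L_step-2
    "stage3": lambda: nn.functional.binary_cross_entropy_with_logits(
        heads["stage3"](x), y),                                          # fusion BCE
}

for name in ("stage1", "stage2", "stage3"):          # steps 1)-3): stage-wise training
    opt = torch.optim.SGD(heads[name].parameters(), lr=0.01)
    for _ in range(10):
        opt.zero_grad(); losses[name]().backward(); opt.step()

opt = torch.optim.SGD(heads.parameters(), lr=0.001)  # step 4): unified objective
for _ in range(10):
    opt.zero_grad()
    total = sum(losses[n]() for n in losses)         # sum of the three objectives
    total.backward(); opt.step()
print(float(total))
```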
S8, performing panoramic segmentation on images with the optimized image panorama segmentation model.
Those of ordinary skill in the art will understand that: the drawings are merely schematic representations of one embodiment, and the flow charts in the drawings are not necessarily required to practice the present invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. An image panorama segmentation method based on a multitask learning deep neural network, comprising:
inputting an image into a backbone convolutional neural network for feature extraction to obtain a corresponding feature map;
inputting the feature map into a semantic segmentation network head and a region proposal network head, respectively, to obtain a semantic segmentation map and a plurality of candidate regions of the image;
screening the candidate regions according to the semantic segmentation map;
inputting the screened candidate regions to an object recognition network head and a bounding box offset prediction network head, respectively, for classification and bounding box correction;
inputting the classified and bounding-box-corrected candidate regions into an instance segmentation network head to obtain an instance segmentation map;
fusing the semantic segmentation map and the instance segmentation map to obtain an image panorama segmentation map, comprising:
performing convolution operations on the feature maps generated by the backbone network to generate two groups of feature maps, which are concatenated with the semantic segmentation map and the instance segmentation map, respectively;
applying convolution operations and a sigmoid activation function to the concatenated semantic segmentation map and instance segmentation map, respectively, to obtain an instance segmentation soft-threshold distribution feature map and a semantic segmentation soft-threshold distribution feature map;
taking the element-wise product of the instance segmentation soft-threshold distribution feature map and the instance segmentation map, and likewise of the semantic segmentation soft-threshold distribution feature map and the semantic segmentation map;
concatenating the semantic segmentation map and instance segmentation map after the element-wise products, preliminarily fusing them with a convolution operation, then extracting features with dilated convolutions of different dilation rates, and concatenating the extracted results;
further fusing the concatenated results with a convolution operation and thresholding the fused result to obtain a gating value distribution map of 0-1 values;
according to the gating value distribution map, selecting the semantic segmentation result or the instance segmentation result for each pixel based on its 0-1 value to obtain the panoramic segmentation map;
training and optimizing the panoramic segmentation network through a training optimization mechanism according to the image panorama segmentation map to obtain an optimized image panorama segmentation model;
and performing panoramic segmentation on images with the optimized image panorama segmentation model.
2. The method of claim 1, wherein inputting the feature map into a semantic segmentation network head and a region proposal network head, respectively, to obtain a semantic segmentation map and a plurality of candidate regions of the image comprises:
inputting the feature map into the semantic segmentation network head and generating pixel-level class predictions through full convolution operations, thereby obtaining the semantic segmentation map of the image;
and inputting the feature map into the region proposal network head, generating candidate regions of different sizes and aspect ratios through multiple convolution operations, and obtaining the category of each candidate region and the coordinates of its bounding box.
3. The method of claim 2, wherein screening the candidate regions according to the semantic segmentation map comprises:
determining the region of the semantic segmentation map corresponding to the position of each candidate region according to its bounding box coordinates;
for each candidate region, computing the area of the pixels belonging to countable objects within the corresponding semantic segmentation map region, and then computing the ratio of that area to the area of the candidate region;
and judging whether the area ratio of the candidate region falls within a certain threshold range, and deleting the candidate region if it does not.
4. The method of claim 3, wherein the threshold ranges from 0.5 to 0.7.
5. The method of claim 1, further comprising, before screening the candidate regions according to the semantic segmentation map, preliminarily screening the candidate regions to remove those that do not meet the rules.
6. The method of claim 1, wherein inputting the screened candidate regions to the object recognition network head and the bounding box offset prediction network head, respectively, for classification and bounding box correction comprises:
extracting the feature maps corresponding to the screened candidate regions from the feature map;
performing a region-of-interest pooling operation on the screened candidate region feature maps to obtain pooled candidate regions of a fixed size;
inputting the pooled candidate regions to the object recognition network head and the bounding box offset prediction network head, respectively, to obtain the category and the bounding box coordinate offset of each pooled candidate region;
and correcting the bounding boxes of the pooled candidate regions according to their categories and bounding box coordinate offsets.
7. The method of claim 1, wherein inputting the classified and bounding-box-corrected candidate regions into an instance segmentation network head to obtain an instance segmentation map comprises:
inputting the feature map and the instance regions into the instance segmentation network head and executing the same operations as the semantic segmentation network head to obtain instance segmentation binary distribution features;
and acquiring the target instance mask corresponding to each instance region, thereby generating the instance segmentation map.
8. The method of claim 1, wherein the training optimization mechanism comprises:
1) training the semantic segmentation network head and the region proposal network head with the objective function $L_{step\text{-}1} = L_{seg} + L_{rpn}$;
wherein $L_{step\text{-}1}$ is a multitask loss function representing the losses of training the semantic segmentation network head and the region proposal network head,

$$L_{seg} = -\frac{1}{N_{IP}} \sum_{i=1}^{N_{IP}} \sum_{m=1}^{M} y_i^m \log p_i^m$$

is the cross-entropy loss used as the semantic segmentation loss, $N_{IP}$ is the number of pixels in the image, $M$ is the number of semantic categories, $m$ denotes a semantic category, $y_i^m$ is the one-hot label of pixel $i$, and $p_i^m$ is the model's predicted output for pixel $i$;

$$L_{rpn} = \sum_i L_{cls\text{-}b}(a_i, a_i^*) + \lambda \sum_i a_i^* L_{reg}(t_i, t_i^*)$$

is the region proposal loss, wherein $L_{cls\text{-}b}$ is a binary cross-entropy classification loss,

$$L_{cls\text{-}b}(a_i, a_i^*) = -\left[a_i^* \log a_i + (1 - a_i^*)\log(1 - a_i)\right],$$

$i$ is the index of a candidate proposal region in the image, $a_i$ is the predicted probability that proposal region $i$ is a countable object, and $a_i^* \in \{0, 1\}$ indicates whether proposal region $i$ is a countable object, taking 1 if it is and 0 otherwise; $L_{reg}$ is the bounding box offset prediction loss function; in the second term the coefficient $a_i^*$ ensures that the bounding box coordinate offset loss is computed only for candidate proposal regions corresponding to countable objects, and $\lambda$ is a weighting coefficient used to balance the offset loss against the classification loss; $t_i$ denotes the predicted parameterized 4-dimensional bounding box coordinate offset vector, and $t_i^*$ is the 4-dimensional coordinate offset of the ground-truth bounding box associated with proposal region $i$; since bounding box coordinate offset prediction is a regression problem, $L_{reg}$ is defined as

$$L_{reg}(t_i, t_i^*) = \sum_{j \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\!\left(t_i^j - t_i^{*j}\right),$$

where $j$ indexes the coordinate representation of the candidate region bounding box, $x, y$ are the top-left coordinates of the bounding box, $w, h$ are its width and height measured from the top-left coordinates, and

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise;} \end{cases}$$

2) training the object recognition network head, the bounding box offset prediction network head, and the instance segmentation network head with the objective function $L_{step\text{-}2} = L_{cls\text{-}m} + L_{reg} + L_{ins}$;
wherein

$$L_{cls\text{-}m} = -\frac{1}{N_R} \sum_{i=1}^{N_R} \sum_{m=1}^{M_{ins}} y_i^m \log p_i^m$$

is the multi-class cross-entropy loss for classifying countable objects and the background, $N_R$ is the number of bounding box features, and $M_{ins}$ is the number of countable object categories plus one, the added one indicating that all background categories are treated as a single category; $L_{reg}$ defines the loss between the predicted and ground-truth bounding box coordinate offsets of countable object instances;

$$L_{ins} = -\frac{1}{N_{RP}} \sum_{i=1}^{N_{RP}} \sum_{m=1}^{M_{ins}} y_i^m \log p_i^m$$

is the segmentation loss of a candidate region, $N_{RP}$ is the number of pixels in the candidate region, $m$ is an instance-level semantic category, $y_i^m$ is the one-hot label of pixel $i$, and $p_i^m$ is the model's predicted output for pixel $i$;
3) training the back-end fusion network that generates the panoramic segmentation map with the binary cross-entropy loss function as the objective function;
and summing the objective functions of the three steps to obtain a unified objective function, and optimizing the panoramic segmentation network based on the unified objective function to obtain the optimized panoramic segmentation model.
9. The method of claim 1, wherein the backbone convolutional neural network uses a dilated (hole) convolution structure or an encoding-decoding structure.
CN201910544228.XA 2019-06-21 2019-06-21 Image panorama segmentation method based on multitask learning deep neural network Active CN110276765B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910544228.XA CN110276765B (en) 2019-06-21 2019-06-21 Image panorama segmentation method based on multitask learning deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910544228.XA CN110276765B (en) 2019-06-21 2019-06-21 Image panorama segmentation method based on multitask learning deep neural network

Publications (2)

Publication Number Publication Date
CN110276765A CN110276765A (en) 2019-09-24
CN110276765B true CN110276765B (en) 2021-04-23

Family

ID=67961578

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910544228.XA Active CN110276765B (en) 2019-06-21 2019-06-21 Image panorama segmentation method based on multitask learning deep neural network

Country Status (1)

Country Link
CN (1) CN110276765B (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111199199B (en) * 2019-12-27 2023-05-05 同济大学 Action recognition method based on self-adaptive context area selection
CN111210443B (en) * 2020-01-03 2022-09-13 吉林大学 Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN111259900A (en) * 2020-01-13 2020-06-09 河海大学 Semantic segmentation method for satellite remote sensing image
CN111368845B (en) * 2020-03-16 2023-04-07 河南工业大学 Feature dictionary construction and image segmentation method based on deep learning
CN111768415A (en) * 2020-06-15 2020-10-13 哈尔滨工程大学 Image instance segmentation method without quantization pooling
CN111814593B (en) * 2020-06-19 2024-08-06 浙江大华技术股份有限公司 Traffic scene analysis method and equipment and storage medium
CN111915628B (en) * 2020-06-24 2023-11-24 浙江大学 Single-stage instance segmentation method based on prediction target dense boundary points
CN111985457A (en) * 2020-09-11 2020-11-24 北京百度网讯科技有限公司 Traffic facility damage identification method, device, equipment and storage medium
CN112053358B (en) * 2020-09-28 2024-09-13 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for determining instance category of pixel in image
US11602132B2 (en) 2020-10-06 2023-03-14 Sixgill, LLC System and method of counting livestock
CN112257649A (en) * 2020-11-03 2021-01-22 深圳创新奇智科技有限公司 Article identification method, model training method, device and electronic equipment
CN112489060B (en) * 2020-12-07 2022-05-10 北京医准智能科技有限公司 System and method for pneumonia focus segmentation
CN112489064B (en) * 2020-12-14 2022-03-25 桂林电子科技大学 Panorama segmentation method based on edge scaling correction
CN112766165B (en) * 2021-01-20 2022-03-22 燕山大学 Falling pre-judging method based on deep neural network and panoramic segmentation
CN112802039B (en) * 2021-01-26 2022-03-01 桂林电子科技大学 Panorama segmentation method based on global edge attention
US20220261593A1 (en) * 2021-02-16 2022-08-18 Nvidia Corporation Using neural networks to perform object detection, instance segmentation, and semantic correspondence from bounding box supervision
CN112819840B (en) * 2021-02-24 2022-08-02 北京航空航天大学 High-precision image instance segmentation method integrating deep learning and traditional processing
CN112950642A (en) * 2021-02-25 2021-06-11 中国工商银行股份有限公司 Point cloud instance segmentation model training method and device, electronic equipment and medium
US11816841B2 (en) 2021-03-17 2023-11-14 Huawei Technologies Co., Ltd. Method and system for graph-based panoptic segmentation
CN113052858B (en) * 2021-03-23 2023-02-14 电子科技大学 Panorama segmentation method based on semantic stream
CN113139549B (en) * 2021-03-25 2024-03-15 北京化工大学 Parameter self-adaptive panoramic segmentation method based on multitask learning
CN113096136A (en) * 2021-03-30 2021-07-09 电子科技大学 Panoramic segmentation method based on deep learning
CN113240723A (en) * 2021-05-18 2021-08-10 中德(珠海)人工智能研究院有限公司 Monocular depth estimation method and device and depth evaluation equipment
CN114758128B (en) * 2022-04-11 2024-04-16 西安交通大学 Scene panorama segmentation method and system based on controlled pixel embedding characterization explicit interaction
CN114838729A (en) * 2022-04-27 2022-08-02 中国建设银行股份有限公司 Path planning method, device and equipment

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106530305B (en) * 2016-09-23 2019-09-13 北京市商汤科技开发有限公司 Semantic segmentation model training and image partition method and device calculate equipment
CN108090911B (en) * 2018-01-08 2022-04-01 北京航空航天大学 Near-shore ship segmentation method for optical remote sensing image
CN108335305B (en) * 2018-02-09 2020-10-30 北京市商汤科技开发有限公司 Image segmentation method and apparatus, electronic device, program, and medium
CN109447169B (en) * 2018-11-02 2020-10-27 北京旷视科技有限公司 Image processing method, training method and device of model thereof and electronic system
CN109493330A (en) * 2018-11-06 2019-03-19 电子科技大学 A kind of nucleus example dividing method based on multi-task learning
CN109685060B (en) * 2018-11-09 2021-02-05 安徽科大讯飞医疗信息技术有限公司 Image processing method and device
CN109801307A (en) * 2018-12-17 2019-05-24 中国科学院深圳先进技术研究院 A kind of panorama dividing method, device and equipment
CN109801297B (en) * 2019-01-14 2020-12-11 浙江大学 Image panorama segmentation prediction optimization method based on convolution

Also Published As

Publication number Publication date
CN110276765A (en) 2019-09-24

Similar Documents

Publication Publication Date Title
CN110276765B (en) Image panorama segmentation method based on multitask learning deep neural network
CN112884064B (en) Target detection and identification method based on neural network
CN111553387B (en) Personnel target detection method based on Yolov3
CN109902600B (en) Road area detection method
CN113158862B (en) Multitasking-based lightweight real-time face detection method
CN110263786B (en) Road multi-target identification system and method based on feature dimension fusion
CN113486726A (en) Rail transit obstacle detection method based on improved convolutional neural network
CN111104903B (en) Depth perception traffic scene multi-target detection method and system
US11640714B2 (en) Video panoptic segmentation
CN111914727B (en) Small target human body detection method based on balance sampling and nonlinear feature fusion
CN111696110B (en) Scene segmentation method and system
CN110705412A (en) Video target detection method based on motion history image
CN108764244B (en) Potential target area detection method based on convolutional neural network and conditional random field
CN112434723B (en) Day/night image classification and object detection method based on attention network
Munir et al. LDNet: End-to-end lane marking detection approach using a dynamic vision sensor
CN112241757A (en) Apparatus and method for operating a neural network
CN115984537A (en) Image processing method and device and related equipment
CN115019039A (en) Example segmentation method and system combining self-supervision and global information enhancement
CN114330234A (en) Layout structure analysis method and device, electronic equipment and storage medium
CN113807176A (en) Small sample video behavior identification method based on multi-knowledge fusion
CN117542082A (en) Pedestrian detection method based on YOLOv7
Saravanarajan et al. Improving semantic segmentation under hazy weather for autonomous vehicles using explainable artificial intelligence and adaptive dehazing approach
CN114332797A (en) Road scene semantic segmentation method and system with self-evaluation mechanism
CN117636032A (en) Multi-label image classification method based on multi-scale local features and difficult class mining
Alajlan et al. Automatic lane marking prediction using convolutional neural network and S-Shaped Binary Butterfly Optimization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant