CN113761976A - Scene semantic analysis method based on global-guided selective context network

Scene semantic analysis method based on global-guided selective context network

Info

Publication number
CN113761976A
Authority
CN
China
Prior art keywords
context
global
feature map
module
network
Prior art date
Legal status
Pending
Application number
CN202010499367.8A
Other languages
Chinese (zh)
Inventor
刘静
付君
徐溢璇
Current Assignee
Huawei Technologies Co Ltd
Institute of Automation of Chinese Academy of Science
Original Assignee
Huawei Technologies Co Ltd
Institute of Automation of Chinese Academy of Science
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd, Institute of Automation of Chinese Academy of Science filed Critical Huawei Technologies Co Ltd
Priority to CN202010499367.8A
Priority to PCT/CN2021/098192 (published as WO2021244621A1)
Publication of CN113761976A

Classifications

    • G06F18/00 Pattern recognition
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/04 Neural networks; Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods

Abstract

The application discloses a scene semantic analysis method based on a global-guided selective context network, which comprises a backbone network, a context selection network and a pixel classification network. The backbone network receives an input data source image, performs layer-by-layer feature extraction, and feeds the primary feature maps extracted at different layers to the context selection network. The context selection network uses an attention mechanism guided by global information to obtain, at each pixel position, weighting factors for fusing the global context and the local context, and adaptively fuses the global context and the local context for each pixel in the primary feature map according to these weighting factors, yielding a high-resolution secondary feature map with robust semantic expression. Finally, the pixel classification network classifies the secondary feature map pixel by pixel to obtain an accurate scene semantic parsing result, thereby improving the accuracy of scene semantic parsing.

Description

Scene semantic analysis method based on global-guided selective context network
Technical Field
The application relates to the field of scene semantic analysis, in particular to a scene semantic analysis method based on a global-guided selective context network.
Background
Scene semantic analysis is an important technical subject widely applied in the field of Artificial Intelligence (AI), for example in scene understanding, automatic driving and image editing. It belongs to the scene semantic segmentation branch of computer vision and aims to classify each pixel in an input data source image so as to segment the image into different semantic regions. As the finest-grained expression of image semantic understanding, it has wide application value. For example, in mobile-phone photography, objects in the photographed scene are identified and localized and the region of interest is finely segmented, after which subsequent image editing and processing can realize different visual effects. In the field of automatic driving, fine recognition of the vehicle driving scene enables lane-line detection, from which the drivable area is determined; by identifying traffic signs and obstacles, driving decisions can be assisted so that obstacles are avoided, and so on. The method also has wide application in video analysis, remote-sensing image analysis and medical image analysis.
Currently, the mainstream method is the Fully Convolutional Network (FCN), which takes an image of arbitrary size as input, outputs for each pixel the probability of the class to which it belongs, and obtains the image semantic segmentation result through end-to-end learning. The FCN has achieved some success in the scene parsing task, but still faces some basic problems, mainly in two respects: first, the effective receptive field of the network's output features is limited and cannot fully capture discriminative regional information, so pixel classification within a target region is inconsistent; second, the repeated down-sampling operations in the network cause the output features to lose spatial detail information, so that target edges in the segmentation result are rough or fine targets are lost. Both factors affect the accuracy of scene semantic parsing.
In addition, factors such as illumination and scale in the input data source image pose a challenge to accurate classification of image pixels across the semantic gap, making accurate scene parsing difficult. For objects that are not salient in the image (e.g., small in scale), the feature information that can be extracted is very limited, which makes segmenting these small objects particularly difficult. For objects that occupy a large portion of the image, it is difficult to identify them accurately by relying on local context information alone.
Disclosure of Invention
The application provides a scene semantic analysis method based on a global-guided selective context network, which can improve the accuracy of scene semantic parsing.
In order to solve the above technical problem, in a first aspect, an embodiment of the present application provides a scene semantic parsing method based on a global-guided selective context network, where the global-guided selective context network includes a backbone network, a context selection network, and a pixel classification network, and the method includes: the backbone network receives an input data source image, performs layer-by-layer feature extraction on it to obtain at least one primary feature map, and inputs the at least one primary feature map into the context selection network; the context selection network applies a global-information-guided attention mechanism to the at least one primary feature map to obtain weighting factors for fusing the global context and the local context at different pixel positions of the at least one primary feature map, adaptively fuses the global context and the local context for each pixel in the at least one primary feature map according to the weighting factors to obtain a secondary feature map, and inputs the secondary feature map to the pixel classification network; and the pixel classification network classifies the secondary feature map pixel by pixel to obtain a scene semantic parsing result.
The context selection network can simultaneously take into account the different needs of pixels at different positions for the global context and the local context. Through the attention mechanism guided by global information, it determines the regions of large-scale targets and of small targets in the image, and fuses the global and local context information accordingly: features of large-scale targets obtain a larger receptive field, so that large-scale targets are identified accurately and misleading by the local receptive field is reduced, while small targets obtain the local receptive field with more emphasis, so that they are segmented more finely and are not misled by information from other large-scale targets. In this way the accuracy of scene semantic parsing can ultimately be improved.
In a possible implementation of the first aspect, the backbone network is an image classification network, the backbone network includes at least one backbone network module, the backbone network module is configured to output a primary feature map, and the context selection network includes at least one context selection block; the method further includes: the backbone network module performs layer-by-layer feature extraction on the input data source image to obtain a primary feature map and outputs it to a context selection block; and the context selection block performs fused global context selection and local context selection on the primary feature map to obtain a secondary feature map.
In a possible implementation of the first aspect, by performing the aforementioned selective fusion of context information on the primary feature map, the context selection network may first obtain a selection feature map and then obtain the high-resolution secondary feature map with robust semantic expression.
In a possible implementation of the first aspect, the image classification network may be a residual network or a variant network thereof, or may be another type of network.
In a possible implementation of the first aspect, the backbone network includes n+1 backbone network modules with different spatial resolutions, the context selection network includes n context selection blocks, and n is a positive integer greater than or equal to 3; the method further includes: the 1st backbone network module outputs a 1st primary feature map according to the input data source image and inputs the 1st primary feature map to the 2nd backbone network module and to the nth context selection block; the ith backbone network module outputs the ith primary feature map to the next backbone network module and to the corresponding (n+1-i)th context selection block according to the (i-1)th primary feature map, where 2 ≤ i ≤ n; the (n+1)th backbone network module outputs the (n+1)th primary feature map according to the nth primary feature map and inputs it to the 1st context selection block; the 1st context selection block performs global context and local context selection on the received (n+1)th primary feature map and nth primary feature map, and outputs the 1st selection feature map to the 2nd context selection block; the ith context selection block performs global context and local context selection on the received (i-1)th selection feature map and (n+1-i)th primary feature map, and outputs the ith selection feature map to the next-level context selection block, where 2 ≤ i ≤ n-1; and the nth context selection block performs global context and local context selection on the received (n-1)th selection feature map and the 1st primary feature map, and outputs the nth selection feature map as the secondary feature map.
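The index routing described above can be illustrated by the following minimal sketch. It is a hedged illustration rather than code from the patent: the PyTorch-style module interfaces (each backbone module taking the previous feature map, each context selection block taking a previous selection plus a primary feature map) are assumptions for clarity.

import torch.nn as nn

class GSCNetRouting(nn.Module):
    def __init__(self, backbone_modules, scb_blocks):
        super().__init__()
        # n+1 backbone modules and n context selection blocks, as in the claim above.
        assert len(backbone_modules) == len(scb_blocks) + 1
        self.backbone = nn.ModuleList(backbone_modules)
        self.scbs = nn.ModuleList(scb_blocks)
        self.n = len(scb_blocks)

    def forward(self, image):
        # Backbone: the i-th module produces the i-th primary feature map.
        primaries = []
        x = image
        for module in self.backbone:
            x = module(x)
            primaries.append(x)              # primaries[0..n] = 1st..(n+1)th primary map

        # Context selection: block k (1-based) receives the previous selection map
        # (or the (n+1)th primary map for k=1) together with the (n+1-k)th primary map.
        selected = primaries[self.n]         # (n+1)th primary feature map
        for k in range(1, self.n + 1):
            selected = self.scbs[k - 1](selected, primaries[self.n - k])
        return selected                      # nth selection map = secondary feature map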
In a possible implementation of the first aspect, the context selection block includes a global context module guided based on global information, a local context module guided based on global information, and a fusion module; the method further comprises the following steps: the global context module adaptively fuses the global context of the input data input to the global context module to different pixels of the input data according to an attention mechanism guided by global information to obtain output data with the global context information; the local context module carries out fusion processing on the local context of the input data input to the local context module in a self-adaptive manner according to an attention mechanism guided by global information to obtain output data with local context information; and the fusion module performs splicing fusion to output the selected feature graph according to the output data of the global context module and the output data of the local context module.
In a possible implementation of the first aspect, the input data of the local context module includes a selection feature map and a primary feature map. For example, the local context module in the 1st context selection block takes the received (n+1)th primary feature map and nth primary feature map as the input data of the 1st local context module, and its adaptive fusion processing of the input data may be the adaptive fusion of the local context of the nth primary feature map to different pixels of the (n+1)th primary feature map. The local context module in the ith context selection block takes the received (i-1)th selection feature map and (n+1-i)th primary feature map as the input data of the ith local context module, and its fusion processing adaptively fuses the local context of the primary feature map to different pixels of the selection feature map, where 2 ≤ i ≤ n.
In a possible implementation of the first aspect, the global context module in the 1st context selection block takes the received (n+1)th primary feature map as input data of the 1st global context module and obtains the output data of the 1st global context module; the local context module in the 1st context selection block takes the received (n+1)th primary feature map and the nth primary feature map as input data of the 1st local context module and obtains the output data of the 1st local context module; and the fusion module in the 1st context selection block performs feature splicing and fusion according to the output data of the 1st global context module and the output data of the 1st local context module to obtain and output the 1st selection feature map.
The global context module in the ith context selection block takes the received (i-1)th selection feature map as input data of the ith global context module and obtains the output data of the ith global context module; the local context module in the ith context selection block takes the received (i-1)th selection feature map and the (n+1-i)th primary feature map as input data of the ith local context module and obtains the output data of the ith local context module; and the fusion module in the ith context selection block performs feature splicing and fusion according to the output data of the ith global context module and the output data of the ith local context module to obtain the ith selection feature map, where 2 ≤ i ≤ n-1.
The global context module in the nth context selection block takes the received (n-1)th selection feature map as input data of the nth global context module and obtains the output data of the nth global context module; the local context module in the nth context selection block takes the received (n-1)th selection feature map and the 1st primary feature map as input data of the nth local context module and obtains the output data of the nth local context module; and the fusion module in the nth context selection block performs feature splicing and fusion according to the output data of the nth global context module and the output data of the nth local context module to obtain the nth selection feature map, which serves as the secondary feature map.
In a possible implementation of the first aspect, the performing, by the global context module, global-information-guided selective fusion of the global context on the input data input to the global context module includes: performing a global average pooling operation on the input data to obtain a global pooling feature map; fusing the input data and the global pooling feature map to obtain a global context attention map guided by global information; enhancing and suppressing the global pooling feature map at different pixel positions through the global context attention map to obtain a global context feature map guided by global information; and fusing the input data and the global context feature map guided by global information to obtain the output data of the global context module after global context selection.
In a possible implementation of the first aspect, performing a global average pooling operation on input data of the global context module to obtain a global pooled feature map includes: and sequentially carrying out global average pooling operation, convolution operation, batch normalization operation, activation function processing and upsampling operation processing on the input data of the global context module to obtain a global pooling feature map.
In a possible implementation of the first aspect, the fusing the input data of the global context module and the global pooling feature map to obtain a global context attention map includes: performing convolution operation, batch normalization operation and activation function processing on input data of the global context module; and splicing and fusing the processed input data and the global pooling characteristic graph, and sequentially performing convolution operation, batch normalization operation, activation function processing, convolution operation and gating operation to obtain a global context attention diagram.
In a possible implementation of the first aspect, the obtaining a global context feature map guided based on global information by performing enhancement and suppression on the global pooled feature map at different pixel positions through a global context attention map includes: and carrying out Hadamard product operation on the global context attention diagram and the channels of the global pooling feature diagram one by one to obtain a global context feature diagram based on global information guidance.
In a possible implementation of the first aspect, fusing input data of the global context module and the global context feature map guided based on global information includes: and performing point-by-point addition operation on the input data of the global context module and the global context feature map guided based on the global information to obtain the output data of the global context module.
In a possible implementation of the first aspect, the performing, by the local context module, global-information-guided selective fusion of the local context on the input data input to the local context module includes: up-sampling the selection feature map in the input data to obtain an up-sampled feature map; performing global average pooling on the up-sampled feature map to obtain a global pooling feature map; performing convolution processing on the ith primary feature map input to the local context module to obtain a corresponding primary local context feature map, where 1 ≤ i ≤ n; obtaining a local context attention map guided by global information according to the up-sampled feature map, the global pooling feature map and the primary local context feature map; enhancing or suppressing different pixel positions of the primary local context feature map through the local context attention map to obtain a local context feature map guided by global information; and fusing the up-sampled feature map and the local context feature map guided by global information to obtain the output data of the local context module after local context selection.
In a possible implementation of the first aspect, performing global average pooling on the upsampled feature map to obtain a global pooled feature map includes: and sequentially carrying out global pooling operation, convolution operation, batch normalization operation, activation function processing and upsampling operation processing on the upsampling feature map to obtain a global pooling feature map.
In a possible implementation of the first aspect, obtaining a local context attention map according to the upsampled feature map, the global pooled feature map, and the primary local context feature map includes: performing convolution operation, batch normalization operation and activation function processing on the up-sampling feature map, and performing convolution operation, batch normalization operation and activation function processing on the primary local context feature map; and splicing and fusing the processed up-sampling feature map, the primary local context feature map and the global pooling feature map, and sequentially performing convolution operation, batch normalization operation, activation function processing, convolution operation and gating operation to obtain a local context attention map.
In a possible implementation of the first aspect, obtaining a local context feature map guided based on global information by performing enhancement or suppression on different pixel positions of a primary local context feature map through a local context attention map includes: and carrying out Hadamard product operation on the local context attention diagram and the primary local context feature diagram channel by channel to obtain a local context feature diagram guided based on global information.
In a possible implementation of the first aspect, fusing the upsampled feature map and the local context feature map based on global information guidance includes: and sequentially carrying out splicing fusion and convolution operation, batch normalization operation and activation function processing on the up-sampling feature graph and the local context feature graph guided based on the global information to obtain output data of the local context module.
In a possible implementation of the first aspect, the performing, by the backbone network, feature extraction on the input data source image includes: performing feature transformation on the input data source image layer by layer, at least by means of convolutional layers, batch normalization layers and activation layers; and stacking different backbone network modules in the backbone network by means of a residual structure, so as to strengthen the flow of information and the back-propagation of gradients and thereby obtain feature semantic expressions of different levels.
In a second aspect, an embodiment of the present application provides a global-guided selective context network, including a backbone network, a context selection network and a pixel classification network, wherein: the backbone network is configured to receive an input data source image, perform feature extraction on the input data source image to obtain at least one primary feature map, and input the at least one primary feature map into the context selection network; the context selection network is configured to apply a global-information-guided attention mechanism to the at least one primary feature map to obtain weighting factors for fusing the global context and the local context at different pixel positions of the at least one primary feature map, adaptively fuse the global context and the local context for each pixel in the at least one primary feature map according to the weighting factors to obtain a secondary feature map, and input the secondary feature map to the pixel classification network; and the pixel classification network is configured to classify the secondary feature map pixel by pixel to obtain a scene semantic parsing result.
In a possible implementation of the second aspect, the backbone network is an image classification network, the backbone network includes at least one backbone network module, the backbone network module is configured to output a primary feature map, and the context selection network includes at least one context selection block; the backbone network module is used for carrying out feature extraction on an input data source image to obtain and output a primary feature image to a context selection block; the context selection block is used for fusing the global context selection and the local context selection of the primary feature map to obtain a secondary feature map.
In a possible implementation of the second aspect, the backbone network includes n+1 backbone network modules with different spatial resolutions, the context selection network includes n context selection blocks, and n is a positive integer greater than or equal to 3; the 1st backbone network module is configured to output a 1st primary feature map according to the input data source image and to input the 1st primary feature map to the 2nd backbone network module and the nth context selection block; the ith backbone network module is configured to output the ith primary feature map to the next-level backbone network module and to the corresponding (n+1-i)th context selection block according to the (i-1)th primary feature map, where 2 ≤ i ≤ n; the (n+1)th backbone network module is configured to output the (n+1)th primary feature map according to the nth primary feature map and to input it to the 1st context selection block; the 1st context selection block is configured to perform global context and local context selection on the received (n+1)th primary feature map and nth primary feature map, and to output the 1st selection feature map to the 2nd context selection block; the ith context selection block is configured to perform global context and local context selection on the received (i-1)th selection feature map and (n+1-i)th primary feature map, and to output the ith selection feature map to the next-level context selection block, where 2 ≤ i ≤ n-1; and the nth context selection block is configured to perform global context and local context selection on the received (n-1)th selection feature map and the 1st primary feature map, and to output the nth selection feature map as the secondary feature map.
In a possible implementation of the second aspect, the context selection block includes a global context module guided based on global information, a local context module guided based on global information, and a fusion module; the global context module is used for adaptively fusing global context information of input data input to the global context module to different pixels of the input data according to an attention mechanism guided by the global information to obtain output data with the global context information; the local context module is used for adaptively performing fusion processing on local context information of input data input to the local context module according to an attention mechanism guided by global information to obtain output data with the local context information; and the fusion module is used for splicing, fusing and outputting the selected feature graph according to the output data of the global context module and the output data of the local context module.
In a possible implementation of the second aspect, the global context module in the 1st context selection block takes the received (n+1)th primary feature map as input data of the 1st global context module and obtains the output data of the 1st global context module; the local context module in the 1st context selection block takes the received (n+1)th primary feature map and the nth primary feature map as input data of the 1st local context module and obtains the output data of the 1st local context module; and the fusion module in the 1st context selection block performs feature splicing and fusion according to the output data of the 1st global context module and the output data of the 1st local context module to obtain and output the 1st selection feature map.
The global context module in the ith context selection block takes the received (i-1)th selection feature map as input data of the ith global context module and obtains the output data of the ith global context module; the local context module in the ith context selection block takes the received (i-1)th selection feature map and the (n+1-i)th primary feature map as input data of the ith local context module and obtains the output data of the ith local context module; and the fusion module in the ith context selection block performs feature splicing and fusion according to the output data of the ith global context module and the output data of the ith local context module to obtain the ith selection feature map, where 2 ≤ i ≤ n-1.
The global context module in the nth context selection block takes the received (n-1)th selection feature map as input data of the nth global context module and obtains the output data of the nth global context module; the local context module in the nth context selection block takes the received (n-1)th selection feature map and the 1st primary feature map as input data of the nth local context module and obtains the output data of the nth local context module; and the fusion module in the nth context selection block performs feature splicing and fusion according to the output data of the nth global context module and the output data of the nth local context module to obtain the nth selection feature map, which serves as the secondary feature map.
The global-guided selective context network provided by the present application is a network for implementing the scene semantic parsing method based on the global-guided selective context network provided in the first aspect and/or any one of its possible implementations, and can therefore also achieve the beneficial effects (or advantages) of the scene semantic parsing method based on the global-guided selective context network provided in the first aspect.
In a third aspect, an embodiment of the present application provides an electronic device, including: a memory for storing a computer program, the computer program including program instructions; and a processor for executing the program instructions so as to cause the electronic device to execute the foregoing scene semantic parsing method based on the global-guided selective context network.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, where the computer program includes program instructions that are executed by a computer to cause the computer to execute the foregoing scene semantic parsing method based on the global-guided selective context network.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the description of the embodiments will be briefly introduced below.
FIG. 1 is a schematic diagram illustrating the gains brought by the global context and the local context for different pixels of an image in an FCN, according to some embodiments of the present application;
FIG. 2 is a schematic diagram illustrating the structure of a global-guided selective context network and a scene semantic parsing process based on the global-guided selective context network, according to some embodiments of the present application;
FIG. 3 is a schematic diagram illustrating a global information-guided attention mechanism, according to some embodiments of the present application;
FIG. 4 is a schematic diagram illustrating the structure of an SCB in an SCB network and its processing according to some embodiments of the present application;
FIG. 5 is a schematic diagram illustrating the processing of a GGM module in a global-guided selective context network, according to some embodiments of the present application;
FIG. 6 is a schematic diagram illustrating the processing of a GLM module in a global-guided selective context network, according to some embodiments of the present application;
FIG. 7 is a schematic diagram illustrating the structure of a global-guided selective context network and a scene semantic parsing process based on the global-guided selective context network, according to some embodiments of the present application;
FIG. 8A is a schematic diagram illustrating the structure and processing of an SCB in another SCB network, according to some embodiments of the present application;
FIG. 8B is a schematic diagram illustrating the structure and processing of an SCB in yet another SCB network, according to some embodiments of the present application;
FIG. 9 is a schematic diagram illustrating an electronic device, according to some embodiments of the present application;
fig. 10 is a schematic diagram illustrating a structure of a system on a chip (SoC), according to some embodiments of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
When scene semantic analysis is carried out, the effective receptive field of the pixel features can be increased by introducing the global context into each pixel feature, so that the semantic guidance of global information can be obtained when the pixel features carry out semantic discrimination. The global context can pay more attention to and analyze large-scale targets and scenes, so that the accuracy of large-scale target identification is improved, and the situation of large target identification ambiguity is avoided.
When scene semantic analysis is carried out, shallow features in the network can be used as local context and introduced into the up-sampled high-level features to supplement spatial detail information of the high-level features, so that the network is promoted to carry out effective feature learning on target edges and fine target areas, and a fine segmentation result is realized. However, noise interference is introduced in the process of fusing the shallow feature and the high-level feature, which is not beneficial to semantic discrimination of large-scale targets.
For example, referring to fig. 1, fig. 1 shows the gains for different categories in the dataset when the global context and the local context are separately added to an FCN, evaluated on the Cityscapes validation set. The abscissa lists the 19 categories and the overall index of the Cityscapes dataset, and the ordinate shows the gain of FCN plus global context and of FCN plus local context compared with the plain FCN; the gain is measured by the Mean Intersection over Union (mIoU).
It can be seen from fig. 1 that the gains for the global context and the local context for the different classes are different. For example, the global context has a large gain for large-scale targets such as "buses", "trucks", "trains", etc., and a small gain for small-scale targets such as "utility poles", "traffic lights", "traffic signs", etc. The local context is the opposite.
Since the global context and the local context have different influences on the segmentation results of the targets with different scales, it is necessary to consider guiding the effective fusion of the two contexts to realize more accurate scene analysis when the global context and the local context are utilized.
In the scene parsing task the scales of the targets are rich and different categories have different scales, and the two kinds of context information produce different gains for targets of different scales, so it is necessary to treat the pixels of different targets differently, that is, different pixels should be given different amounts of global context information and local context information. In addition, when a person analyses a very complex scene, he or she usually starts from a global view and only analyses specific targets in detail after recognizing the overall semantics, so global information is very necessary in the scene semantic parsing process. It is therefore reasonable to use global information to guide the network in fusing the global context and the local context.
In addition, not all pixels need a global context or a local context, where a large-scale target region tends to avoid a local context (because the local context receptive field is small and semantic information is insufficient), and a small target tends to avoid a global context (because the global context lacks the semantic features of the small target).
Therefore, the application provides a Global-guided Selective Context Network (GSCNet) and a scene semantic analysis method based on it. The network can simultaneously take into account that different pixels depend to different degrees on the global context and the local context, performs global context selection and local context selection on the input data source image at the same time, and can distinguish large objects and small objects in the image well, thereby improving the accuracy of semantic segmentation.
Referring to fig. 2, the present application provides a global-guided selective context network, which includes a backbone network 110, a context selection (SCB) network 120, and a pixel classification network 130. The backbone network 110 may be a Residual Network (ResNet) pre-trained on ImageNet, or another lightweight network (such as the MobileNet series or Xception); it is a basic classification network. The backbone network 110 performs primary feature extraction on the input data source image 10 to obtain at least one primary feature map 11, and outputs the primary feature map 11 to the SCB network 120. The SCB network 120 obtains, through the attention mechanism guided by global information, weighting factors for fusing the global context and the local context at different pixel positions of the primary feature map 11, and adaptively fuses the global context and the local context for each pixel in the primary feature map 11 according to these weighting factors; that is, the SCB network 120 selectively fuses, under the guidance of global information, the global context and local context information into the primary feature map 11 to obtain the secondary feature map 12. The SCB network 120 inputs the obtained secondary feature map 12 to the pixel classification network 130, and the secondary feature map 12 is classified pixel by pixel through the pixel classification network 130 to obtain the scene semantic parsing result 13 corresponding to the input data source image 10.
In the present application, SCB network 120 includes at least one SCB.
Referring to fig. 3, the core steps of the global information-guided attention mechanism provided in the present application mainly include:
s101, the SCB performs Global average pooling operation on the input feature map input into the SCB to obtain a Global pooled feature map (Global pooled feature).
S102, the SCB splices and fuses a Global Context or a Local Context (Global/Local Context), a Global pooling feature Map and an input feature Map to obtain a Global or Local Context attention Map (Global/Local Context attention Map) guided based on Global information.
S103, the SCB performs a channel-by-channel Hadamard product operation on the global context attention map and the global context to obtain a global context feature map guided by global information (Global-guided Global Context), and performs a channel-by-channel Hadamard product operation on the local context attention map and the local context to obtain a local context feature map guided by global information (Global-guided Local Context).
S104, the SCB fuses the global context feature map based on global information guidance and the input feature map (by point-to-point addition or splicing fusion), so that a global context output feature map can be obtained, and the SCB fuses the local context feature map based on global information guidance and the input feature map (by point-to-point addition or splicing fusion), so that a local context output feature map can be obtained.
The SCB may fuse the global context output feature map and the local context output feature map to obtain a selection feature map output by the SCB.
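In hedged form, steps S101-S104 can be summarized by the following formulas, where X is the input feature map, C is the global or local context, GAP denotes global average pooling, σ the gating (sigmoid) operation, ⊙ the channel-by-channel Hadamard product and ⊕ the fusion (point-wise addition or concatenation). The ConvA/ConvD notation anticipates the operation names defined in the detailed embodiments below and is an interpretation, not a verbatim definition from the patent.

\begin{aligned}
P &= \mathrm{Upsample}\big(\mathrm{GAP}(X)\big) && \text{(S101: global pooled feature map)}\\
A &= \sigma\Big(\mathrm{ConvD}\big(\mathrm{ConvA}([X;\,P;\,C])\big)\Big) && \text{(S102: context attention map guided by global information)}\\
\tilde{C} &= A \odot C && \text{(S103: context feature map guided by global information)}\\
O &= X \oplus \tilde{C} && \text{(S104: output feature map)}
\end{aligned}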
Referring to fig. 4, the SCB provided in the present application may include a Global-guided Global context Module 1201 (GGM) (which may be referred to as a Global context Module) and a Global-guided Local context Module 1202 (GLM) (which may be referred to as a Local context Module) arranged in parallel.
The GGM1201 is configured to adaptively fuse global context information to each pixel feature based on a global information-oriented attention mechanism for input data (from a primary feature map of the backbone network 110 or a feature map of an upper SCB block) input therein, and output data with the global context information, that is, output a global context output feature.
The GLM1202 is used to adaptively fuse local context information to each pixel feature based on a global information-oriented attention mechanism for input data (from the primary feature map of the backbone network 110 and the feature map of the previous SCB block) input therein, and output data with the local context information, i.e., output local context output features.
Further, the SCB may further include a fusion module to fuse the output features of the GGM1201 and the GLM1202, further improving the expression capability of the features.
Referring to fig. 4, the fusion module may include a feature concatenation (Concat)1203 and a convolutional layer (Conv)1204, configured to perform concatenation and fusion on feature maps output by the GGM1201 and the GLM1202 to obtain a selected feature map of the output of the SCB.
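The fusion stage (Concat 1203 plus Conv 1204) can be sketched as follows. This is a hedged illustration: the BN/ReLU after the convolution, the 3x3 kernel size and the spatial alignment by bilinear interpolation (needed because the GLM branch up-samples its input by 2) are assumptions, not taken from the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SCBFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Conv 1204: fuse the concatenated GGM and GLM outputs back to `channels`.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, ggm_out, glm_out):
        # Align spatial sizes if needed (assumption), then Concat 1203 along channels.
        if ggm_out.shape[-2:] != glm_out.shape[-2:]:
            ggm_out = F.interpolate(ggm_out, size=glm_out.shape[-2:],
                                    mode='bilinear', align_corners=False)
        return self.fuse(torch.cat([ggm_out, glm_out], dim=1))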
The introduction of global context into the high-level features can increase the receptive field of the features, which is helpful for improving the segmentation of large-scale targets or scenes. The semantic discrimination of the pixel characteristics from the tiny targets and the target edges depends on local structural information, and the guidance of the global context on the semantic information is limited. It is therefore necessary to perform differentiated processing for each pixel when fusing the global context. In addition, the global information can represent the overall semantic understanding of the image, which is helpful for obtaining the distinction between the whole image and the details, so that the selective global context fusion can be guided by the GGM1201 for each pixel feature based on the global information.
Referring to fig. 5, fig. 5 is a schematic diagram illustrating the processing procedure of the GGM1201 provided in the present application. Illustratively, for an input feature map A1, where A1 ∈ R^(C×H×W) (i.e. the dimension of A1, where R denotes the real feature space, H denotes the image height, W denotes the image width, and C denotes the number of channels of the feature map), the GGM1201 performs selective fusion of the global context on the input feature map A1, which mainly includes:
s201, the GGM1201 first performs a Global average pooling (Global pooling) operation, a ConvB operation, and an Upsampling (Upsampling) operation on the input feature map a1 in sequence, to generate a Global pooled feature map B1.
Wherein ConvB operation is Conv1 × 1+ BN + Relu, that is, ConvB includes convolution layer with convolution kernel size of 1 × 1, Batch Normalization layer (BN), and Linear rectification layer (ReLU). Upsampling employs replication along the spatial dimension.
S202, a ConvA operation is performed on the input feature map a1 to generate a converted feature map.
Wherein the ConvA operation is Conv3 × 3+ BN + Relu, i.e. including convolution layer with convolution kernel size of 3 × 3, batch normalization layer and linear rectification layer.
S203, performing feature splicing (Concat) on the converted input feature map A1 and the global pooled feature map B1.
S204, the concatenated input feature map A1 and global pooling feature map B1 are sequentially subjected to a ConvA operation, a ConvD operation and a gating operation (such as sigmoid) to obtain a global context attention map G (G ∈ R^(H×W)), where the ConvD operation is a convolutional layer with a kernel size of 1×1. The response of the global context attention map G reflects how much different pixels need the global context; the pixels with larger activations in G are mainly large-scale targets, which need the global context information more.
S205, enhancing or suppressing different pixel position information of the global pooling feature map B1 by using the global context attention map G.
Illustratively, the global context attention map G is multiplied channel by channel with the global pooling feature map B1 by a Hadamard product and then multiplied by the scale factor parameter α, so as to obtain a global context feature map C guided by global information.
S206, the global context feature map C guided by global information is fused into the input feature map A1, where the fusion mode may be point-by-point addition (element-wise sum), generating a feature O_g that adaptively fuses the global context information, as the output feature of the global context module GGM, where O_g ∈ R^(C×H×W).
The operation process of S205 and S206 can be expressed by the following formula:
O_i = a_i + α·g_i·b_i
where a_i ∈ A1, g_i ∈ G, b_i ∈ B1, O_i ∈ O_g, i ∈ [1, 2, …, H×W], and i denotes the ith spatial position of A1, G, B1 and O_g; α·g_i·b_i represents the pixel feature at the ith position of the global context feature map guided by global information; α is a learnable factor and can be initialized to 1.
In the application, the GGM1201 constructs an attention mechanism guided by global information, thereby obtaining a global context attention map, and adaptively fuses the global context for each pixel: large-scale targets and scene regions obtain more global context information, which helps to enlarge the receptive field of the pixel features and improves the classification accuracy of large-scale targets. Furthermore, the GGM1201 reduces the interference of large-target information contained in the global context at small-target locations. Here, both the global information and the global context are obtained from the global average pooling feature.
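A minimal sketch of the GGM following steps S201-S206 is given below. It is a hedged interpretation, not reference code from the patent: channel widths are kept equal to the input for simplicity, and the ConvA/ConvB/ConvD blocks are modelled with the layer compositions defined above.

import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(in_ch, out_ch, k):
    # ConvA (k=3) / ConvB (k=1): convolution + batch normalization + ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class GGM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv_b = conv_bn_relu(channels, channels, 1)    # ConvB in S201
        self.conv_a = conv_bn_relu(channels, channels, 3)    # ConvA in S202
        self.attn = nn.Sequential(
            conv_bn_relu(2 * channels, channels, 3),          # ConvA in S204
            nn.Conv2d(channels, 1, 1),                        # ConvD (1x1 convolution)
            nn.Sigmoid(),                                     # gating operation
        )
        self.alpha = nn.Parameter(torch.ones(1))              # learnable scale factor

    def forward(self, a1):
        h, w = a1.shape[-2:]
        # S201: global average pooling + ConvB + spatial replication -> global pooled map B1
        b1 = self.conv_b(F.adaptive_avg_pool2d(a1, 1)).expand(-1, -1, h, w)
        # S202-S204: transform A1, concatenate with B1, produce global context attention map G
        g = self.attn(torch.cat([self.conv_a(a1), b1], dim=1))
        # S205: channel-by-channel Hadamard product of G with B1, scaled by alpha
        c = self.alpha * g * b1
        # S206: point-by-point addition back onto the input, O_g = A1 + alpha*G*B1
        return a1 + c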
As previously described, the GLM1202 focuses on improving the segmentation accuracy of object edges and fine objects. Since the primary feature map of the backbone network (here, a shallow feature of the network) mainly responds to local details of the target, its receptive field is small and it retains high-resolution location information, which can be used to improve the spatial detail representation of the features. However, since this property is not conducive to accurate identification of large-scale target bodies, the requirements of different pixels for spatial detail information need to be considered when fusing the local context. The local context module GLM1202 in the present application provides local context information of different degrees to target regions of different scales through the attention mechanism guided by global information: fine target regions are given more local context, and large-scale target regions the opposite.
Referring to fig. 6, fig. 6 shows a schematic processing procedure of the GLM1202 provided by the present application. For an input feature map A2, where A2 ∈ R^(C×H×W), the GLM1202 performs selective fusion of the local context on the input feature map A2, which mainly includes:
s301, perform upsampling (e.g., bilinear interpolation) on the input feature map a2 to increase the resolution by two times, so as to obtain a feature map D.
S302, performing global average pooling operation on the feature map D, and then performing ConvB operation and upsampling operation to generate a global pooled feature map B2. The ConvB operation is the same as that in S201.
S303, a ConvA operation is performed on the primary feature map output by the backbone network to obtain a transformed primary feature map E, and the ConvA operation is the same as the ConvA operation in S202.
S304, ConvA operation is carried out on the transformed primary feature map E to generate a transformed feature map.
S305, ConvA operation is carried out on the feature diagram D to generate a converted feature diagram.
And S306, performing feature splicing on the feature maps obtained after the conversion of the S304 and the S305 and the global pooled feature map B2 according to channel dimensions.
S307, sequentially performing a ConvA operation, a ConvD operation and a sigmoid operation on the concatenated features to obtain a local context attention map L, where L ∈ R^(H×W). The response of the local context attention map L indicates how much different pixels need the local context; the pixels with larger activations in L tend to be fine targets or target edges, whose pixel features need more local information.
S308, enhancing or suppressing different pixel characteristics of the transformed primary characteristic diagram E by using the local context attention diagram L to obtain a local context characteristic diagram F based on global information guidance.
The local context attention map L is multiplied channel by channel with the transformed primary feature map E by a Hadamard product and then multiplied by the scale factor β to obtain a local context feature map F guided by global information.
S309, performing feature concatenation on the local context feature map F guided by global information and the feature map D, where the concatenation (Concat) is performed along the channel dimension.
S310, after the features are concatenated, a ConvA operation is performed to generate a feature O_l that adaptively fuses the local context information, as the output feature of the local context module GLM.
The operation process of S308-S310 can be expressed by the following formula:
O_i = conv(cat(d_i, β·l_i·e_i))
where d_i ∈ D, l_i ∈ L, e_i ∈ E, O_i ∈ O_l, i ∈ [1, 2, …, H×W], and i denotes the ith spatial position of D, L, E and O_l; β·l_i·e_i represents the pixel feature at the ith position of the local context feature map F guided by global information; β is a learnable factor and can be initialized to 1.
In the application, GLM1202 can fuse local contexts of different degrees at different pixel positions, and learns weight factors through a network to adaptively fuse the local contexts, so that the spatial detail representation of a fine target is enhanced, and the semantic noise influence on the pixel characteristics of a large-scale target area is reduced.
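A matching hedged sketch of the GLM following steps S301-S310 is given below. The conv_bn_relu helper is the same as in the GGM sketch; the channel widths (high_ch for the deep input, low_ch for the primary feature map, channels for the internal width) and the assumption that the primary feature map has twice the spatial resolution of the deep input are illustrative choices, not values from the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(in_ch, out_ch, k):  # same ConvA/ConvB helper as in the GGM sketch
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class GLM(nn.Module):
    def __init__(self, high_ch, low_ch, channels):
        super().__init__()
        self.conv_b = conv_bn_relu(high_ch, channels, 1)      # ConvB in S302
        self.conv_a_e0 = conv_bn_relu(low_ch, channels, 3)    # ConvA in S303 (primary map -> E)
        self.conv_a_e = conv_bn_relu(channels, channels, 3)   # ConvA in S304 (on E)
        self.conv_a_d = conv_bn_relu(high_ch, channels, 3)    # ConvA in S305 (on D)
        self.attn = nn.Sequential(
            conv_bn_relu(3 * channels, channels, 3),           # ConvA in S307
            nn.Conv2d(channels, 1, 1),                         # ConvD
            nn.Sigmoid(),
        )
        self.out = conv_bn_relu(high_ch + channels, channels, 3)  # ConvA in S310
        self.beta = nn.Parameter(torch.ones(1))                # learnable scale factor

    def forward(self, x_high, x_low):
        # S301: upsample the deep (selection) feature map to twice its resolution -> D
        d = F.interpolate(x_high, scale_factor=2, mode='bilinear', align_corners=False)
        h, w = d.shape[-2:]
        # S302: global average pooling of D + ConvB + spatial replication -> B2
        b2 = self.conv_b(F.adaptive_avg_pool2d(d, 1)).expand(-1, -1, h, w)
        # S303: ConvA on the primary feature map -> transformed primary map E
        e = self.conv_a_e0(x_low)
        # S304-S307: transform E and D, concatenate with B2, produce local context attention map L
        l = self.attn(torch.cat([self.conv_a_e(e), self.conv_a_d(d), b2], dim=1))
        # S308: local context feature map guided by global information, F = beta*L*E
        f = self.beta * l * e
        # S309-S310: concatenate D and F along channels, then ConvA -> O_l
        return self.out(torch.cat([d, f], dim=1))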
In the method, the configuration of the convolutional layers in the SCB and the feature fusion operations (such as point-by-point addition, feature concatenation, and the like) can be chosen as required; the guidance by global information added during the generation of the attention map is the important idea of the method, so corresponding choices can be made as needed in an actual implementation.
In this application, the backbone network 110 may be a common image classification network, and the feature extraction performed by the backbone network 110 on the input data source image 10 includes: performing feature transformation on the input data source image 10 layer by layer, at least by means of convolutional layers, batch normalization layers and activation layers; and connecting different backbone network modules in the backbone network 110 by means of a residual structure, so as to strengthen the flow of information and the back-propagation of gradients and thereby obtain feature semantic expressions of different levels.
Referring to fig. 7, for example, in an implementation manner of the present application, the backbone network 110 may include a first backbone network module, a second backbone network module, a third backbone network module, and a fourth backbone network module, which are arranged in sequence by hierarchy. The backbone network of the present application may adopt a residual network, so the first backbone network module is a convolutional layer (conv-1), the second backbone network module is the first residual unit (Resnet block-1), the third backbone network module is the second residual unit (Resnet block-2), and the fourth backbone network module consists of the third residual unit (Resnet block-3) and the fourth residual unit (Resnet block-4), where the fourth residual unit (Resnet block-4) removes the downsampling operation and adopts dilated convolution, so that the final feature resolution is 1/16 of the input image.
Of course, the backbone network 110 may be another image classification network (e.g., the MobileNet series).
In the present application, SCB network 120 may include SCB121, SCB122, and SCB123 arranged in order by hierarchy.
The input data source image 10 is input into conv-1; conv-1 extracts features from the input data source image 10 to obtain the 1st primary feature map, which is input to Resnet block-1 and to SCB123. Resnet block-1 extracts features from the 1st primary feature map to obtain the 2nd primary feature map, which is input to Resnet block-2 and to SCB122. Resnet block-2 extracts features from the 2nd primary feature map to obtain the 3rd primary feature map, which is input to Resnet block-3 and to SCB121. Resnet block-3 and Resnet block-4 successively perform feature extraction on the 3rd primary feature map to obtain the 4th primary feature map, which is input to SCB121.
The input data of the local context module of the GLM in SCB121 includes the 4 th primary feature map and the 3 rd primary feature map, the GLM adaptively fuses the local context from the 3 rd primary feature map to the 4 th primary feature map, and obtains and outputs the output data of the 1 st local context module; the input data of the global context module of the GGM in the SCB121 comprises a 4 th primary feature map, and the GGM performs global context selective fusion on the 4 th primary feature map to obtain and output the output data of a1 st global context module; a fusion module in the SCB121 fuses the output data of the 1 st global context module and the output data of the 1 st local context module to obtain and output a1 st selection feature map; that is, SCB121 outputs the 1 st selected feature map, and inputs the 1 st selected feature map to SCB 122.
The input data of the GLM (local context module) in the SCB122 includes the 1st selection feature map and the 2nd primary feature map; the GLM in the SCB122 adaptively fuses the local context from the 2nd primary feature map into the 1st selection feature map to obtain and output the output data of the 2nd local context module. The input data of the GGM (global context module) in the SCB122 includes the 1st selection feature map; the GGM in the SCB122 performs global context selective fusion on the 1st selection feature map to obtain and output the output data of the 2nd global context module. The fusion module in the SCB122 obtains and outputs the 2nd selection feature map according to the output data of the 2nd global context module and the output data of the 2nd local context module; that is, the SCB122 outputs the 2nd selection feature map, which is input to the SCB123.
The input data of the GLM (local context module) in the SCB123 includes the 2nd selection feature map and the 1st primary feature map; the GLM in the SCB123 adaptively fuses the local context from the 1st primary feature map into the 2nd selection feature map to obtain and output the output data of the 3rd local context module. The input data of the GGM (global context module) in the SCB123 includes the 2nd selection feature map; the GGM in the SCB123 performs global context selective fusion on the 2nd selection feature map to obtain and output the output data of the 3rd global context module. The fusion module in the SCB123 obtains and outputs the 3rd selection feature map according to the output data of the 3rd global context module and the output data of the 3rd local context module; that is, the SCB123 outputs the 3rd selection feature map, which serves as the secondary feature map 12 output by the SCB network 120.
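As a non-authoritative sketch of the data flow just described, the three SCBs could be wired as follows in PyTorch. The GGM, GLM and fusion objects are only placeholders here; their internals are described elsewhere in this application, and the function and argument names are assumptions for illustration.

import torch

class SCB(torch.nn.Module):
    """Context selection block sketch: parallel GGM and GLM followed by fusion."""
    def __init__(self, ggm, glm, fusion):
        super().__init__()
        self.ggm, self.glm, self.fusion = ggm, glm, fusion

    def forward(self, deep_feat, shallow_feat):
        g = self.ggm(deep_feat)                # global context selection
        l = self.glm(deep_feat, shallow_feat)  # local context selection
        return self.fusion(g, l)               # selection feature map

def scb_network(scb121, scb122, scb123, f1, f2, f3, f4):
    # f1..f4 are the 1st..4th primary feature maps from the backbone.
    s1 = scb121(f4, f3)   # 1st selection feature map
    s2 = scb122(s1, f2)   # 2nd selection feature map
    s3 = scb123(s2, f1)   # 3rd selection feature map = secondary feature map
    return s3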
In the pixel classification network 130, the final scene semantic analysis result 13 is obtained by performing pixel-by-pixel classification according to the secondary feature map 12 output by the SCB network 120.
In the present application, the number of SCBs can be selected according to actual needs and may be set to three, four, or more; in addition, an SCB may contain only a GLM or only a GGM, which can likewise be chosen as needed.
The present application provides a global guide selective context network and a scene semantic analysis method based on it, which, by analyzing the FCN network and improving upon it, greatly improve scene parsing accuracy through an adaptive network structure that fuses context information. The core idea is an attention mechanism guided by global information that adaptively fuses context at different levels. Specifically, the global pooling feature is introduced into the generation of the attention map to obtain the degree to which different pixels demand context information at different levels, and this demand degree is used as a weight factor to control how strongly each kind of context is fused. From the attention map it can be seen that large objects and scene regions fuse more global context and less local context, while small targets and the edge regions of targets fuse more local context and less global context; as a result, the network recognizes large-scale targets more accurately and produces finer segmentation results for small targets and edges.
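The following is a simplified, illustrative PyTorch sketch of this globally guided weighting idea, not the exact structure claimed below: a global pooling feature participates in generating a per-pixel weight factor that balances global and local context. The channel sizes, the use of a single shared weight map, and the sigmoid gating are assumptions made only for this example.

import torch
import torch.nn.functional as F

class GlobalGuidedFusion(torch.nn.Module):
    """Per-pixel weight factor from a globally guided attention map (sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.attn = torch.nn.Sequential(
            torch.nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            torch.nn.BatchNorm2d(channels),
            torch.nn.ReLU(inplace=True),
            torch.nn.Conv2d(channels, 1, kernel_size=1),
            torch.nn.Sigmoid())  # gating: weight factor in [0, 1]

    def forward(self, feat, global_ctx, local_ctx):
        # Global pooling feature, broadcast back to the spatial size of feat.
        gp = F.adaptive_avg_pool2d(feat, 1).expand_as(feat)
        w = self.attn(torch.cat([feat, gp], dim=1))  # per-pixel weight factor
        # Pixels with a large w fuse more global context; others fuse more local context.
        return w * global_ctx + (1.0 - w) * local_ctx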
If scene semantic analysis is performed based only on the global context, every pixel in the image receives the same global context, and the differing degrees to which different pixels depend on it are ignored. For example, a pixel in the middle of a small object relies more on the surrounding pixels that belong to the same object, whereas a pixel at the edge of an object attends more to the global context of the image so that different objects can be well distinguished. If every pixel obtains the global context equally, the contextual information of the image cannot be fully exploited. In addition, relying only on the global context is unfriendly to local (detail) information: the edges of objects are often segmented inaccurately, and small objects are difficult to segment well. Moreover, acquiring the global context inevitably mixes feature information from different positions; if a category occupies only a small proportion of the image, its feature information is easily lost during global context extraction, and even when the original resolution is later recovered by upsampling, the detail information cannot be restored, so the recognition and localization of small objects remains unsolved.
If scene semantic analysis is performed based only on the local context, the global context that would benefit the segmentation of a pixel's target is lost, the model becomes sensitive to detail noise in the image, and classification errors arise inside salient targets. With the global guide selective context network and the scene semantic analysis method based on it, the pixels in the image are selected for context under an attention mechanism guided by the global context; because the global context guides the selection of the local context, the noise influence that the local context would otherwise introduce is alleviated.
Therefore, in the scene semantic analysis method based on the global guide selective context network, the global context and the local context are adaptively selected and fused through a globally guided attention mechanism, and the context is selected for each pixel in the image, so that the global context and the local context are balanced: the resulting segmentation model segments large objects accurately while also remaining precise on details. Compared with methods that rely only on the global context or only on the local context, the method adapts to the context at the pixel level, and performs feature extraction and fusion according to each pixel's different degrees of dependence on the global and local context, thereby improving semantic segmentation accuracy.
Further, in another implementation of the SCB of the present application, the aforementioned GGM1201 and GLM1202 may also be arranged sequentially by hierarchy.
For example, referring to fig. 8A, in one implementation of the present application the GLM1202 is arranged before the GGM1201; that is, the input data of the SCB first passes through the GLM1202 for selective fusion of the local context, and the result is then input to the GGM1201 for selective fusion of the global context. In this way, the spatial detail information of small targets is enhanced first, and the recognition accuracy of large-scale targets is strengthened afterwards.
Referring to fig. 8B, in another implementation of the present application the GGM1201 is arranged before the GLM1202; the input data of the SCB first passes through the GGM1201 for selective fusion of the global context, and the result is then input to the GLM1202 for selective fusion of the local context. In this way, the whole SCB can be regarded as a coarse-to-fine feature enhancement process: the features of large-scale targets are enhanced first, after which the details of targets and small targets are refined.
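A minimal sketch of these two sequential arrangements is given below. The boolean flag and the module interfaces are assumptions made only for illustration; the internals of the GGM and GLM are as described elsewhere in this application.

import torch

class SequentialSCB(torch.nn.Module):
    """SCB sketch where GGM and GLM are cascaded instead of run in parallel."""
    def __init__(self, ggm, glm, glm_first=True):
        super().__init__()
        self.ggm, self.glm, self.glm_first = ggm, glm, glm_first

    def forward(self, deep_feat, shallow_feat):
        if self.glm_first:  # Fig. 8A: enhance small-target detail, then large targets
            x = self.glm(deep_feat, shallow_feat)
            return self.ggm(x)
        x = self.ggm(deep_feat)  # Fig. 8B: coarse-to-fine enhancement
        return self.glm(x, shallow_feat)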
In the present application, the SCBs in the SCB network 120 may all have the same structure, such as the structure shown in fig. 4, fig. 8A, or fig. 8B; the SCBs may also have different structures, for example some containing both a GGM and a GLM while others contain only a GGM or only a GLM. All of this can be selected as needed.
It should be noted that, in the present application, the processing of the feature map by the SCB shown in fig. 8A and by the SCB shown in fig. 8B is similar to the processing of the feature map by the SCB described with reference to fig. 7, and will not be described in detail here.
The global guide selective context network provided by the present application may be an FCN-based global guide selective context network.
Further, in the present application, the pixel classification network 130 may be a common processing unit that converts features into results of pixel classification. In addition, the pixel classification network 130 may also exist as a pixel classification module at the end of the SCB network 120, so that the SCB network 120 may directly output the result of pixel classification, that is, the scene semantic analysis result 13. The settings of the pixel classification network 130 may be selected as desired.
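For illustration, a common pixel classification head of the kind referred to above can be sketched as follows; the 1x1 convolution, the bilinear upsampling, and the num_classes parameter are assumptions rather than the claimed structure.

import torch
import torch.nn.functional as F

class PixelClassifier(torch.nn.Module):
    """Pixel classification head sketch: 1x1 conv scores, upsample, per-pixel argmax."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.score = torch.nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, secondary_feature_map, out_size):
        logits = self.score(secondary_feature_map)
        logits = F.interpolate(logits, size=out_size,
                               mode='bilinear', align_corners=False)
        return logits.argmax(dim=1)  # per-pixel scene semantic parsing result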
It should be noted that the structure of the GSCNet provided in the present application includes, but is not limited to, the structure of fig. 7, and it may also be another type of structure, which may be selected as needed.
The global guide selective context network and the scene semantic analysis method based on it can be applied, in the field of terminal AI, to automatic driving, scene understanding, image editing, video analysis, medical image processing, remote sensing image processing, mobile phone photography, and the like; as an effective means of image semantic understanding, they can also provide guidance for other image processing and editing tasks.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an electronic device 900 provided according to an embodiment of the present application. The electronic device 900 may include one or more processors 901 coupled to a controller hub 904. For at least one embodiment, the controller hub 904 communicates with the processor 901 via a multi-drop bus such as a front-side bus (FSB), a point-to-point interface such as a QuickPath Interconnect (QPI), or a similar connection. The processor 901 executes instructions that control data processing operations of a general type. In one embodiment, the controller hub 904 includes, but is not limited to, a graphics memory controller hub (GMCH) (not shown) and an input/output hub (IOH) (which may be on separate chips) (not shown), where the GMCH includes the memory and graphics controllers and is coupled to the IOH.
The electronic device 900 may also include a coprocessor 906 and a memory 902 coupled to the controller hub 904. Alternatively, one or both of the memory 902 and the GMCH may be integrated within the processor 901 (as described in this application), in which case the memory 902 and the coprocessor 906 are coupled directly to the processor 901, and the controller hub 904 and the IOH are in a single chip.
The memory 902 may be, for example, Dynamic Random Access Memory (DRAM), Phase Change Memory (PCM), or a combination of the two.
In one embodiment, coprocessor 906 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like. The optional nature of coprocessor 906 is represented in FIG. 9 by dashed lines.
In one embodiment, electronic device 900 may further include a Network Interface (NIC) 903. The network interface 903 may include a transceiver to provide a radio interface for the electronic device 900 to communicate with any other suitable device (e.g., front end module, antenna, etc.). In various embodiments, the network interface 903 may be integrated with other components of the electronic device 900. The network interface 903 may implement the functions of the communication unit in the above-described embodiments.
The electronic device 900 may further include input/output (I/O) devices 905. The input/output (I/O) devices 905 may include: a user interface designed to enable a user to interact with the electronic device 900; a peripheral component interface designed to enable peripheral components to interact with the electronic device 900 as well; and/or sensors designed to determine environmental conditions and/or location information associated with the electronic device 900.
It is noted that fig. 9 is merely exemplary. That is, although fig. 9 shows the electronic device 900 as including a plurality of components, such as the processor 901, the controller hub 904 and the memory 902, in practical applications a device using the methods of the present application may include only some of the components of the electronic device 900, for example only the processor 901 and the NIC 903. The optional nature of components in fig. 9 is shown by dashed lines.
The memory of the electronic device 900 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions. A computer-readable storage medium has instructions stored therein, in particular temporary and permanent copies of the instructions.
In this application, the instructions stored in the memory of the electronic device 900 may include: a scene semantic parsing module that, when executed by at least one unit in the processor, causes the electronic device to perform the scene semantic parsing method based on the global guide selective context network.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an SoC (System on Chip) 1000 according to an embodiment of the present disclosure. In fig. 10, like parts have the same reference numerals. In addition, the dashed box is an optional feature of the more advanced SoC 1000. The SoC1000 may be used in any electronic device according to the present application. According to different devices and different instructions stored in the devices, corresponding functions can be realized.
In fig. 10, the SoC1000 includes: an interconnect unit 1002 coupled to the processor 1001; a system agent unit 1006; a bus controller unit 1005; an integrated memory controller unit 1003; a set of one or more coprocessors 1007, which may include integrated graphics logic, an image processor, an audio processor, and a video processor; an SRAM (static random access memory) unit 1008; and a DMA (direct memory access) unit 1004. In one embodiment, the coprocessor 1007 comprises a special-purpose processor, such as, for example, a network or communication processor, a compression engine, a GPGPU, a high-throughput MIC processor, an embedded processor, or the like.
The SRAM unit 1008 may include one or more computer-readable media for storing data and/or instructions. A computer-readable storage medium may have instructions stored therein, in particular temporary and permanent copies of the instructions. The instructions may include: a scene semantic parsing module that, when executed by at least one unit in the processor, causes the electronic device to perform the scene semantic parsing method based on the global guide selective context network.
Embodiments of the mechanisms disclosed herein may be implemented in software, hardware, firmware, or a combination of these implementations. Embodiments of the application may be implemented as computer programs or program code executed on programmable systems comprising at least one processor and a storage system (including volatile and non-volatile memory and/or storage units).
Program code may be applied to input instructions to perform the functions described herein and to generate output information. The output information may be applied to one or more output devices in a known manner. It is appreciated that in embodiments of the present application, the processing system may be a microprocessor, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or the like, and/or any combination thereof. According to another aspect, the processor may be a single-core processor, a multi-core processor, and/or the like, and/or any combination thereof.
The program code may be implemented in a high-level procedural or object-oriented programming language to communicate with a processor. The program code may also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or an interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed via a network or via other computer-readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash cards, or a tangible machine-readable memory used to transmit information over the Internet via electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Thus, a machine-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine.
One or more aspects of at least one embodiment may be implemented by representative instructions stored on a computer-readable storage medium, which represent various logic within the processor and which, when read by a machine, cause the machine to fabricate logic to perform the techniques described herein. These representations, known as "IP cores", may be stored on a tangible computer-readable storage medium and supplied to various customers or manufacturing facilities to be loaded into the fabrication machines that actually make the logic or processor.
In some cases, an instruction converter may be used to convert instructions from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation, or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on the processor, off the processor, or partly on and partly off the processor.
It is noted that, as used herein, the term module may refer to an application specific integrated circuit (ASIC), an electronic circuit, a processor (shared, dedicated, or group) and/or memory that executes one or more software or firmware programs, a combinational logic circuit, and/or other suitable hardware components that provide the described functionality, or it may be part of a combination of such hardware components. That is, each module in the device embodiments of the present application is a logical module; physically, one logical module may be one physical unit, may be part of one physical unit, or may be implemented by a combination of multiple physical units. In addition, although the device embodiments described above do not introduce modules that are not closely related to solving the technical problems raised in the present application, this does not mean that those embodiments contain no other modules.
It should be noted that the terms "first," "second," and the like are used merely to distinguish one description from another, and are not intended to indicate or imply relative importance.
It should be noted that in the accompanying drawings, some structural or methodical features may be shown in a particular arrangement and/or order. However, it is to be understood that such specific arrangement and/or ordering may not be required. Rather, in some embodiments, the features may be arranged in a manner and/or order different from that shown in the illustrative figures. In addition, the inclusion of a structural or methodical feature in a particular figure is not meant to imply that such feature is required in all embodiments, and in some embodiments, may not be included or may be combined with other features.
While the present application has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those skilled in the art that the foregoing is a more detailed description of the present application, and the present application is not intended to be limited to these details. Various changes in form and detail, including simple deductions or substitutions, may be made by those skilled in the art without departing from the spirit and scope of the present application.

Claims (18)

1. A scene semantic parsing method based on a global guide selective context network, characterized in that the network comprises a backbone network, a context selection network and a pixel classification network, and the method comprises the following steps:
the backbone network receives an input data source image, performs layer-by-layer feature extraction on the input data source image to obtain at least one primary feature map, and inputs the at least one primary feature map to the context selection network;
the context selection network obtains, for the at least one primary feature map, a weighting factor for fusing global context and local context at different pixel positions of the at least one primary feature map through an attention mechanism guided based on global information, adaptively fuses global context and local context for each pixel in the at least one primary feature map according to the weighting factor to obtain a secondary feature map, and inputs the secondary feature map to the pixel classification network;
and the pixel classification network classifies the secondary feature map pixel by pixel to obtain a scene semantic analysis result.
2. The method for scene semantic analysis based on the global guide selective context network according to claim 1, wherein the backbone network is an image classification network and comprises at least one backbone network module, the backbone network module being used for outputting a primary feature map, and the context selection network comprises at least one context selection block;
the method further comprises the following steps:
the backbone network module performs layer-by-layer feature extraction on the input data source image to obtain and output a primary feature map to the context selection block;
and the context selection block performs fusion of the global context selection and the local context selection on the primary feature map to obtain the secondary feature map.
3. The method for scene semantic analysis based on the global guide selective context network according to claim 2, wherein the backbone network comprises n+1 backbone network modules with different spatial resolutions, the context selection network comprises n context selection blocks, and n is a positive integer greater than or equal to 3;

the method further comprises the following steps:

the 1st backbone network module outputs a 1st primary feature map according to the input data source image and inputs the 1st primary feature map to the 2nd backbone network module and the nth context selection block;

the ith backbone network module outputs the ith primary feature map to the next backbone network module and to the corresponding (n+1-i)th context selection block according to the (i-1)th primary feature map, wherein 2 ≤ i ≤ n;

the (n+1)th backbone network module outputs an (n+1)th primary feature map according to the nth primary feature map and inputs the (n+1)th primary feature map to the 1st context selection block;

the 1st context selection block selects the global context and the local context for the received (n+1)th primary feature map and nth primary feature map, outputs the 1st selection feature map and inputs the 1st selection feature map to the 2nd context selection block;

the ith context selection block selects the global context and the local context for the received (i-1)th selection feature map and (n+1-i)th primary feature map, outputs the ith selection feature map and inputs the ith selection feature map to the next context selection block, wherein 2 ≤ i ≤ n-1;

and the nth context selection block selects the global context and the local context for the received (n-1)th selection feature map and the 1st primary feature map, and outputs the nth selection feature map as the secondary feature map.
4. The scene semantic analysis method based on the global guidance selective context network, characterized in that the context selection block comprises a global context module guided based on global information, a local context module guided based on global information, and a fusion module;
the method further comprises the following steps:
the global context module adaptively fuses the global context of the input data input to the global context module to different pixels of the input data according to an attention mechanism guided by global information to obtain output data with global context information;
the local context module adaptively performs fusion processing on the local context of the input data input to the local context module according to an attention mechanism guided by global information to obtain output data with local context information;
and the fusion module performs splicing fusion according to the output data of the global context module and the output data of the local context module to output a selection feature map.
5. The method for scene semantic parsing based on the global guide selective context network according to claim 4, wherein

the global context module in the 1st context selection block takes the received (n+1)th primary feature map as the input data of the 1st global context module and obtains the output data of the 1st global context module; the local context module in the 1st context selection block takes the received (n+1)th primary feature map and the nth primary feature map as the input data of the 1st local context module and obtains the output data of the 1st local context module; the fusion module in the 1st context selection block performs feature splicing fusion according to the output data of the 1st global context module and the output data of the 1st local context module to obtain and output the 1st selection feature map;

the global context module in the ith context selection block takes the received (i-1)th selection feature map as the input data of the ith global context module and obtains the output data of the ith global context module; the local context module in the ith context selection block takes the received (i-1)th selection feature map and the (n+1-i)th primary feature map as the input data of the ith local context module and obtains the output data of the ith local context module; the fusion module in the ith context selection block performs feature splicing fusion according to the output data of the ith global context module and the output data of the ith local context module to obtain the ith selection feature map, wherein 2 ≤ i ≤ n-1;

the global context module in the nth context selection block takes the received (n-1)th selection feature map as the input data of the nth global context module and obtains the output data of the nth global context module; the local context module in the nth context selection block takes the received (n-1)th selection feature map and the 1st primary feature map as the input data of the nth local context module and obtains the output data of the nth local context module; and the fusion module in the nth context selection block performs feature splicing fusion according to the output data of the nth global context module and the output data of the nth local context module to obtain the nth selection feature map, which serves as the secondary feature map.
6. The method for scene semantic analysis based on the global guide selective context network according to claim 4 or 5, wherein the global context module performs global information guidance-based selective fusion of the global context on the input data input into the global context module, comprising:
carrying out global average pooling operation processing on the input data to obtain a global pooling feature map;
fusing the input data and the global pooling feature map to obtain a global context attention map guided based on global information;
enhancing and suppressing the global pooling feature map at different pixel positions through the global context attention map to obtain a global context feature map guided based on global information;
and fusing the input data and the global context feature map guided based on the global information to obtain the output data of the global context module subjected to global context selection.
7. The method for scene semantic analysis based on the global guide selective context network according to claim 6, wherein performing the global average pooling operation on the input data of the global context module to obtain the global pooling feature map comprises:
and sequentially carrying out global pooling operation, convolution operation, batch normalization operation, activation function processing and upsampling operation processing on the input data of the global context module to obtain the global pooling characteristic diagram.
8. The scene semantic analysis method based on the global guide selective context network according to claim 6, wherein the fusion of the input data of the global context module and the global pooling feature map to obtain the global context attention map comprises:
performing convolution operation, batch normalization operation and activation function processing on the input data of the global context module;
and splicing and fusing the processed input data and the global pooling characteristic diagram, and sequentially performing convolution operation, batch normalization operation, activation function processing, convolution operation and gating operation to obtain the global context attention diagram.
9. The scene semantic analysis method based on the global guidance selective context network according to claim 6, wherein the global context feature map based on the global information guidance is obtained by enhancing and suppressing the global pooled feature map at different pixel positions through a global context attention map, and comprises:
and carrying out Hadamard product operation on the global context attention diagram and the channels of the global pooling feature diagram one by one to obtain the global context feature diagram guided based on the global information.
10. The scene semantic analysis method based on the global guidance selective context network according to claim 6, characterized in that the fusion of the input data of the global context module and the global context feature map based on the global information guidance comprises:
and performing point-by-point addition operation on the input data of the global context module and the global context feature map guided based on the global information to obtain the output data of the global context module.
11. The method for scene semantic parsing based on global guiding selective context network according to any one of claims 4-10, wherein the local context module performs global information guiding based selective fusion of local context on the input data inputted into the local context module, comprising:
up-sampling the selected feature map in the input data to obtain an up-sampled feature map;
carrying out global average pooling processing on the up-sampling feature map to obtain a global pooling feature map;
performing convolution processing on the ith primary feature map input to the local context module to obtain a corresponding primary local context feature map, wherein 1 ≤ i ≤ n;
obtaining a local context attention diagram guided based on global information according to the up-sampling feature diagram, the global pooling feature diagram and the primary local context feature diagram;
enhancing or suppressing different pixel positions of the primary local context feature map through the local context attention map to obtain a local context feature map guided based on global information;
and fusing the up-sampling feature map and the local context feature map guided based on the global information to obtain the output data of the local context module subjected to local context selection.
12. The method for scene semantic analysis based on the global guide selective context network according to claim 11, wherein the global average pooling processing is performed on the upsampled feature map to obtain a global pooled feature map, and the method comprises:
and sequentially carrying out global pooling operation, convolution operation, batch normalization operation, activation function processing and upsampling operation processing on the upsampling feature map to obtain the global pooling feature map.
13. The method for scene semantic parsing based on the global guide selective context network according to claim 11, wherein obtaining the local context attention map according to the upsampled feature map, the global pooled feature map and the primary local context feature map comprises:
performing convolution operation, batch normalization operation and activation function processing on the up-sampling feature map, and performing convolution operation, batch normalization operation and activation function processing on the primary local context feature map;
and splicing and fusing the processed up-sampling feature map, the primary local context feature map and the global pooling feature map, and sequentially performing convolution operation, batch normalization operation, activation function processing, convolution operation and gating operation to obtain the local context attention map.
14. The scene semantic analysis method based on the global guidance selective context network according to claim 11, wherein the local context feature map based on the global information guidance is obtained by enhancing or suppressing different pixel positions of the primary local context feature map through the local context attention map, and comprises:
and performing a Hadamard product operation on the local context attention diagram and the primary local context feature diagram channel by channel to obtain the local context feature diagram based on global information guidance.
15. The method for scene semantic analysis based on global guidance selective context network according to claim 11, wherein the fusing the up-sampling feature map and the local context feature map based on global information guidance comprises:
and sequentially carrying out splicing fusion and convolution operation, batch normalization operation and activation function processing on the up-sampling feature map and the local context feature map based on global information guidance to obtain output data of the local context module.
16. The global-guided selective context network-based scene semantic analysis method according to any one of claims 2-15, wherein the feature extraction of the input data source image by the backbone network comprises:
performing layer-by-layer feature transformation on the input data source image at least in a mode of a convolutional layer, a batch normalization layer and an activation layer;
and stacking different backbone network modules by utilizing a residual structure in the backbone network, and strengthening the flow of information and the backward propagation of gradients, so as to obtain feature semantic expressions at different levels.
17. An electronic device, comprising:

a memory for storing a computer program, the computer program comprising program instructions;

a processor for executing the program instructions to cause the electronic device to perform the method for scene semantic parsing based on the global guide selective context network according to any one of claims 1-16.
18. A computer-readable storage medium storing a computer program, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the method for scene semantic parsing based on the global guide selective context network according to any one of claims 1-16.
CN202010499367.8A 2020-06-04 2020-06-04 Scene semantic analysis method based on global guide selective context network Pending CN113761976A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010499367.8A CN113761976A (en) 2020-06-04 2020-06-04 Scene semantic analysis method based on global guide selective context network
PCT/CN2021/098192 WO2021244621A1 (en) 2020-06-04 2021-06-03 Scenario semantic parsing method based on global guidance selective context network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010499367.8A CN113761976A (en) 2020-06-04 2020-06-04 Scene semantic analysis method based on global guide selective context network

Publications (1)

Publication Number Publication Date
CN113761976A true CN113761976A (en) 2021-12-07

Family

ID=78783558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010499367.8A Pending CN113761976A (en) 2020-06-04 2020-06-04 Scene semantic analysis method based on global guide selective context network

Country Status (2)

Country Link
CN (1) CN113761976A (en)
WO (1) WO2021244621A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332849A (en) * 2022-03-16 2022-04-12 科大天工智能装备技术(天津)有限公司 Crop growth state combined monitoring method and device and storage medium

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114612479B (en) * 2022-02-09 2023-03-24 苏州大学 Medical image segmentation method and device based on global and local feature reconstruction network
CN114863165B (en) * 2022-04-12 2023-06-16 南通大学 Vertebral bone density classification method based on fusion of image histology and deep learning features
CN115409819B (en) * 2022-09-05 2024-03-29 苏州埃米迈德医疗科技有限公司 Liver image reconstruction method and reconstruction system
CN115690704B (en) * 2022-09-27 2023-08-22 淮阴工学院 LG-CenterNet model-based complex road scene target detection method and device
CN115359378B (en) * 2022-10-22 2023-03-24 长岛国家海洋公园管理中心(庙岛群岛海豹省级自然保护区管理中心) Ocean fishing equipment for determining fishing path based on offshore marine garbage distribution
CN115937533B (en) * 2022-12-05 2023-08-25 中国科学院合肥物质科学研究院 Semantic segmentation-based aeroponic tomato feature extraction method
CN115984293B (en) * 2023-02-09 2023-11-07 中国科学院空天信息创新研究院 Spatial target segmentation network and method based on edge perception attention mechanism
CN116129390B (en) * 2023-04-04 2023-06-23 石家庄铁道大学 Lane line accurate detection method for enhancing curve perception
CN116704552B (en) * 2023-06-13 2024-03-12 中国电子科技集团公司第五十四研究所 Human body posture estimation method based on main and secondary features
CN117197651B (en) * 2023-07-24 2024-03-29 移动广播与信息服务产业创新研究院(武汉)有限公司 Method and system for extracting field by combining edge detection and semantic segmentation
CN117593517B (en) * 2024-01-19 2024-04-16 南京信息工程大学 Camouflage target detection method based on complementary perception cross-view fusion network

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657538A (en) * 2018-11-05 2019-04-19 中国科学院计算技术研究所 Scene Segmentation and system based on contextual information guidance
US20190258714A1 (en) * 2018-02-22 2019-08-22 Salesforce.Com, Inc. Dialogue state tracking using a global-local encoder
CN110197182A (en) * 2019-06-11 2019-09-03 中国电子科技集团公司第五十四研究所 Remote sensing image semantic segmentation method based on contextual information and attention mechanism
CN110689083A (en) * 2019-09-30 2020-01-14 苏州大学 Context pyramid fusion network and image segmentation method
CN111079674A (en) * 2019-12-22 2020-04-28 东北师范大学 Target detection method based on global and local information fusion
CN111210435A (en) * 2019-12-24 2020-05-29 重庆邮电大学 Image semantic segmentation method based on local and global feature enhancement module

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018099473A1 (en) * 2016-12-02 2018-06-07 北京市商汤科技开发有限公司 Scene analysis method and system, and electronic device
CN110378484A (en) * 2019-04-28 2019-10-25 清华大学 A kind of empty spatial convolution pyramid pond context learning method based on attention mechanism
CN110570450B (en) * 2019-09-18 2023-03-24 哈尔滨工业大学 Target tracking method based on cascade context-aware framework

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190258714A1 (en) * 2018-02-22 2019-08-22 Salesforce.Com, Inc. Dialogue state tracking using a global-local encoder
CN109657538A (en) * 2018-11-05 2019-04-19 中国科学院计算技术研究所 Scene Segmentation and system based on contextual information guidance
CN110197182A (en) * 2019-06-11 2019-09-03 中国电子科技集团公司第五十四研究所 Remote sensing image semantic segmentation method based on contextual information and attention mechanism
CN110689083A (en) * 2019-09-30 2020-01-14 苏州大学 Context pyramid fusion network and image segmentation method
CN111079674A (en) * 2019-12-22 2020-04-28 东北师范大学 Target detection method based on global and local information fusion
CN111210435A (en) * 2019-12-24 2020-05-29 重庆邮电大学 Image semantic segmentation method based on local and global feature enhancement module

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332849A (en) * 2022-03-16 2022-04-12 科大天工智能装备技术(天津)有限公司 Crop growth state combined monitoring method and device and storage medium
CN114332849B (en) * 2022-03-16 2022-08-16 科大天工智能装备技术(天津)有限公司 Crop growth state combined monitoring method and device and storage medium

Also Published As

Publication number Publication date
WO2021244621A1 (en) 2021-12-09

Similar Documents

Publication Publication Date Title
CN113761976A (en) Scene semantic analysis method based on global guide selective context network
US11379699B2 (en) Object detection method and apparatus for object detection
CN111932577B (en) Text detection method, electronic device and computer readable medium
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN112183203A (en) Real-time traffic sign detection method based on multi-scale pixel feature fusion
CN113888547A (en) Non-supervision domain self-adaptive remote sensing road semantic segmentation method based on GAN network
CN111914654A (en) Text layout analysis method, device, equipment and medium
CN113723377A (en) Traffic sign detection method based on LD-SSD network
CN114038004A (en) Certificate information extraction method, device, equipment and storage medium
CN111444986A (en) Building drawing component classification method and device, electronic equipment and storage medium
CN114037640A (en) Image generation method and device
CN114913498A (en) Parallel multi-scale feature aggregation lane line detection method based on key point estimation
CN111523439B (en) Method, system, device and medium for target detection based on deep learning
CN116246059A (en) Vehicle target recognition method based on improved YOLO multi-scale detection
CN111062347A (en) Traffic element segmentation method in automatic driving, electronic device and storage medium
CN117079139B (en) Remote sensing image target detection method and system based on multi-scale semantic features
CN112418345B (en) Method and device for quickly identifying small targets with fine granularity
CN117115770A (en) Automatic driving method based on convolutional neural network and attention mechanism
CN115641584B (en) Foggy day image identification method and device
CN111881914A (en) License plate character segmentation method and system based on self-learning threshold
Rajaji et al. Detection of lane and speed breaker warning system for autonomous vehicles using machine learning algorithm
CN115953744A (en) Vehicle identification tracking method based on deep learning
CN112801960B (en) Image processing method and device, storage medium and electronic equipment
CN115424250A (en) License plate recognition method and device
CN117710755B (en) Vehicle attribute identification system and method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination