CN115471765B - Semantic segmentation method, device and equipment for aerial image and storage medium - Google Patents

Semantic segmentation method, device and equipment for aerial image and storage medium

Info

Publication number
CN115471765B
CN115471765B (application CN202211359202.6A)
Authority
CN
China
Prior art keywords
preset
semantic segmentation
sequence
aerial
image
Prior art date
Legal status
Active
Application number
CN202211359202.6A
Other languages
Chinese (zh)
Other versions
CN115471765A (en)
Inventor
Li Xinyu
Cheng Yu
Fang Yi
Wen Long
Current Assignee
Guangzhou University Town Guangong Science And Technology Achievement Transformation Center
Guangdong University of Technology
Original Assignee
Guangzhou University Town Guangong Science And Technology Achievement Transformation Center
Guangdong University of Technology
Priority date
Filing date
Publication date
Application filed by Guangzhou University Town Guangong Science And Technology Achievement Transformation Center and Guangdong University of Technology
Priority to CN202211359202.6A
Publication of CN115471765A
Application granted
Publication of CN115471765B
Status: Active


Classifications

    • G06V20/17 Terrestrial scenes taken from planes or by drones
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06T7/11 Region-based segmentation
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06V10/26 Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections
    • G06V10/764 Image or video recognition or understanding using classification, e.g. of video objects
    • G06V10/806 Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V10/82 Image or video recognition or understanding using neural networks
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30181 Earth observation
    • Y02T10/40 Engine management systems

Abstract

The application discloses a semantic segmentation method, apparatus, device and storage medium for aerial images. The method comprises the following steps: acquiring a preset aerial image sequence based on an unmanned aerial vehicle aerial image; encoding the preset aerial image sequence through a preset encoder in a preset semantic segmentation model to obtain an aerial coding sequence, wherein the preset semantic segmentation model comprises shallow skip connections and deep dilated residual connections, and the preset encoder comprises a 2D position attention mechanism and a multi-head self-attention mechanism; and decoding the aerial coding sequence by a preset decoder in the preset semantic segmentation model to obtain a semantic segmentation result. The method addresses the technical problem that prior-art approaches suffer from poor accuracy and increased complexity, and therefore segment the semantics of aerial images inefficiently.

Description

Semantic segmentation method, device and equipment for aerial image and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a semantic segmentation method, apparatus, device, and storage medium for an aerial image.
Background
Most current Transformer-based aerial image segmentation methods directly adopt the 1D position coding schemes of the vision Transformer to supply the position information of the input tokens (image blocks) that the Transformer otherwise lacks. These 1D position coding schemes were originally designed for the 1D word sequences of natural language processing tasks, so they are clearly ill-suited to recording the positions of input tokens in 2D pictures.
Although relative position coding methods proposed in the prior art can alleviate this problem, they still suffer reduced accuracy from many-to-one mappings and increased complexity from the extra parameters they introduce, so such semantic segmentation models remain inefficient when processing high-resolution aerial images.
Disclosure of Invention
The application provides a semantic segmentation method, apparatus, device and storage medium for aerial images, which address the technical problem that prior-art methods suffer from poor accuracy and increased complexity, and therefore segment the semantics of aerial images inefficiently.
In view of the above, a first aspect of the present application provides a semantic segmentation method for an aerial image, including:
acquiring a preset aerial image sequence based on the aerial image of the unmanned aerial vehicle;
encoding the preset aerial image sequence through a preset encoder in a preset semantic segmentation model to obtain an aerial coding sequence;
the preset semantic segmentation model comprises shallow skip connections and deep dilated residual connections, and the preset encoder comprises a 2D position attention mechanism and a multi-head self-attention mechanism;
and decoding the aerial coding sequence by a preset decoder in the preset semantic segmentation model to obtain a semantic segmentation result.
Preferably, the acquiring of the preset aerial image sequence based on the aerial image of the unmanned aerial vehicle includes:
acquiring an aerial image of the unmanned aerial vehicle through the unmanned aerial vehicle;
uniformly splitting the unmanned aerial vehicle aerial image into N equal parts to obtain a plurality of image blocks, wherein N is a positive integer;
and expanding the image blocks row by row at the pixel level, then arranging them into a one-dimensional sequence to obtain the preset aerial image sequence.
Preferably, before the preset aerial image sequence is encoded by the preset encoder in the preset semantic segmentation model to obtain the aerial coding sequence, the method further includes:
constructing a Transformer network layer according to a preset feature fusion network based on a multi-head self-attention mechanism and a 2D position attention mechanism;
generating a preset encoder by serially connecting a plurality of the Transformer network layers;
and connecting the preset encoder and a preset decoder by a preset connection structure to obtain a preset semantic segmentation model, wherein the preset connection structure comprises shallow skip connections and deep dilated residual connections.
Preferably, after the preset encoder and the preset decoder are connected by the preset connection structure to obtain the preset semantic segmentation model, the method further includes:
and performing fine tuning training on the preset semantic segmentation model by adopting a preset fine tuning training image set to realize model parameter optimization.
The second aspect of the present application provides a semantic segmentation apparatus for an aerial image, including:
the acquisition module is used for acquiring a preset aerial image sequence based on the aerial image of the unmanned aerial vehicle;
the encoding module is used for encoding the preset aerial image sequence through a preset encoder in a preset semantic segmentation model to obtain an aerial coding sequence;
the preset semantic segmentation model comprises shallow skip connections and deep dilated residual connections, and the preset encoder comprises a 2D position attention mechanism and a multi-head self-attention mechanism;
and the decoding module is used for decoding the aerial coding sequence by a preset decoder in the preset semantic segmentation model to obtain a semantic segmentation result.
Preferably, the obtaining module is specifically configured to:
acquiring an aerial image of the unmanned aerial vehicle through the unmanned aerial vehicle;
uniformly splitting the unmanned aerial vehicle aerial image into N equal parts to obtain a plurality of image blocks, wherein N is a positive integer;
and expanding the image blocks row by row at the pixel level, then arranging them into a one-dimensional sequence to obtain the preset aerial image sequence.
Preferably, the method further comprises the following steps:
the building module is used for building a Transformer network layer according to a preset feature fusion network based on a multi-head self-attention mechanism and a 2D position attention mechanism;
the generating module is used for generating a preset encoder by serially connecting a plurality of the Transformer network layers;
and the connection module is used for connecting the preset encoder and a preset decoder by a preset connection structure to obtain a preset semantic segmentation model, wherein the preset connection structure comprises shallow skip connections and deep dilated residual connections.
Preferably, the method further comprises the following steps:
and the fine tuning module is used for performing fine tuning training on the preset semantic segmentation model by adopting a preset fine tuning training image set to realize model parameter optimization.
A third aspect of the application provides a semantic segmentation apparatus for an aerial image, the apparatus comprising a processor and a memory;
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the method for semantic segmentation of an aerial image according to the first aspect according to instructions in the program code.
A fourth aspect of the present application provides a computer-readable storage medium for storing program code for performing the method for semantic segmentation of an aerial image according to the first aspect.
According to the technical scheme, the embodiment of the application has the following advantages:
the application provides a semantic segmentation method for an aerial image, which comprises the following steps: acquiring a preset aerial image sequence based on the aerial image of the unmanned aerial vehicle; carrying out coding operation on a preset aerial image sequence through a preset coder in a preset semantic segmentation model to obtain an aerial coding sequence; the preset semantic segmentation model comprises shallow jump connection and deep cavity residual connection, and the preset encoder comprises a 2D position attention mechanism and a multi-head self-attention mechanism; and adopting a preset decoder in a preset semantic segmentation model to decode the aerial photography coding sequence to obtain a semantic segmentation result.
According to the method for segmenting the semantics of the aerial images, the relative position information of the images is recorded by adopting the encoder which integrates the 2D position attention mechanism and the multi-head self-attention mechanism, the capacity of a model for capturing spatial information is improved, and the effective receptive field of a deep characteristic map can be improved by introducing hole residual connection into a deep network; excessive parameters are not introduced into the whole model, so that the complexity of the algorithm can be prevented from being deepened; and the network layer in the model is improved in pertinence according to the image characteristics, so that the accuracy of the segmentation result can be improved. Therefore, the method and the device can solve the technical problems that in the prior art, accuracy is poor, complexity is improved, and accordingly the semantic segmentation efficiency of the aerial image is poor.
Drawings
Fig. 1 is a schematic flowchart of a semantic segmentation method for an aerial image according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a semantic segmentation apparatus for an aerial image according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a preset aerial image sequence conversion process provided in an embodiment of the present application;
fig. 4 is a schematic diagram of a preset semantic segmentation model framework provided in the embodiment of the present application;
FIG. 5 is a schematic diagram of an attention mechanism network in the preset encoder according to an embodiment of the present disclosure;
fig. 6 is a schematic diagram of a network structure of deep hole residual connection according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Interpretation of terms:
transformer: is a deep learning model based on a self-attention mechanism completely. More precisely, the Transformer consists of and consists only of multi-head self-attack and Feed Forward Neural Network. Which was originally proposed in the field of natural language processing to process 1-dimensional word sequences. And then introduced into the field of computer vision to process 2D picture input due to its powerful ability to capture global semantic interactions.
Tokens: for natural semantic processing (NLP) tasks, the input of a Transformer is a 1-dimensional word sequence, so that one word vector is a token; for Computer Vision (CV) tasks, an input picture needs to be cut into equal-sized image blocks, expanded line by line, arranged into 1-dimensional image block sequence, and then sent to a Transformer for training. So an image block is a token.
Receptive field: the area where the input image can be seen by the convolutional neural network features is defined, in other words, the feature output is affected by the pixel points in the receptive field area.
Semantic segmentation: each pixel in the picture is assigned a label of its belonging category, the labels of each category being distinguished by a different color.
For easy understanding, please refer to fig. 1, an embodiment of a semantic segmentation method for an aerial image provided by the present application includes:
Step 101, a preset aerial image sequence is obtained based on an aerial image of the unmanned aerial vehicle.
Further, step 101 includes:
acquiring an aerial image of the unmanned aerial vehicle through the unmanned aerial vehicle;
uniformly splitting the unmanned aerial vehicle aerial image into N equal parts to obtain a plurality of image blocks, wherein N is a positive integer;
and expanding the image blocks row by row at the pixel level, then arranging them into a one-dimensional sequence to obtain the preset aerial image sequence.
The scenes in unmanned aerial vehicle aerial images are diverse and their environments complex, so various preprocessing operations can be applied to the images after acquisition to improve their quality in different respects and facilitate subsequent image processing.
It can be understood that the image blocks obtained by the N-fold split are two-dimensional data, while the Transformer network takes a 1-dimensional sequence as model input, so the image blocks must be serialized: each block is expanded pixel by pixel, row by row, and the results are arranged into a 1-dimensional sequence (see fig. 3); alternatively, the blocks can be expanded column by column and then arranged into a 1-dimensional sequence to obtain the preset aerial image sequence.
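As a minimal sketch of this patch-and-flatten step (the 16-pixel patch size, tensor shapes and function name are illustrative assumptions, not values fixed by the application):

```python
import torch

def image_to_patch_sequence(img: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Split an image (C, H, W) into N equal blocks and expand each block
    pixel by pixel, row by row, into a 1-D token sequence (N, patch*patch*C)."""
    c, h, w = img.shape
    assert h % patch == 0 and w % patch == 0, "image must split into N equal blocks"
    blocks = img.unfold(1, patch, patch).unfold(2, patch, patch)  # (C, H/p, W/p, p, p)
    blocks = blocks.permute(1, 2, 0, 3, 4).contiguous()           # blocks in raster order
    return blocks.view(-1, c * patch * patch)                     # row-by-row expansion

tokens = image_to_patch_sequence(torch.rand(3, 512, 512))         # N = 1024 tokens
```

Expanding column by column, the alternative mentioned above, would simply transpose each block before flattening.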
Step 102, encoding the preset aerial image sequence through a preset encoder in a preset semantic segmentation model to obtain an aerial coding sequence.
The preset semantic segmentation model comprises shallow skip connections and deep dilated residual connections, and the preset encoder comprises a 2D position attention mechanism and a multi-head self-attention mechanism.
The preset semantic segmentation model is composed mainly of the preset encoder, the preset decoder, the shallow skip connections and the deep dilated residual connections. The preset encoder fuses a 2D position attention mechanism with a multi-head self-attention mechanism, so it can capture both global semantic information and the 2D relative position information between image blocks, improving the spatial expressiveness of the image features. Moreover, the effective range of the 2D position attention mechanism adopted in this embodiment can be adjusted manually to suit the feature maps of different stages; the attention range of the position information can thus be tuned to the characteristics of the feature map at each stage, which is more flexible and reliable and improves the accuracy of image processing. The preset decoder is matched to the preset encoder and decodes with a step-by-step upsampling mechanism to realize semantic segmentation of the image. The deep dilated residual connections enlarge the effective receptive field of the deep feature maps, capturing global information better.
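At a high level, the wiring just described might be sketched as follows; module internals are elided, the four-stage layout follows this embodiment, and every name here is a placeholder rather than an identifier from the application:

```python
import torch.nn as nn

class AerialSegmenter(nn.Module):
    """Encoder stages feed the decoder along two paths: shallow stages via
    skip connections, the deepest stage through dilated residual blocks."""
    def __init__(self, stages: nn.ModuleList, dilated: nn.Module, decoder: nn.Module):
        super().__init__()
        self.stages, self.dilated, self.decoder = stages, dilated, decoder

    def forward(self, x):
        feats = []
        for stage in self.stages:               # e.g. 4 Transformer stages
            x = stage(x)
            feats.append(x)                     # kept for the skip connections
        feats[-1] = self.dilated(feats[-1])     # enlarge the deep receptive field
        return self.decoder(feats)              # step-by-step upsampling decoder
```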
Further, before step 102, the method further comprises:
constructing a Transformer network layer according to a preset feature fusion network based on a multi-head self-attention mechanism and a 2D position attention mechanism;
generating a preset encoder by serially connecting a plurality of the Transformer network layers;
and connecting the preset encoder and a preset decoder by a preset connection structure to obtain a preset semantic segmentation model, wherein the preset connection structure comprises shallow skip connections and deep dilated residual connections.
Further, after the preset encoder and the preset decoder are connected by the preset connection structure to obtain the preset semantic segmentation model, the method further comprises:
and performing fine tuning training on the preset semantic segmentation model by adopting a preset fine tuning training image set to realize model parameter optimization.
Referring to fig. 4, the preset encoder is composed of Transformer blocks. Each Transformer block in this embodiment comprises 2 consecutive Transformer network layers and an overlapped fusion module, and each Transformer network layer embeds both a multi-head self-attention mechanism and a 2D position attention mechanism (see fig. 5). The results of the two attention mechanisms are integrated as a weighted sum, which the fusion module implements; the fusion module also performs a downsampling operation to reduce the length of the block sequence.
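One way the weighted-sum integration might look, assuming two trainable scalar weights (a sketch of the idea, not the application's exact fusion module):

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Combine the multi-head self-attention output and the 2D position
    attention output as a weighted sum with trainable weights."""
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2))   # one weight per attention branch

    def forward(self, sa_out: torch.Tensor, pa_out: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.w, dim=0)       # keep the two weights normalized
        return w[0] * sa_out + w[1] * pa_out
```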
Besides the original shallow skip connections, the connection between the preset encoder and the preset decoder includes the deep dilated residual connections; the former supply richer shallow detail features, while the latter enlarge the receptive field of the deep features, and together they improve the network's ability to express image features.
Specifically, referring to fig. 5, N represents the sequence length, C represents the number of channels, R represents a reduction factor, and a trainable weight is also marked in the figure. The query (q), key (k) and value (v) are obtained by linearly projecting the token sequence; to reduce the computation and parameter count of the model, a convolution performs a sequence reduction operation on k and v. Semantic attention is then calculated from the following formula:

$$\mathrm{SA} = \mathrm{softmax}\!\left(\frac{q k^{T}}{\sqrt{d}}\right) v$$

where SA is the semantic attention; the multi-head self-attention matrix softmax(qkᵀ/√d) acts as a mask matrix all of whose elements take values between 0 and 1; softmax is the normalization function; q, k and v, the results of the multi-head self-attention linear projection, are three vectors of the same dimension obtained by projecting the same image sequence; and d is a scaling factor.
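A sketch of this semantic-attention branch follows, with a stride-R convolution shortening k and v; the single-head form and the layer names are simplifying assumptions:

```python
import math
import torch
import torch.nn as nn

class SemanticAttention(nn.Module):
    """SA = softmax(q k^T / sqrt(d)) v, with k and v shortened by a stride-R
    convolution to cut computation on long token sequences."""
    def __init__(self, dim: int, reduction: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.reduce = nn.Conv2d(dim, dim, kernel_size=reduction, stride=reduction)
        self.scale = 1.0 / math.sqrt(dim)

    def forward(self, x: torch.Tensor, hw: tuple) -> torch.Tensor:
        b, n, c = x.shape
        h, w = hw
        q = self.q(x)                                    # (B, N, C)
        xr = x.transpose(1, 2).reshape(b, c, h, w)       # back to the 2D grid
        xr = self.reduce(xr).flatten(2).transpose(1, 2)  # reduced token sequence
        k, v = self.kv(xr).chunk(2, dim=-1)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v                                  # (B, N, C)
```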
In addition, in the network layer of the 2D position attention mechanism, the coordinate matrix of the image sequence tokens in 2D space is obtained, and the Euclidean distance between the coordinates of point i and point j is then calculated:

$$D_{ij} = \left\| P_i - P'_j \right\|_2 = \sqrt{\left(x_i - x_j\right)^2 + \left(y_i - y_j\right)^2}$$

where P_i is the coordinate of point i and P'_j is the coordinate of point j. Since the tokens surrounding a given token matter more to it than distant ones, the correspondence between relative distances is mapped by a Gaussian function:

$$G_{ij} = \exp\!\left(-\frac{\hat{D}_{ij}^{2}}{2\sigma^{2}}\right)$$

where D̂ is the distance sequence scaled according to the sequence reduction factor R; the 4 Transformer modules in this embodiment can be set, from shallow to deep, to factor values of 8, 4, 2 and 1. P and P' are respectively the 2D position coordinate matrices of the image sequence before and after compression of the sequence length; the former is of size N × 2, and the latter has its length reduced by the factor R. D̂ is the input of the Gaussian function, σ is the standard deviation of the Gaussian function, and the mean of the Gaussian function is 0 in this example.

The 2D position attention weights can then be calculated at the softmax network layer, and an attention-weighted sum is computed based on these weights:

$$\mathrm{PA} = \mathrm{softmax}(G), \qquad \mathrm{out} = \mathrm{PA}\,v$$

where PA is the position attention and out is the attention-weighted sum.
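Under the same reading, the 2D position attention branch can be sketched as below; how the reduced coordinate grid is formed and how σ enters are assumptions consistent with the definitions above:

```python
import torch

def position_attention(v: torch.Tensor, hw: tuple, reduction: int,
                       sigma: float) -> torch.Tensor:
    """out = softmax(G) v with G = exp(-D^2 / (2 sigma^2)), where D holds the
    Euclidean distances between the token coordinates P and the coordinates P'
    of the length-compressed sequence."""
    h, w = hw
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    p = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float()             # P: (N, 2)
    p_red = p.reshape(h, w, 2)[::reduction, ::reduction].reshape(-1, 2)  # P'
    d = torch.cdist(p, p_red)                    # pairwise Euclidean distances
    g = torch.exp(-d.pow(2) / (2 * sigma ** 2))  # Gaussian mapping, mean 0
    pa = torch.softmax(g, dim=-1)                # 2D position attention weights
    return pa @ v                                # weighted sum; v: (reduced N, C)
```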
Since the shallow skip connection is a common structure in encoder-decoder networks, in which the feature map of a given encoder stage is passed directly to the corresponding decoder stage to supply the detail information lost through repeated downsampling, it is not described further here. Referring to fig. 6, BN denotes batch normalization, ReLU is the activation function, and rate is the dilation rate of the dilated convolution. Two consecutive dilated convolution layers enlarge the receptive field while a residual connection retains the original feature map; this is the working principle of the deep dilated residual connection structure, with the feature map understood to enter at the bottom of fig. 6 and exit at the top.
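A sketch of the dilated residual structure of fig. 6, assuming 3×3, channel-preserving convolutions (the kernel size is an illustrative choice):

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """Two consecutive dilated convolution layers (Conv -> BN -> ReLU) enlarge
    the effective receptive field; the residual branch keeps the original
    feature map, which enters at the bottom and exits at the top of fig. 6."""
    def __init__(self, channels: int, rate: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=rate, dilation=rate),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=rate, dilation=rate),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.body(x) + x)  # residual keeps the original features
```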
The fine-tuning training image set also consists of aerial images; such a data set is generally small and is used only to fine-tune a pre-trained model. In the preset semantic segmentation model of this embodiment, the 2D position attention mechanism, the dilated residual connections and the decoder are not pre-trained, while the structures retained from the original Transformer, such as the multi-head self-attention mechanism and the shallow skip connection structure, are pre-trained. Fine-tuning the preset semantic segmentation model optimizes individual parameters in the model and improves its performance, such as accuracy and reliability.
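Fine-tuning on a small aerial set could then proceed along these lines; the loss, optimizer and learning rate are ordinary choices assumed for illustration, not values specified by the application:

```python
import torch
import torch.nn as nn

def finetune(model: nn.Module, loader, epochs: int = 20, lr: float = 6e-5):
    """Fine-tune the whole segmentation model (pre-trained Transformer parts
    plus the newly initialized position attention, dilated residual
    connections and decoder) with a small learning rate."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, masks in loader:          # masks: per-pixel class labels
            optimizer.zero_grad()
            loss = criterion(model(images), masks)
            loss.backward()
            optimizer.step()
```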
Step 103, decoding the aerial coding sequence by a preset decoder in the preset semantic segmentation model to obtain a semantic segmentation result.
In the semantic segmentation method for aerial images provided above, an encoder that fuses a 2D position attention mechanism with a multi-head self-attention mechanism records the relative position information of the image, improving the model's ability to capture spatial information, while the dilated residual connections introduced in the deep network enlarge the effective receptive field of the deep feature maps. The model as a whole introduces few extra parameters, avoiding additional algorithmic complexity, and its network layers are tailored to the characteristics of the images, improving the accuracy of the segmentation result. The method thereby resolves the poor accuracy, increased complexity, and consequently low semantic segmentation efficiency of prior-art processing of aerial images.
For ease of understanding, referring to fig. 2, the present application provides an embodiment of a semantic segmentation apparatus for aerial images, including:
an obtaining module 201, configured to obtain a preset aerial image sequence based on an aerial image of an unmanned aerial vehicle;
the encoding module 202 is configured to encode the preset aerial image sequence through a preset encoder in a preset semantic segmentation model to obtain an aerial coding sequence;
the preset semantic segmentation model comprises shallow skip connections and deep dilated residual connections, and the preset encoder comprises a 2D position attention mechanism and a multi-head self-attention mechanism;
and the decoding module 203 is configured to decode the aerial coding sequence by a preset decoder in the preset semantic segmentation model to obtain a semantic segmentation result.
Further, the obtaining module 201 is specifically configured to:
acquiring an aerial image of the unmanned aerial vehicle through the unmanned aerial vehicle;
uniformly splitting the unmanned aerial vehicle aerial image into N equal parts to obtain a plurality of image blocks, wherein N is a positive integer;
and expanding the image blocks row by row at the pixel level, then arranging them into a one-dimensional sequence to obtain the preset aerial image sequence.
Further, still include:
the building module 204 is used for building a Transformer network layer according to a preset feature fusion network based on a multi-head self-attention mechanism and a 2D position attention mechanism;
a generating module 205, configured to generate a preset encoder through a plurality of transform network layers in series;
the connection module 206 is configured to connect the preset encoder and the preset decoder by using a preset connection structure to obtain a preset semantic segmentation model, where the preset connection structure includes a shallow layer jump connection and a deep layer cavity residual error connection.
Further, still include:
and the fine tuning module 207 is configured to perform fine tuning training on the preset semantic segmentation model by using a preset fine tuning training image set, so as to optimize model parameters.
The application also provides semantic segmentation equipment for the aerial image, wherein the equipment comprises a processor and a memory;
the memory is used for storing the program codes and transmitting the program codes to the processor;
the processor is used for executing the semantic segmentation method of the aerial image in the above method embodiment according to the instructions in the program code.
The present application further provides a computer-readable storage medium for storing program code for performing the method for semantic segmentation of an aerial image in the above-described method embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application, or portions or all or portions of the technical solutions that contribute to the prior art, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for executing all or part of the steps of the methods described in the embodiments of the present application through a computer device (which may be a personal computer, a server, or a network device). And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (6)

1. A semantic segmentation method for aerial images is characterized by comprising the following steps:
acquiring a preset aerial image sequence based on an unmanned aerial vehicle aerial image, the acquisition process comprising:
acquiring the aerial image of the unmanned aerial vehicle through the unmanned aerial vehicle;
uniformly splitting the unmanned aerial vehicle aerial image into N equal parts to obtain a plurality of image blocks, wherein N is a positive integer;
expanding the image blocks row by row at the pixel level, then arranging them into a one-dimensional sequence to obtain the preset aerial image sequence;
based on a multi-head self-attention mechanism and a 2D position attention mechanism, constructing a Transformer network layer according to a preset feature fusion network, specifically:

calculating the semantic attention based on the multi-head self-attention mechanism according to the following formula:

$$\mathrm{SA} = \mathrm{softmax}\!\left(\frac{q k^{T}}{\sqrt{d}}\right) v$$

wherein SA is the semantic attention; the multi-head self-attention matrix softmax(qkᵀ/√d) is a mask matrix all of whose elements take values of 0-1; softmax is the normalization function; q, k and v, the query, key and value obtained by the multi-head self-attention linear projection, are three vectors of the same dimension obtained by projecting the same image sequence; and d is a scaling factor;

in the network layer of the 2D position attention mechanism, obtaining a coordinate matrix of the image sequence tokens in 2D space, and then calculating the Euclidean distance between the coordinates of point i and point j based on the following formula:

$$D_{ij} = \left\| P_i - P'_j \right\|_2 = \sqrt{\left(x_i - x_j\right)^2 + \left(y_i - y_j\right)^2}$$

wherein P_i is the coordinate of point i and P'_j is the coordinate of point j; mapping the correspondence between the relative distances of two points by a Gaussian function:

$$G_{ij} = \exp\!\left(-\frac{\hat{D}_{ij}^{2}}{2\sigma^{2}}\right)$$

wherein D̂ is the distance sequence scaled according to the sequence reduction factor R; P and P' are respectively the 2D position coordinate matrices of the image sequence before and after compression of the sequence length, the former of size N × 2 and the latter with its length reduced by the factor R; D̂ is the input of the Gaussian function, σ is the standard deviation of the Gaussian function, and the mean of the Gaussian function is 0;

computing the 2D position attention weights and an attention-weighted sum based on those weights:

$$\mathrm{PA} = \mathrm{softmax}(G), \qquad \mathrm{out} = \mathrm{PA}\,v$$

wherein PA is the position attention and out is the attention-weighted sum;
generating a preset encoder by serially connecting a plurality of the Transformer network layers;
connecting the preset encoder and a preset decoder by a preset connection structure to obtain a preset semantic segmentation model, wherein the preset connection structure comprises a shallow skip connection and a deep dilated residual connection;
encoding the preset aerial image sequence through the preset encoder in the preset semantic segmentation model to obtain an aerial coding sequence;
the preset encoder comprises the 2D position attention mechanism and the multi-head self-attention mechanism;
and decoding the aerial coding sequence by the preset decoder in the preset semantic segmentation model to obtain a semantic segmentation result.
2. The method for semantic segmentation of aerial images according to claim 1, wherein after the preset encoder and the preset decoder are connected by the preset connection structure to obtain the preset semantic segmentation model, the method further comprises:
and performing fine tuning training on the preset semantic segmentation model by adopting a preset fine tuning training image set to realize model parameter optimization.
3. An apparatus for semantic segmentation of an aerial image, comprising:
the acquisition module is used for acquiring a preset aerial image sequence based on an aerial image of the unmanned aerial vehicle, and the acquisition module is specifically used for:
acquiring an aerial image of the unmanned aerial vehicle through the unmanned aerial vehicle;
uniformly splitting the unmanned aerial vehicle aerial image into N equal parts to obtain a plurality of image blocks, wherein N is a positive integer;
expanding the image blocks row by row at the pixel level, then arranging them into a one-dimensional sequence to obtain the preset aerial image sequence;
the building module is used for building a Transformer network layer according to a preset feature fusion network based on a multi-head self-attention mechanism and a 2D position attention mechanism, specifically:

calculating the semantic attention based on the multi-head self-attention mechanism according to the following formula:

$$\mathrm{SA} = \mathrm{softmax}\!\left(\frac{q k^{T}}{\sqrt{d}}\right) v$$

wherein SA is the semantic attention; the multi-head self-attention matrix softmax(qkᵀ/√d) is a mask matrix all of whose elements take values of 0-1; softmax is the normalization function; q, k and v, the query, key and value obtained by the multi-head self-attention linear projection, are three vectors of the same dimension obtained by projecting the same image sequence; and d is a scaling factor;

in the network layer of the 2D position attention mechanism, obtaining a coordinate matrix of the image sequence tokens in 2D space, and then calculating the Euclidean distance between the coordinates of point i and point j based on the following formula:

$$D_{ij} = \left\| P_i - P'_j \right\|_2 = \sqrt{\left(x_i - x_j\right)^2 + \left(y_i - y_j\right)^2}$$

wherein P_i is the coordinate of point i and P'_j is the coordinate of point j; mapping the correspondence between the relative distances of two points by a Gaussian function:

$$G_{ij} = \exp\!\left(-\frac{\hat{D}_{ij}^{2}}{2\sigma^{2}}\right)$$

wherein D̂ is the distance sequence scaled according to the sequence reduction factor R; P and P' are respectively the 2D position coordinate matrices of the image sequence before and after compression of the sequence length, the former of size N × 2 and the latter with its length reduced by the factor R; D̂ is the input of the Gaussian function, σ is the standard deviation of the Gaussian function, and the mean of the Gaussian function is 0;

computing the 2D position attention weights and an attention-weighted sum based on those weights:

$$\mathrm{PA} = \mathrm{softmax}(G), \qquad \mathrm{out} = \mathrm{PA}\,v$$

wherein PA is the position attention and out is the attention-weighted sum;
the generating module is used for generating a preset encoder by serially connecting a plurality of the Transformer network layers;
the connection module is used for connecting the preset encoder and a preset decoder by a preset connection structure to obtain a preset semantic segmentation model, wherein the preset connection structure comprises a shallow skip connection and a deep dilated residual connection;
the encoding module is used for encoding the preset aerial image sequence through the preset encoder in the preset semantic segmentation model to obtain an aerial coding sequence;
the preset encoder comprises the 2D position attention mechanism and the multi-head self-attention mechanism;
and the decoding module is used for decoding the aerial coding sequence by the preset decoder in the preset semantic segmentation model to obtain a semantic segmentation result.
4. The apparatus for semantic segmentation of an aerial image according to claim 3, further comprising:
and the fine tuning module is used for performing fine tuning training on the preset semantic segmentation model by adopting a preset fine tuning training image set to realize model parameter optimization.
5. An apparatus for semantic segmentation of an aerial image, the apparatus comprising a processor and a memory;
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of semantic segmentation of an aerial image of any of claims 1-2 according to instructions in the program code.
6. A computer-readable storage medium for storing program code for performing the method for semantic segmentation of aerial images according to any one of claims 1-2.
CN202211359202.6A 2022-11-02 2022-11-02 Semantic segmentation method, device and equipment for aerial image and storage medium Active CN115471765B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211359202.6A CN115471765B (en) 2022-11-02 2022-11-02 Semantic segmentation method, device and equipment for aerial image and storage medium

Publications (2)

Publication Number Publication Date
CN115471765A (en) 2022-12-13
CN115471765B (en) 2023-04-07

Family

ID=84337564

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211359202.6A Active CN115471765B (en) 2022-11-02 2022-11-02 Semantic segmentation method, device and equipment for aerial image and storage medium

Country Status (1)

Country Link
CN (1) CN115471765B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021251886A1 (en) * 2020-06-09 2021-12-16 Telefonaktiebolaget Lm Ericsson (Publ) Providing semantic information with encoded image data
CN114648535A (en) * 2022-03-21 2022-06-21 北京工商大学 Food image segmentation method and system based on dynamic transform
CN114998361A (en) * 2022-06-07 2022-09-02 山西云时代智慧城市技术发展有限公司 Agricultural land cover spatio-temporal semantic segmentation method based on transformations-MulMLA

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11321863B2 (en) * 2019-09-23 2022-05-03 Toyota Research Institute, Inc. Systems and methods for depth estimation using semantic features
CN111259898B (en) * 2020-01-08 2023-03-24 西安电子科技大学 Crop segmentation method based on unmanned aerial vehicle aerial image
CN111739035B (en) * 2020-06-30 2022-09-30 腾讯科技(深圳)有限公司 Image processing method, device and equipment based on artificial intelligence and storage medium
CN112396613A (en) * 2020-11-17 2021-02-23 平安科技(深圳)有限公司 Image segmentation method and device, computer equipment and storage medium
CN112560501B (en) * 2020-12-25 2022-02-25 北京百度网讯科技有限公司 Semantic feature generation method, model training method, device, equipment and medium
CN114821058A (en) * 2022-04-28 2022-07-29 济南博观智能科技有限公司 Image semantic segmentation method and device, electronic equipment and storage medium
CN115115835A (en) * 2022-06-16 2022-09-27 腾讯科技(深圳)有限公司 Image semantic segmentation method, device, equipment, storage medium and program product


Also Published As

Publication number Publication date
CN115471765A (en) 2022-12-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Cheng Yu; Li Xinyu; Fang Yi; Wen Long
Inventor before: Li Xinyu; Cheng Yu; Fang Yi; Wen Long