CN115471765B - Semantic segmentation method, device and equipment for aerial image and storage medium - Google Patents

Semantic segmentation method, device and equipment for aerial image and storage medium

Info

Publication number
CN115471765B
CN115471765B (application CN202211359202.6A)
Authority
CN
China
Prior art keywords
preset
semantic segmentation
sequence
aerial
image
Prior art date
Legal status
Active
Application number
CN202211359202.6A
Other languages
Chinese (zh)
Other versions
CN115471765A (en)
Inventor
Li Xinyu
Cheng Yu
Fang Yi
Wen Long
Current Assignee
Guangzhou University Town Guangong Science And Technology Achievement Transformation Center
Guangdong University of Technology
Original Assignee
Guangzhou University Town Guangong Science And Technology Achievement Transformation Center
Guangdong University of Technology
Priority date
Filing date
Publication date
Application filed by Guangzhou University Town Guangong Science And Technology Achievement Transformation Center and Guangdong University of Technology
Priority to CN202211359202.6A
Publication of CN115471765A
Application granted
Publication of CN115471765B
Status: Active


Classifications

    • G06V20/17 Terrestrial scenes taken from planes or by drones
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06T7/11 Region-based segmentation
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06V10/26 Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections
    • G06V10/764 Image or video recognition or understanding using classification, e.g. of video objects
    • G06V10/806 Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V10/82 Image or video recognition or understanding using neural networks
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30181 Earth observation
    • Y02T10/40 Engine management systems

Abstract

The application discloses a semantic segmentation method, apparatus, device and storage medium for aerial images. The method comprises the following steps: acquiring a preset aerial image sequence based on an unmanned aerial vehicle aerial image; encoding the preset aerial image sequence through a preset encoder in a preset semantic segmentation model to obtain an aerial coding sequence, wherein the preset semantic segmentation model comprises shallow skip connections and deep dilated residual connections, and the preset encoder comprises a 2D position attention mechanism and a multi-head self-attention mechanism; and decoding the aerial coding sequence by a preset decoder in the preset semantic segmentation model to obtain a semantic segmentation result. The method addresses the technical problem that prior-art approaches suffer from poor accuracy and increased complexity, and therefore segment the semantics of aerial images inefficiently.

Description

Semantic segmentation method, device and equipment for aerial image and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a semantic segmentation method, apparatus, device, and storage medium for an aerial image.
Background
Most current Transformer-based aerial image segmentation methods directly adopt the 1D position coding schemes of the vision Transformer to supply the position information of the input tokens (image blocks) that the Transformer otherwise lacks. These 1D position coding schemes were originally designed for the 1D word sequences of natural language processing tasks, so they are clearly ill-suited to recording the positions of input tokens in 2D pictures.
Although relative position coding methods proposed in the prior art can alleviate this problem, they still suffer reduced accuracy from many-to-one mappings and increased complexity from the extra parameters they introduce, so such semantic segmentation models remain inefficient when processing high-resolution aerial images.
Disclosure of Invention
The application provides a semantic segmentation method, apparatus, device and storage medium for aerial images, which address the technical problem that prior-art methods suffer from poor accuracy and increased complexity, and therefore segment the semantics of aerial images inefficiently.
In view of the above, a first aspect of the present application provides a semantic segmentation method for an aerial image, including:
acquiring a preset aerial image sequence based on the aerial image of the unmanned aerial vehicle;
encoding the preset aerial image sequence through a preset encoder in a preset semantic segmentation model to obtain an aerial coding sequence;
the preset semantic segmentation model comprises shallow skip connections and deep dilated residual connections, and the preset encoder comprises a 2D position attention mechanism and a multi-head self-attention mechanism;
and decoding the aerial coding sequence by a preset decoder in the preset semantic segmentation model to obtain a semantic segmentation result.
Preferably, the acquiring of the preset aerial image sequence based on the aerial image of the unmanned aerial vehicle includes:
acquiring an aerial image of the unmanned aerial vehicle through the unmanned aerial vehicle;
uniformly splitting the unmanned aerial vehicle aerial image into N equal parts to obtain a plurality of image blocks, wherein N is a positive integer;
and expanding the image blocks row by row at the pixel level, then arranging them into a one-dimensional sequence to obtain the preset aerial image sequence.
Preferably, before the preset aerial image sequence is encoded by the preset encoder in the preset semantic segmentation model to obtain the aerial coding sequence, the method further includes:
constructing a Transformer network layer according to a preset feature fusion network based on a multi-head self-attention mechanism and a 2D position attention mechanism;
generating a preset encoder by serially connecting a plurality of the Transformer network layers;
and connecting the preset encoder and a preset decoder by a preset connection structure to obtain a preset semantic segmentation model, wherein the preset connection structure comprises shallow skip connections and deep dilated residual connections.
Preferably, after the preset encoder and the preset decoder are connected by the preset connection structure to obtain the preset semantic segmentation model, the method further includes:
and performing fine tuning training on the preset semantic segmentation model by adopting a preset fine tuning training image set to realize model parameter optimization.
The second aspect of the present application provides a semantic segmentation apparatus for an aerial image, including:
the acquisition module is used for acquiring a preset aerial image sequence based on the aerial image of the unmanned aerial vehicle;
the encoding module is used for encoding the preset aerial image sequence through a preset encoder in a preset semantic segmentation model to obtain an aerial coding sequence;
the preset semantic segmentation model comprises shallow skip connections and deep dilated residual connections, and the preset encoder comprises a 2D position attention mechanism and a multi-head self-attention mechanism;
and the decoding module is used for decoding the aerial coding sequence by a preset decoder in the preset semantic segmentation model to obtain a semantic segmentation result.
Preferably, the obtaining module is specifically configured to:
acquiring an aerial image of the unmanned aerial vehicle through the unmanned aerial vehicle;
uniformly splitting the unmanned aerial vehicle aerial image into N equal parts to obtain a plurality of image blocks, wherein N is a positive integer;
and expanding the image blocks row by row at the pixel level, then arranging them into a one-dimensional sequence to obtain the preset aerial image sequence.
Preferably, the method further comprises the following steps:
the building module is used for building a Transformer network layer according to a preset feature fusion network based on a multi-head self-attention mechanism and a 2D position attention mechanism;
the generating module is used for generating a preset encoder by serially connecting a plurality of the Transformer network layers;
and the connection module is used for connecting the preset encoder and a preset decoder by a preset connection structure to obtain a preset semantic segmentation model, wherein the preset connection structure comprises shallow skip connections and deep dilated residual connections.
Preferably, the method further comprises the following steps:
and the fine tuning module is used for performing fine tuning training on the preset semantic segmentation model by adopting a preset fine tuning training image set to realize model parameter optimization.
A third aspect of the application provides a semantic segmentation apparatus for an aerial image, the apparatus comprising a processor and a memory;
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the method for semantic segmentation of an aerial image according to the first aspect according to instructions in the program code.
A fourth aspect of the present application provides a computer-readable storage medium for storing program code for performing the method for semantic segmentation of an aerial image according to the first aspect.
According to the technical scheme, the embodiment of the application has the following advantages:
the application provides a semantic segmentation method for an aerial image, which comprises the following steps: acquiring a preset aerial image sequence based on the aerial image of the unmanned aerial vehicle; carrying out coding operation on a preset aerial image sequence through a preset coder in a preset semantic segmentation model to obtain an aerial coding sequence; the preset semantic segmentation model comprises shallow jump connection and deep cavity residual connection, and the preset encoder comprises a 2D position attention mechanism and a multi-head self-attention mechanism; and adopting a preset decoder in a preset semantic segmentation model to decode the aerial photography coding sequence to obtain a semantic segmentation result.
According to the method for segmenting the semantics of the aerial images, the relative position information of the images is recorded by adopting the encoder which integrates the 2D position attention mechanism and the multi-head self-attention mechanism, the capacity of a model for capturing spatial information is improved, and the effective receptive field of a deep characteristic map can be improved by introducing hole residual connection into a deep network; excessive parameters are not introduced into the whole model, so that the complexity of the algorithm can be prevented from being deepened; and the network layer in the model is improved in pertinence according to the image characteristics, so that the accuracy of the segmentation result can be improved. Therefore, the method and the device can solve the technical problems that in the prior art, accuracy is poor, complexity is improved, and accordingly the semantic segmentation efficiency of the aerial image is poor.
Drawings
Fig. 1 is a schematic flowchart of a semantic segmentation method for an aerial image according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a semantic segmentation apparatus for an aerial image according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a preset aerial image sequence conversion process provided in an embodiment of the present application;
fig. 4 is a schematic diagram of a preset semantic segmentation model framework provided in the embodiment of the present application;
FIG. 5 is a schematic diagram of an attention mechanism network in the preset encoder according to an embodiment of the present disclosure;
fig. 6 is a schematic diagram of a network structure of deep hole residual connection according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Interpretation of terms:
transformer: is a deep learning model based on a self-attention mechanism completely. More precisely, the Transformer consists of and consists only of multi-head self-attack and Feed Forward Neural Network. Which was originally proposed in the field of natural language processing to process 1-dimensional word sequences. And then introduced into the field of computer vision to process 2D picture input due to its powerful ability to capture global semantic interactions.
Tokens: for natural semantic processing (NLP) tasks, the input of a Transformer is a 1-dimensional word sequence, so that one word vector is a token; for Computer Vision (CV) tasks, an input picture needs to be cut into equal-sized image blocks, expanded line by line, arranged into 1-dimensional image block sequence, and then sent to a Transformer for training. So an image block is a token.
Receptive field: the area where the input image can be seen by the convolutional neural network features is defined, in other words, the feature output is affected by the pixel points in the receptive field area.
Semantic segmentation: each pixel in the picture is assigned a label of its belonging category, the labels of each category being distinguished by a different color.
For easy understanding, please refer to fig. 1, an embodiment of a semantic segmentation method for an aerial image provided by the present application includes:
Step 101, a preset aerial image sequence is obtained based on an aerial image of the unmanned aerial vehicle.
Further, step 101 includes:
acquiring an aerial image of the unmanned aerial vehicle through the unmanned aerial vehicle;
uniformly splitting the unmanned aerial vehicle aerial image into N equal parts to obtain a plurality of image blocks, wherein N is a positive integer;
and expanding the image blocks row by row at the pixel level, then arranging them into a one-dimensional sequence to obtain the preset aerial image sequence.
The scenes in unmanned aerial vehicle aerial images are diverse and their environments complex, so various preprocessing operations can be applied to the images after acquisition to improve their quality in different respects and facilitate subsequent image processing.
It can be understood that the image blocks obtained by the N-fold split are two-dimensional data, while the Transformer network takes a 1-dimensional sequence as model input, so the image blocks must be serialized: each block is expanded pixel by pixel, row by row, and the results are arranged into a 1-dimensional sequence (see fig. 3); alternatively, the blocks can be expanded column by column and then arranged into a 1-dimensional sequence to obtain the preset aerial image sequence.
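As a minimal sketch of this patch-and-flatten step (the 16-pixel patch size, tensor shapes and function name are illustrative assumptions, not values fixed by the application):

```python
import torch

def image_to_patch_sequence(img: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Split an image (C, H, W) into N equal blocks and expand each block
    pixel by pixel, row by row, into a 1-D token sequence (N, patch*patch*C)."""
    c, h, w = img.shape
    assert h % patch == 0 and w % patch == 0, "image must split into N equal blocks"
    blocks = img.unfold(1, patch, patch).unfold(2, patch, patch)  # (C, H/p, W/p, p, p)
    blocks = blocks.permute(1, 2, 0, 3, 4).contiguous()           # blocks in raster order
    return blocks.view(-1, c * patch * patch)                     # row-by-row expansion

tokens = image_to_patch_sequence(torch.rand(3, 512, 512))         # N = 1024 tokens
```

Expanding column by column, the alternative mentioned above, would simply transpose each block before flattening.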
Step 102, encoding the preset aerial image sequence through a preset encoder in a preset semantic segmentation model to obtain an aerial coding sequence.
The preset semantic segmentation model comprises shallow skip connections and deep dilated residual connections, and the preset encoder comprises a 2D position attention mechanism and a multi-head self-attention mechanism.
The preset semantic segmentation model is composed mainly of the preset encoder, the preset decoder, the shallow skip connections and the deep dilated residual connections. The preset encoder fuses a 2D position attention mechanism with a multi-head self-attention mechanism, so it can capture both global semantic information and the 2D relative position information between image blocks, improving the spatial expressiveness of the image features. Moreover, the effective range of the 2D position attention mechanism adopted in this embodiment can be adjusted manually to suit the feature maps of different stages; the attention range of the position information can thus be tuned to the characteristics of the feature map at each stage, which is more flexible and reliable and improves the accuracy of image processing. The preset decoder is matched to the preset encoder and decodes with a step-by-step upsampling mechanism to realize semantic segmentation of the image. The deep dilated residual connections enlarge the effective receptive field of the deep feature maps, capturing global information better.
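At a high level, the wiring just described might be sketched as follows; module internals are elided, the four-stage layout follows this embodiment, and every name here is a placeholder rather than an identifier from the application:

```python
import torch.nn as nn

class AerialSegmenter(nn.Module):
    """Encoder stages feed the decoder along two paths: shallow stages via
    skip connections, the deepest stage through dilated residual blocks."""
    def __init__(self, stages: nn.ModuleList, dilated: nn.Module, decoder: nn.Module):
        super().__init__()
        self.stages, self.dilated, self.decoder = stages, dilated, decoder

    def forward(self, x):
        feats = []
        for stage in self.stages:               # e.g. 4 Transformer stages
            x = stage(x)
            feats.append(x)                     # kept for the skip connections
        feats[-1] = self.dilated(feats[-1])     # enlarge the deep receptive field
        return self.decoder(feats)              # step-by-step upsampling decoder
```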
Further, before step 102, the method further comprises:
constructing a Transformer network layer according to a preset feature fusion network based on a multi-head self-attention mechanism and a 2D position attention mechanism;
generating a preset encoder by serially connecting a plurality of the Transformer network layers;
and connecting the preset encoder and a preset decoder by a preset connection structure to obtain a preset semantic segmentation model, wherein the preset connection structure comprises shallow skip connections and deep dilated residual connections.
Further, after the preset encoder and the preset decoder are connected by the preset connection structure to obtain the preset semantic segmentation model, the method further comprises:
and performing fine tuning training on the preset semantic segmentation model by adopting a preset fine tuning training image set to realize model parameter optimization.
Referring to fig. 4, the preset encoder is composed of Transformer blocks. Each Transformer block in this embodiment comprises 2 consecutive Transformer network layers and an overlapped fusion module, and each Transformer network layer embeds both a multi-head self-attention mechanism and a 2D position attention mechanism (see fig. 5). The results of the two attention mechanisms are integrated as a weighted sum, which the fusion module implements; the fusion module also performs a downsampling operation to reduce the length of the block sequence.
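One way the weighted-sum integration might look, assuming two trainable scalar weights (a sketch of the idea, not the application's exact fusion module):

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Combine the multi-head self-attention output and the 2D position
    attention output as a weighted sum with trainable weights."""
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2))   # one weight per attention branch

    def forward(self, sa_out: torch.Tensor, pa_out: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.w, dim=0)       # keep the two weights normalized
        return w[0] * sa_out + w[1] * pa_out
```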
Besides the original shallow skip connections, the connection between the preset encoder and the preset decoder includes the deep dilated residual connections; the former supply richer shallow detail features, while the latter enlarge the receptive field of the deep features, and together they improve the network's ability to express image features.
Specifically, referring to fig. 5, N represents the sequence length, C represents the number of channels, R represents a reduction factor, and a trainable weight is also marked in the figure. The query (q), key (k) and value (v) are obtained by linearly projecting the token sequence; to reduce the computation and parameter count of the model, a convolution performs a sequence reduction operation on k and v. Semantic attention is then calculated from the following formula:

$$\mathrm{SA} = \mathrm{softmax}\!\left(\frac{q k^{T}}{\sqrt{d}}\right) v$$

where SA is the semantic attention; the multi-head self-attention matrix softmax(qkᵀ/√d) acts as a mask matrix all of whose elements take values between 0 and 1; softmax is the normalization function; q, k and v, the results of the multi-head self-attention linear projection, are three vectors of the same dimension obtained by projecting the same image sequence; and d is a scaling factor.
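A sketch of this semantic-attention branch follows, with a stride-R convolution shortening k and v; the single-head form and the layer names are simplifying assumptions:

```python
import math
import torch
import torch.nn as nn

class SemanticAttention(nn.Module):
    """SA = softmax(q k^T / sqrt(d)) v, with k and v shortened by a stride-R
    convolution to cut computation on long token sequences."""
    def __init__(self, dim: int, reduction: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.reduce = nn.Conv2d(dim, dim, kernel_size=reduction, stride=reduction)
        self.scale = 1.0 / math.sqrt(dim)

    def forward(self, x: torch.Tensor, hw: tuple) -> torch.Tensor:
        b, n, c = x.shape
        h, w = hw
        q = self.q(x)                                    # (B, N, C)
        xr = x.transpose(1, 2).reshape(b, c, h, w)       # back to the 2D grid
        xr = self.reduce(xr).flatten(2).transpose(1, 2)  # reduced token sequence
        k, v = self.kv(xr).chunk(2, dim=-1)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v                                  # (B, N, C)
```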
In addition, in the network layer of the 2D position attention mechanism, the coordinate matrix of the image sequence tokens in 2D space is obtained, and the Euclidean distance between the coordinates of point i and point j is then calculated:

$$D_{ij} = \left\| P_i - P'_j \right\|_2 = \sqrt{\left(x_i - x_j\right)^2 + \left(y_i - y_j\right)^2}$$

where P_i is the coordinate of point i and P'_j is the coordinate of point j. Since the tokens surrounding a given token matter more to it than distant ones, the correspondence between relative distances is mapped by a Gaussian function:

$$G_{ij} = \exp\!\left(-\frac{\hat{D}_{ij}^{2}}{2\sigma^{2}}\right)$$

where D̂ is the distance sequence scaled according to the sequence reduction factor R; the 4 Transformer modules in this embodiment can be set, from shallow to deep, to factor values of 8, 4, 2 and 1. P and P' are respectively the 2D position coordinate matrices of the image sequence before and after compression of the sequence length; the former is of size N × 2, and the latter has its length reduced by the factor R. D̂ is the input of the Gaussian function, σ is the standard deviation of the Gaussian function, and the mean of the Gaussian function is 0 in this example.

The 2D position attention weights can then be calculated at the softmax network layer, and an attention-weighted sum is computed based on these weights:

$$\mathrm{PA} = \mathrm{softmax}(G), \qquad \mathrm{out} = \mathrm{PA}\,v$$

where PA is the position attention and out is the attention-weighted sum.
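Under the same reading, the 2D position attention branch can be sketched as below; how the reduced coordinate grid is formed and how σ enters are assumptions consistent with the definitions above:

```python
import torch

def position_attention(v: torch.Tensor, hw: tuple, reduction: int,
                       sigma: float) -> torch.Tensor:
    """out = softmax(G) v with G = exp(-D^2 / (2 sigma^2)), where D holds the
    Euclidean distances between the token coordinates P and the coordinates P'
    of the length-compressed sequence."""
    h, w = hw
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    p = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float()             # P: (N, 2)
    p_red = p.reshape(h, w, 2)[::reduction, ::reduction].reshape(-1, 2)  # P'
    d = torch.cdist(p, p_red)                    # pairwise Euclidean distances
    g = torch.exp(-d.pow(2) / (2 * sigma ** 2))  # Gaussian mapping, mean 0
    pa = torch.softmax(g, dim=-1)                # 2D position attention weights
    return pa @ v                                # weighted sum; v: (reduced N, C)
```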
Since the shallow skip connection is a common structure in encoder-decoder networks, in which the feature map of a given encoder stage is passed directly to the corresponding decoder stage to supply the detail information lost through repeated downsampling, it is not described further here. Referring to fig. 6, BN denotes batch normalization, ReLU is the activation function, and rate is the dilation rate of the dilated convolution. Two consecutive dilated convolution layers enlarge the receptive field while a residual connection retains the original feature map; this is the working principle of the deep dilated residual connection structure, with the feature map understood to enter at the bottom of fig. 6 and exit at the top.
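A sketch of the dilated residual structure of fig. 6, assuming 3×3, channel-preserving convolutions (the kernel size is an illustrative choice):

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """Two consecutive dilated convolution layers (Conv -> BN -> ReLU) enlarge
    the effective receptive field; the residual branch keeps the original
    feature map, which enters at the bottom and exits at the top of fig. 6."""
    def __init__(self, channels: int, rate: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=rate, dilation=rate),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=rate, dilation=rate),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.body(x) + x)  # residual keeps the original features
```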
The fine-tuning training image set also consists of aerial images; such a data set is generally small and is used only to fine-tune a pre-trained model. In the preset semantic segmentation model of this embodiment, the 2D position attention mechanism, the dilated residual connections and the decoder are not pre-trained, while the structures retained from the original Transformer, such as the multi-head self-attention mechanism and the shallow skip connection structure, are pre-trained. Fine-tuning the preset semantic segmentation model optimizes individual parameters in the model and improves its performance, such as accuracy and reliability.
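Fine-tuning on a small aerial set could then proceed along these lines; the loss, optimizer and learning rate are ordinary choices assumed for illustration, not values specified by the application:

```python
import torch
import torch.nn as nn

def finetune(model: nn.Module, loader, epochs: int = 20, lr: float = 6e-5):
    """Fine-tune the whole segmentation model (pre-trained Transformer parts
    plus the newly initialized position attention, dilated residual
    connections and decoder) with a small learning rate."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, masks in loader:          # masks: per-pixel class labels
            optimizer.zero_grad()
            loss = criterion(model(images), masks)
            loss.backward()
            optimizer.step()
```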
Step 103, decoding the aerial coding sequence by a preset decoder in the preset semantic segmentation model to obtain a semantic segmentation result.
In the semantic segmentation method for aerial images provided above, an encoder that fuses a 2D position attention mechanism with a multi-head self-attention mechanism records the relative position information of the image, improving the model's ability to capture spatial information, while the dilated residual connections introduced in the deep network enlarge the effective receptive field of the deep feature maps. The model as a whole introduces few extra parameters, avoiding additional algorithmic complexity, and its network layers are tailored to the characteristics of the images, improving the accuracy of the segmentation result. The method thereby resolves the poor accuracy, increased complexity, and consequently low semantic segmentation efficiency of prior-art processing of aerial images.
For ease of understanding, referring to fig. 2, the present application provides an embodiment of a semantic segmentation apparatus for aerial images, including:
an obtaining module 201, configured to obtain a preset aerial image sequence based on an aerial image of an unmanned aerial vehicle;
the encoding module 202 is configured to encode the preset aerial image sequence through a preset encoder in a preset semantic segmentation model to obtain an aerial coding sequence;
the preset semantic segmentation model comprises shallow skip connections and deep dilated residual connections, and the preset encoder comprises a 2D position attention mechanism and a multi-head self-attention mechanism;
and the decoding module 203 is configured to decode the aerial coding sequence by a preset decoder in the preset semantic segmentation model to obtain a semantic segmentation result.
Further, the obtaining module 201 is specifically configured to:
acquiring an aerial image of the unmanned aerial vehicle through the unmanned aerial vehicle;
uniformly splitting the unmanned aerial vehicle aerial image into N equal parts to obtain a plurality of image blocks, wherein N is a positive integer;
and expanding the image blocks row by row at the pixel level, then arranging them into a one-dimensional sequence to obtain the preset aerial image sequence.
Further, still include:
the building module 204 is used for building a Transformer network layer according to a preset feature fusion network based on a multi-head self-attention mechanism and a 2D position attention mechanism;
a generating module 205, configured to generate a preset encoder through a plurality of transform network layers in series;
the connection module 206 is configured to connect the preset encoder and the preset decoder by using a preset connection structure to obtain a preset semantic segmentation model, where the preset connection structure includes a shallow layer jump connection and a deep layer cavity residual error connection.
Further, still include:
and the fine tuning module 207 is configured to perform fine tuning training on the preset semantic segmentation model by using a preset fine tuning training image set, so as to optimize model parameters.
The application also provides semantic segmentation equipment for the aerial image, wherein the equipment comprises a processor and a memory;
the memory is used for storing the program codes and transmitting the program codes to the processor;
the processor is used for executing the semantic segmentation method of the aerial image in the above method embodiment according to the instructions in the program code.
The present application further provides a computer-readable storage medium for storing program code for performing the method for semantic segmentation of an aerial image in the above-described method embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application, or portions or all or portions of the technical solutions that contribute to the prior art, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for executing all or part of the steps of the methods described in the embodiments of the present application through a computer device (which may be a personal computer, a server, or a network device). And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (6)

1. A semantic segmentation method for aerial images is characterized by comprising the following steps:
acquiring a preset aerial image sequence based on an unmanned aerial vehicle aerial image, the acquisition process comprising:
acquiring the aerial image of the unmanned aerial vehicle through the unmanned aerial vehicle;
uniformly splitting the unmanned aerial vehicle aerial image into N equal parts to obtain a plurality of image blocks, wherein N is a positive integer;
expanding the image blocks row by row at the pixel level, then arranging them into a one-dimensional sequence to obtain the preset aerial image sequence;
based on a multi-head self-attention mechanism and a 2D position attention mechanism, constructing a Transformer network layer according to a preset feature fusion network, specifically:

calculating the semantic attention based on the multi-head self-attention mechanism according to the following formula:

$$\mathrm{SA} = \mathrm{softmax}\!\left(\frac{q k^{T}}{\sqrt{d}}\right) v$$

wherein SA is the semantic attention; the multi-head self-attention matrix softmax(qkᵀ/√d) is a mask matrix all of whose elements take values of 0-1; softmax is the normalization function; q, k and v, the query, key and value obtained by the multi-head self-attention linear projection, are three vectors of the same dimension obtained by projecting the same image sequence; and d is a scaling factor;

in the network layer of the 2D position attention mechanism, obtaining a coordinate matrix of the image sequence tokens in 2D space, and then calculating the Euclidean distance between the coordinates of point i and point j based on the following formula:

$$D_{ij} = \left\| P_i - P'_j \right\|_2 = \sqrt{\left(x_i - x_j\right)^2 + \left(y_i - y_j\right)^2}$$

wherein P_i is the coordinate of point i and P'_j is the coordinate of point j; mapping the correspondence between the relative distances of two points by a Gaussian function:

$$G_{ij} = \exp\!\left(-\frac{\hat{D}_{ij}^{2}}{2\sigma^{2}}\right)$$

wherein D̂ is the distance sequence scaled according to the sequence reduction factor R; P and P' are respectively the 2D position coordinate matrices of the image sequence before and after compression of the sequence length, the former of size N × 2 and the latter with its length reduced by the factor R; D̂ is the input of the Gaussian function, σ is the standard deviation of the Gaussian function, and the mean of the Gaussian function is 0;

computing the 2D position attention weights and an attention-weighted sum based on those weights:

$$\mathrm{PA} = \mathrm{softmax}(G), \qquad \mathrm{out} = \mathrm{PA}\,v$$

wherein PA is the position attention and out is the attention-weighted sum;
generating a preset encoder by serially connecting a plurality of the Transformer network layers;
connecting the preset encoder and a preset decoder by a preset connection structure to obtain a preset semantic segmentation model, wherein the preset connection structure comprises a shallow skip connection and a deep dilated residual connection;
encoding the preset aerial image sequence through the preset encoder in the preset semantic segmentation model to obtain an aerial coding sequence;
the preset encoder comprises the 2D position attention mechanism and the multi-head self-attention mechanism;
and decoding the aerial coding sequence by the preset decoder in the preset semantic segmentation model to obtain a semantic segmentation result.
2. The method for semantic segmentation of aerial images according to claim 1, wherein after the preset encoder and the preset decoder are connected by the preset connection structure to obtain the preset semantic segmentation model, the method further comprises:
and performing fine tuning training on the preset semantic segmentation model by adopting a preset fine tuning training image set to realize model parameter optimization.
3. An apparatus for semantic segmentation of an aerial image, comprising:
the acquisition module is used for acquiring a preset aerial image sequence based on an aerial image of the unmanned aerial vehicle, and the acquisition module is specifically used for:
acquiring an aerial image of the unmanned aerial vehicle through the unmanned aerial vehicle;
uniformly splitting the unmanned aerial vehicle aerial image into N equal parts to obtain a plurality of image blocks, wherein N is a positive integer;
expanding the image blocks row by row at the pixel level, then arranging them into a one-dimensional sequence to obtain the preset aerial image sequence;
the building module is used for building a Transformer network layer according to a preset feature fusion network based on a multi-head self-attention mechanism and a 2D position attention mechanism, specifically:

calculating the semantic attention based on the multi-head self-attention mechanism according to the following formula:

$$\mathrm{SA} = \mathrm{softmax}\!\left(\frac{q k^{T}}{\sqrt{d}}\right) v$$

wherein SA is the semantic attention; the multi-head self-attention matrix softmax(qkᵀ/√d) is a mask matrix all of whose elements take values of 0-1; softmax is the normalization function; q, k and v, the query, key and value obtained by the multi-head self-attention linear projection, are three vectors of the same dimension obtained by projecting the same image sequence; and d is a scaling factor;

in the network layer of the 2D position attention mechanism, obtaining a coordinate matrix of the image sequence tokens in 2D space, and then calculating the Euclidean distance between the coordinates of point i and point j based on the following formula:

$$D_{ij} = \left\| P_i - P'_j \right\|_2 = \sqrt{\left(x_i - x_j\right)^2 + \left(y_i - y_j\right)^2}$$

wherein P_i is the coordinate of point i and P'_j is the coordinate of point j; mapping the correspondence between the relative distances of two points by a Gaussian function:

$$G_{ij} = \exp\!\left(-\frac{\hat{D}_{ij}^{2}}{2\sigma^{2}}\right)$$

wherein D̂ is the distance sequence scaled according to the sequence reduction factor R; P and P' are respectively the 2D position coordinate matrices of the image sequence before and after compression of the sequence length, the former of size N × 2 and the latter with its length reduced by the factor R; D̂ is the input of the Gaussian function, σ is the standard deviation of the Gaussian function, and the mean of the Gaussian function is 0;

computing the 2D position attention weights and an attention-weighted sum based on those weights:

$$\mathrm{PA} = \mathrm{softmax}(G), \qquad \mathrm{out} = \mathrm{PA}\,v$$

wherein PA is the position attention and out is the attention-weighted sum;
the generating module is used for generating a preset encoder by serially connecting a plurality of the Transformer network layers;
the connection module is used for connecting the preset encoder and a preset decoder by a preset connection structure to obtain a preset semantic segmentation model, wherein the preset connection structure comprises a shallow skip connection and a deep dilated residual connection;
the encoding module is used for encoding the preset aerial image sequence through the preset encoder in the preset semantic segmentation model to obtain an aerial coding sequence;
the preset encoder comprises the 2D position attention mechanism and the multi-head self-attention mechanism;
and the decoding module is used for decoding the aerial coding sequence by the preset decoder in the preset semantic segmentation model to obtain a semantic segmentation result.
4. The apparatus for semantic segmentation of an aerial image according to claim 3, further comprising:
and the fine tuning module is used for performing fine tuning training on the preset semantic segmentation model by adopting a preset fine tuning training image set to realize model parameter optimization.
5. An apparatus for semantic segmentation of an aerial image, the apparatus comprising a processor and a memory;
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of semantic segmentation of an aerial image of any of claims 1-2 according to instructions in the program code.
6. A computer-readable storage medium for storing program code for performing the method for semantic segmentation of aerial images according to any one of claims 1-2.
CN202211359202.6A 2022-11-02 2022-11-02 Semantic segmentation method, device and equipment for aerial image and storage medium Active CN115471765B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211359202.6A CN115471765B (en) 2022-11-02 2022-11-02 Semantic segmentation method, device and equipment for aerial image and storage medium

Publications (2)

Publication Number Publication Date
CN115471765A (en) 2022-12-13
CN115471765B (en) 2023-04-07

Family

ID=84337564

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211359202.6A Active CN115471765B (en) 2022-11-02 2022-11-02 Semantic segmentation method, device and equipment for aerial image and storage medium

Country Status (1)

Country Link
CN (1) CN115471765B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021251886A1 (en) * 2020-06-09 2021-12-16 Telefonaktiebolaget Lm Ericsson (Publ) Providing semantic information with encoded image data
CN114648535A (en) * 2022-03-21 2022-06-21 北京工商大学 Food image segmentation method and system based on dynamic transform
CN114998361A (en) * 2022-06-07 2022-09-02 山西云时代智慧城市技术发展有限公司 Agricultural land cover spatio-temporal semantic segmentation method based on transformations-MulMLA

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11321863B2 (en) * 2019-09-23 2022-05-03 Toyota Research Institute, Inc. Systems and methods for depth estimation using semantic features
CN111259898B (en) * 2020-01-08 2023-03-24 西安电子科技大学 Crop segmentation method based on unmanned aerial vehicle aerial image
CN111739035B (en) * 2020-06-30 2022-09-30 腾讯科技(深圳)有限公司 Image processing method, device and equipment based on artificial intelligence and storage medium
CN112396613A (en) * 2020-11-17 2021-02-23 平安科技(深圳)有限公司 Image segmentation method and device, computer equipment and storage medium
CN112560501B (en) * 2020-12-25 2022-02-25 北京百度网讯科技有限公司 Semantic feature generation method, model training method, device, equipment and medium
CN114821058A (en) * 2022-04-28 2022-07-29 济南博观智能科技有限公司 Image semantic segmentation method and device, electronic equipment and storage medium
CN115115835A (en) * 2022-06-16 2022-09-27 腾讯科技(深圳)有限公司 Image semantic segmentation method, device, equipment, storage medium and program product


Also Published As

Publication number Publication date
CN115471765A (en) 2022-12-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Cheng Yu; Li Xinyu; Fang Yi; Wen Long
Inventor before: Li Xinyu; Cheng Yu; Fang Yi; Wen Long