CN116452813B - Image processing method, system, equipment and medium based on space and semantic information - Google Patents

Image processing method, system, equipment and medium based on space and semantic information

Info

Publication number
CN116452813B
CN116452813B (application CN202310698749.7A)
Authority
CN
China
Prior art keywords
image
information
semantic information
semantic
extracting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310698749.7A
Other languages
Chinese (zh)
Other versions
CN116452813A (en)
Inventor
韩军
马梦圆
黄惠玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Quanzhou Institute of Equipment Manufacturing
Original Assignee
Quanzhou Institute of Equipment Manufacturing
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Quanzhou Institute of Equipment Manufacturing filed Critical Quanzhou Institute of Equipment Manufacturing
Priority to CN202310698749.7A priority Critical patent/CN116452813B/en
Publication of CN116452813A publication Critical patent/CN116452813A/en
Application granted granted Critical
Publication of CN116452813B publication Critical patent/CN116452813B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/778Active pattern-learning, e.g. online learning of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The application belongs to the field of computer vision and provides an image processing method, system, device and medium based on spatial and semantic information. The image processing method based on spatial and semantic information comprises the following steps: acquiring a first image, and performing semantic extraction processing on the first image to obtain a second image; extracting semantic information from the second image, and adjusting the details of the first image with the semantic information to obtain a third image; and extracting spatial information from the third image, and performing semantic-information guiding processing on the second image with the spatial information to obtain a fourth image. Through mutual learning and optimization between shallow spatial information and deep semantic information, the application quickly and effectively reduces the noise of shallow features and then guides the deep features to reconstruct spatial information, so that segmentation accuracy is effectively improved and a balance between image processing speed and accuracy is achieved, without an additional auxiliary branch or a complex decoder.

Description

Image processing method, system, equipment and medium based on space and semantic information
Technical Field
The application belongs to the field of computer vision, and particularly relates to an image processing method, system, equipment and medium based on space and semantic information.
Background
Semantic segmentation is an important and widely used task in computer vision whose aim is to accurately predict the label of every pixel in an image. It is a key step toward visual scene understanding and is widely applied in autonomous driving, medical imaging, image generation and other fields.
Deep learning methods dominate the semantic segmentation field, and many representative network models have been proposed. However, these models either achieve high accuracy at a high computational cost or run fast but with low accuracy; likewise, shallow features carry rich detail but also considerable noise, while deep features carry strong semantic information but lose some spatial information.
Existing techniques therefore struggle to satisfy the accuracy and speed requirements of image processing at the same time.
Disclosure of Invention
Embodiments of the application aim to provide an image processing method based on spatial and semantic information, so as to solve the problem that the prior art cannot simultaneously meet the accuracy and speed requirements of image processing.
The embodiment of the application is realized in such a way that an image processing method based on space and semantic information comprises the following steps:
acquiring a first image, and performing semantic extraction processing on the first image to obtain a second image;
extracting semantic information from the second image, and carrying out semantic information detail adjustment on the first image by utilizing the semantic information to obtain a third image;
and extracting space information from the third image, and carrying out semantic information guiding processing on the second image by utilizing the space information to obtain a fourth image.
Another object of an embodiment of the present application is to provide an image processing system based on spatial and semantic information, the image processing system comprising:
the main network is used for acquiring a first image, and carrying out semantic extraction processing on the first image to obtain a second image;
the semantic adjustment detail module is used for extracting semantic information from the second image, and carrying out semantic information detail adjustment on the first image by utilizing the semantic information to obtain a third image;
the detail guiding semantic module is used for extracting space information from the third image, and conducting semantic information guiding processing on the second image by utilizing the space information to obtain a fourth image.
Another object of an embodiment of the application is to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the image processing method based on spatial and semantic information.
Another object of an embodiment of the present application is to provide a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, causing the processor to perform the steps of the image processing method based on spatial and semantic information.
According to the image processing method based on spatial and semantic information provided by the embodiment of the application, mutual learning and optimization between shallow spatial information and deep semantic information quickly and effectively reduces the noise of shallow features and then guides the deep features to reconstruct spatial information, so that segmentation accuracy is effectively improved and a balance between image processing speed and accuracy is achieved, without an additional auxiliary branch or a complex decoder.
Drawings
FIG. 1 is a flow diagram of a method of image processing based on spatial and semantic information provided in one embodiment;
FIG. 2 is a block diagram of the spatial detail and semantic information mutual optimization network (DSMONet) provided in one embodiment;
FIG. 3 is a block diagram of a Mutual Optimization Module (MOM) provided in one embodiment;
FIG. 4 is a graph comparing segmentation accuracy (mIoU) and inference speed (FPS) on a Cityscapes test set, under one embodiment;
FIG. 5 is a block diagram of an image processing system based on spatial and semantic information provided in one embodiment;
FIG. 6 is a block diagram of the internal architecture of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It will be understood that the terms "first," "second," and the like, as used herein, may describe various elements, but these elements are not limited by these terms unless otherwise specified. These terms are only used to distinguish one element from another. For example, a first element may be referred to as a second element, and similarly a second element may be referred to as a first element, without departing from the scope of this disclosure.
As shown in fig. 1, in one embodiment, an image processing method based on spatial and semantic information is provided, where the image processing method includes steps S102 to S106:
step S102, a first image is obtained, and semantic extraction processing is carried out on the first image to obtain a second image.
In this embodiment, the first image is a shallow feature map: its resolution is high and it contains rich spatial detail, but also relatively more noise. The second image is a deep feature map obtained by processing the shallow one; its resolution is relatively low and its semantic information is strong, but spatial information is missing.
Specifically, step S102 is developed in detail as steps S202 to S204:
step S202, obtaining an original image to be processed, and reducing the resolution of the original image by increasing the number of channels of the feature map to obtain the first image.
Step S204, performing feature extraction on the first image in a backbone network, and performing context aggregation to obtain the second image; the resolution of the second image is lower than the resolution of the first image.
As shown in fig. 2, the original image is the image to be processed; over four stages, its resolution is progressively reduced while the number of feature-map channels is increased, producing images 1-4 in turn. The original image and images 1-4 form the backbone network. The backbone selects the lightweight STDCNet, which has 5 stages, each with stride 2; the number of feature-map channels increases while the resolution is reduced to 1/32 of the input image. To obtain features containing global context information, a DAPPM module is appended after the backbone to further extract context information from the low-resolution feature map, yielding the second image. In this embodiment, image 1 is taken as the first image to be optimized and image 4 as the second image; in practice, however, the scheme holds as long as the resolution of the first image is larger than that of the second image, so the first and second images are not limited to these particular backbone feature maps and only the resolution requirement must be satisfied.
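To make the backbone behaviour concrete, the following minimal PyTorch sketch shows how five stride-2 stages reduce the input resolution to 1/32 while increasing the channel count. It is an illustrative toy model only, not the STDCNet implementation; the channel widths (32 to 512) and the layer structure are assumptions.

import torch
import torch.nn as nn

class ConvBNReLU(nn.Sequential):
    # Basic Conv-BN-ReLU block used throughout the sketch.
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__(
            nn.Conv2d(c_in, c_out, k, stride=s, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

class ToyBackbone(nn.Module):
    # Five stages, each with stride 2: output resolution is 1/32 of the input.
    # Channel widths (32, 64, 128, 256, 512) are illustrative assumptions.
    def __init__(self):
        super().__init__()
        widths = [32, 64, 128, 256, 512]
        stages, c_prev = [], 3
        for c in widths:
            stages.append(ConvBNReLU(c_prev, c, k=3, s=2))
            c_prev = c
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        # feats[0] is a high-resolution shallow map (the "first image");
        # feats[-1] is the low-resolution deep map fed to context aggregation.
        return feats

feats = ToyBackbone()(torch.randn(1, 3, 1024, 1024))
print([f.shape[-1] for f in feats])   # 512, 256, 128, 64, 32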
Semantic segmentation extracts the semantic information of deep features, optimizes the extracted features, and then upsamples and outputs the result. Specifically, the semantic information of deep features is extracted through a backbone such as ResNet or STDC. In the segmentation head, a Conv-BN-ReLU operation reduces the number of feature channels to the number of categories, an upsampling operation expands the feature map to the input image size, and the label of each pixel is then predicted with an argmax operation. A cross-entropy loss with online hard example mining, denoted L_ohem, is used to optimize the model. Placing a semantic head at the output of the UAFM produces an additional semantic loss L_sem that better optimizes the whole network, and a BCE boundary loss L_bd is used to highlight boundary regions and enhance the features of small objects. The final loss is a weighted sum of these terms:
L_total = L_ohem + λ1·L_sem + λ2·L_bd
The weighting parameters λ1 and λ2 of the training loss of the spatial detail and semantic information mutual optimization network (DSMONet, Details and Semantic Mutual Optimization NET) are set empirically.
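The following sketch illustrates how the three loss terms described above can be combined during training. The OHEM keep ratio, the construction of the boundary ground truth, and the weighting coefficients lam_sem and lam_bd are placeholders, since the exact values are not reproduced in the text.

import torch
import torch.nn.functional as F

def ohem_cross_entropy(logits, labels, keep_ratio=0.25, ignore_index=255):
    # Cross-entropy with online hard example mining: keep only the hardest pixels.
    ce = F.cross_entropy(logits, labels, ignore_index=ignore_index, reduction="none")
    ce = ce.flatten()
    n_keep = max(1, int(keep_ratio * ce.numel()))
    hard, _ = torch.topk(ce, n_keep)
    return hard.mean()

def total_loss(main_logits, aux_logits, boundary_logits, labels, boundary_gt,
               lam_sem=1.0, lam_bd=1.0):
    # Main OHEM loss + auxiliary semantic loss + boundary BCE loss.
    l_main = ohem_cross_entropy(main_logits, labels)
    l_sem = F.cross_entropy(aux_logits, labels, ignore_index=255)
    l_bd = F.binary_cross_entropy_with_logits(boundary_logits, boundary_gt)
    return l_main + lam_sem * l_sem + lam_bd * l_bd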
Step S104, extracting semantic information from the second image, and carrying out semantic information detail adjustment on the first image by utilizing the semantic information to obtain a third image.
In this embodiment, as shown in fig. 2 and 3, the mutual optimization module (MOM, Mutual Optimization Module) takes two inputs: the feature map S obtained after DAPPM context aggregation (the second image) and the backbone output feature map D (the first image). S carries strong semantic information, while D carries spatial detail information. The core of the MOM is therefore the mutual optimization of S and D and, as shown in fig. 3, it is divided into two parts: one part optimizes the high-resolution feature map by filtering its noise with the edge information of the low-resolution feature map and an edge operator, and is implemented by the semantic adjustment details module (SADM, Semantic Adjustment Details Module); the other part uses the optimized spatial information to guide the deep features to reconstruct the lost spatial information, and is implemented by the details guide semantics module (DGSM, Details Guide Semantics Module).
Specifically, step S104 further includes steps S302 to S304:
step S302, decoupling the second image, and extracting a first edge feature from the second image.
Further, step S302 further includes steps S402-404:
step S402, decoupling the second image, and acquiring a subject feature of the second image through a stream-based body feature representation method.
Step S404, subtracting the body feature from the second image to obtain the first edge feature:
F_edge = S - F_body
where S is the feature map of the second image and F_body is its body feature.
In this embodiment, the SADM first decouples the feature map S, which carries strong semantic information. Following DecoupleSegNet, the second image is decoupled into a body feature F_body and a first edge feature F_edge that satisfy the formula above. The body feature F_body of the feature map S is obtained with a flow-based body feature representation method, and the first edge feature F_edge is then obtained by explicitly subtracting the body feature from S.
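A minimal sketch of this decoupling step is given below. The flow-based body representation of DecoupleSegNet is replaced here by a simple downsample-then-upsample low-pass operation, which is an illustrative stand-in rather than the actual flow module; the subtraction that yields the edge feature follows the formula above.

import torch
import torch.nn.functional as F

def decouple_body_edge(S, down_factor=4):
    # Approximate the flow-based body representation with a simple low-pass
    # operation: downsample then upsample the feature map.
    # This is an illustrative stand-in, not the DecoupleSegNet flow module.
    h, w = S.shape[-2:]
    body = F.interpolate(S, scale_factor=1.0 / down_factor, mode="bilinear",
                         align_corners=False)
    body = F.interpolate(body, size=(h, w), mode="bilinear", align_corners=False)
    # First edge feature: explicitly subtract the body from the original map.
    edge = S - body
    return body, edge

S = torch.randn(1, 128, 64, 64)
F_body, F_edge = decouple_body_edge(S)
print(F_body.shape, F_edge.shape)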
Step S304, extracting a second edge feature from the first image, and fusing the first edge feature and the second edge feature to obtain the third image.
Further, step S304 further includes steps S502-504:
step S502, optimizing the first image by using a laplace operator, and sampling the first image by using transpose convolution to obtain the second edge feature.
And step S504, carrying out feature fusion on the first edge feature and the second edge feature to obtain the third image.
In this embodiment, the high-resolution first image contains more detail information but also considerable noise; the edge information of the feature map is extracted with the Laplacian operator to obtain the second edge feature, enhancing the model's ability to capture detail. A 3×3 Laplacian kernel is selected for this purpose.
The kernel is incorporated into the network through a residual structure with a Laplacian convolution. The first edge feature F_edge of the second image is upsampled with transposed convolution so that the upsampled map has the same size as the optimized first image. The Laplacian-optimized second edge feature and the upsampled first edge feature are then concatenated, and feature fusion through Conv-BN-ReLU yields the optimized high-resolution third image. This process can be expressed as:
D' = γ( concat( Lap(D), Up(F_edge) ) )
where D' is the high-resolution third image, γ is the convolution (Conv-BN-ReLU) layer, concat(·) denotes the cascading (concatenation) operation, Lap(·) denotes edge information extraction with the Laplacian operator, and Up(·) denotes transposed-convolution upsampling.
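The following PyTorch sketch puts these SADM steps together: a residual Laplacian branch on the high-resolution detail map, transposed-convolution upsampling of the low-resolution edge feature, concatenation, and Conv-BN-ReLU fusion. The specific Laplacian kernel, channel widths and feature-map sizes are assumptions for illustration, not the patent's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SADM(nn.Module):
    # Semantic Adjustment of Details Module (sketch).
    def __init__(self, c_detail, c_edge):
        super().__init__()
        # Fixed 3x3 Laplacian kernel applied depthwise; the kernel choice is an
        # assumption, the patent's exact kernel is not reproduced here.
        lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
        self.register_buffer("lap_kernel",
                             lap.view(1, 1, 3, 3).repeat(c_detail, 1, 1, 1))
        self.c_detail = c_detail
        # Three stride-2 transposed convolutions instead of a single stride-8 one.
        ups, c = [], c_edge
        for _ in range(3):
            ups += [nn.ConvTranspose2d(c, c_detail, kernel_size=2, stride=2),
                    nn.BatchNorm2d(c_detail), nn.ReLU(inplace=True)]
            c = c_detail
        self.up = nn.Sequential(*ups)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * c_detail, c_detail, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_detail), nn.ReLU(inplace=True))

    def forward(self, D, F_edge):
        # Residual Laplacian branch: D + Lap(D) keeps details while sharpening edges.
        lap_D = D + F.conv2d(D, self.lap_kernel, padding=1, groups=self.c_detail)
        # Upsample the low-resolution edge feature to the detail resolution.
        edge_up = self.up(F_edge)
        # Concatenate and fuse with Conv-BN-ReLU to obtain the third image D'.
        return self.fuse(torch.cat([lap_D, edge_up], dim=1))

D = torch.randn(1, 64, 128, 128)      # high-resolution detail feature (assumed size)
F_edge = torch.randn(1, 128, 16, 16)  # low-resolution edge feature from the decoupling
sadm = SADM(c_detail=64, c_edge=128)
print(sadm(D, F_edge).shape)          # torch.Size([1, 64, 128, 128]) -> the "third image"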
In this embodiment, the core of DSMONet is the mutual optimization of the high-resolution and low-resolution feature maps, which involves many upsampling operations. Bilinear interpolation computes a new pixel value by distance-weighted averaging of the four neighboring pixels. It upsamples the feature map quickly, but its smoothing effect may cause loss of detail and blurred edges. Transposed convolution better preserves the details and edge information of the feature map, so this embodiment selects transposed convolution for the upsampling operation.
Further, the computation of each transposed convolution layer can be reduced by stacking several smaller transposed convolution layers. To reduce the amount of computation, this embodiment uses 3 transposed convolution layers, where the number of output channels and the convolution kernel size differ between layers. If 8-fold upsampling is achieved directly with a single transposed convolution, the kernel size is 8, the stride is 8, and the computation of the layer is FLOPs = 67108864·HW. If 8-fold upsampling is achieved with 3 transposed convolution layers, the kernel size of each layer is 2, the stride is 2, and the computation of each transposed convolution layer is 6815744·HW. The computational cost per layer is thus reduced by roughly a factor of 10 compared with directly using a single transposed convolution.
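To make the comparison concrete, the sketch below estimates the multiply-accumulate count of a transposed convolution as k²·C_in·C_out·H_out·W_out. With an assumed 128 input and 128 output channels, a single stride-8 layer reproduces the 67108864·HW figure quoted above; the per-layer figures printed for the three stride-2 layers are illustrative only, since the per-layer channel counts actually used differ and are not given here.

def transposed_conv_flops(k, c_in, c_out, h_out, w_out):
    # Rough multiply-accumulate count of one transposed convolution layer.
    return k * k * c_in * c_out * h_out * w_out

H = W = 1  # express results as multiples of H*W of the low-resolution input

# One-shot 8x upsampling: kernel 8, stride 8, output is 8H x 8W.
single = transposed_conv_flops(8, 128, 128, 8 * H, 8 * W)
print(single)  # 67108864, matching the 67108864*HW figure with 128-channel maps

# Three 2x layers: kernel 2, stride 2 each; per-layer cost assuming 128 channels
# throughout (the patent uses different channel widths per layer, so these
# illustrative values do not equal its quoted 6815744*HW figure).
per_layer = [transposed_conv_flops(2, 128, 128, s * H, s * W) for s in (2, 4, 8)]
print(per_layer)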
And S106, extracting space information from the third image, and carrying out semantic information guiding processing on the second image by utilizing the space information to obtain a fourth image.
In this embodiment, after the shallow detail features have been optimized by the deep semantic information, the optimized detail features are further used to guide the deep features to reconstruct the missing spatial information, i.e. the high-resolution third image is used to guide the construction of a fourth image carrying spatial detail information. Specifically, step S106 includes steps S602 to S604:
step S602, attention operation is adopted on the second image, decoupled main body characteristics are obtained from the second image, and the main body characteristics and the second image after channel attention and space attention processing are added to obtain a fifth image;
step S604, extracting the spatial information from the third image, and multi-scale fusing the spatial information and the fifth image to obtain the fourth image.
In this embodiment, to avoid losing semantic information during this process, additional attention is applied to the second image S before the spatial information is reconstructed, so as to strengthen the correlation between feature channels. As shown in fig. 3, the attention-processed second image is added to the body feature F_body decoupled in the SADM to obtain a fifth image S', which has stronger semantic information than S. The high-resolution third image D' and the fifth image S' then undergo multi-scale feature fusion and, following prior work, spatial attention is used to enhance the feature representation. The overall process can be expressed as:
S' = Att(S) + F_body
F_out = Fuse(D', S')
where S' denotes the fifth image, F_body the body feature decoupled in the SADM, Att(S) the second image after channel attention and spatial attention processing, D' the high-resolution third image, Fuse(·) the multi-scale fusion with spatial attention, and F_out the output fourth image.
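A sketch of the DGSM along these lines is shown below. The channel and spatial attention designs are simple stand-ins (an SE-style channel attention followed by a mean/max spatial attention), and bilinear upsampling is used only to keep the sketch short, whereas the embodiment above prefers transposed-convolution upsampling; channel widths and sizes are assumed.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSpatialAttention(nn.Module):
    # Simple channel (SE-style) then spatial (mean/max) attention; an
    # illustrative stand-in for the attention used in the embodiment.
    def __init__(self, c):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(c, c // 4), nn.ReLU(inplace=True),
                                nn.Linear(c // 4, c), nn.Sigmoid())
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3))).unsqueeze(-1).unsqueeze(-1)
        x = x * w
        s = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(dim=1, keepdim=True), x.max(dim=1, keepdim=True).values], dim=1)))
        return x * s

class DGSM(nn.Module):
    # Details Guide Semantics Module (sketch): the optimized detail feature D'
    # guides the semantic feature to reconstruct missing spatial information.
    def __init__(self, c_sem, c_detail):
        super().__init__()
        self.att = ChannelSpatialAttention(c_sem)
        self.proj = nn.Conv2d(c_sem, c_detail, 1, bias=False)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * c_detail, c_detail, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_detail), nn.ReLU(inplace=True))

    def forward(self, S, F_body, D_prime):
        # Fifth image: attention-processed S plus the body feature from the SADM.
        S5 = self.att(S) + F_body
        # Upsample to the detail resolution and fuse with the third image D'.
        S5 = F.interpolate(self.proj(S5), size=D_prime.shape[-2:],
                           mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([D_prime, S5], dim=1))

S = torch.randn(1, 128, 16, 16)         # second image (deep semantic feature)
F_body = torch.randn(1, 128, 16, 16)    # body feature decoupled in the SADM
D_prime = torch.randn(1, 64, 128, 128)  # third image from the SADM
print(DGSM(c_sem=128, c_detail=64)(S, F_body, D_prime).shape)  # the "fourth image"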
The embodiment of the application provides an image processing method based on spatial and semantic information that is applied within the mutual optimization module MOM. The MOM consists of two parts: one optimizes the high-resolution feature map by filtering its noise with the edge information of the low-resolution feature map and an edge operator (SADM); the other guides the deep features to reconstruct the missing spatial information using the optimized spatial information (DGSM). Through mutual learning and optimization between shallow spatial information and deep semantic information, the MOM quickly and effectively reduces the noise of shallow features and then guides the deep features to reconstruct spatial information, effectively improving segmentation accuracy and achieving a balance between image processing speed and accuracy.
In one embodiment, specific experimental details of the application are given, consisting of four parts:
1. data set of experiments
Cityscapes is a large urban street scene dataset. It contains 5000 finely annotated images and 20000 coarsely annotated images with an image resolution of 2048×1024. The finely annotated images are further divided into 2975, 500 and 1525 images for training, validation and testing, respectively. The annotations contain 30 classes, of which only 19 are used for semantic segmentation.
CamVid provides 701 driving scene images, divided into 367, 101 and 233 for training, validation and testing, respectively. The image resolution is 960×720. The annotations provide 32 categories; following the common setup, a subset of 11 categories is used in the experiments of the present application.
2. Implementation details
1) Training settings
The experiments use a warm-up strategy and a learning-rate scheduler to update the learning rate at each iteration, together with data augmentation techniques such as random scaling, random cropping, random horizontal flipping, random color jittering and normalization. Because the same backbone network is used, the pre-training weights provided by PP-LiteSeg [20] are adopted. For the Cityscapes dataset, a batch size of 16, a maximum of 160000 iterations, an initial learning rate of 0.005 and a weight decay of 5e-4 in the optimizer are used. For the CamVid dataset, a batch size of 24, a maximum of 1000 iterations, an initial learning rate of 0.01 and a weight decay of 1e-4 are used. The random scaling ranges for Cityscapes and CamVid are [0.125, 1.5] and [0.5, 2.5], respectively. The crop resolution is 1024×1024 for Cityscapes and 960×720 for CamVid. The network of the present application is implemented with PaddleSeg [30], and all training experiments are performed on an A100 GPU.
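As an illustrative sketch of a per-iteration learning-rate schedule with warm-up (the exact scheduler is not detailed in the text), the function below combines a linear warm-up phase with a polynomial decay; the warm-up length and decay power are assumed values.

def learning_rate(it, base_lr, max_iters, warmup_iters=1000, power=0.9):
    # Linear warm-up followed by polynomial decay, computed per iteration.
    if it < warmup_iters:
        return base_lr * (it + 1) / warmup_iters
    progress = (it - warmup_iters) / max(1, max_iters - warmup_iters)
    return base_lr * (1.0 - progress) ** power

# Example with the Cityscapes settings quoted above (base_lr=0.005, 160000 iterations).
print(learning_rate(0, 0.005, 160000))       # warm-up start
print(learning_rate(80000, 0.005, 160000))   # roughly half-way through training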
2) Inference settings
For a fair comparison, the model is exported to ONNX and executed with TensorRT. During inference, for both Cityscapes and CamVid, the inference model takes the original image as input at a resolution of 960×720. All inference experiments are performed in an environment consisting of an RTX 3090, CUDA 11.2, cuDNN 8.2 and TensorRT 8.1.3. For quantitative evaluation, segmentation accuracy is compared using the mean class-wise intersection-over-union (mIoU), and speed is compared using floating-point operations (FLOPs) and frames per second (FPS).
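For reference, a minimal sketch of the class-wise mIoU metric used in the accuracy comparison is given below; the confusion-matrix formulation is the standard one, not code from the patent.

import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    # Per-class IoU from a confusion matrix, averaged over classes (mIoU).
    mask = gt != ignore_index
    pred, gt = pred[mask], gt[mask]
    cm = np.bincount(num_classes * gt.astype(int) + pred.astype(int),
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    iou = inter / np.maximum(union, 1)
    return iou.mean()

pred = np.random.randint(0, 19, size=(1024, 2048))
gt = np.random.randint(0, 19, size=(1024, 2048))
print(mean_iou(pred, gt, num_classes=19))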
3. Comparison of experimental results with other most advanced methods
In this section, the network of the present application was tested on Cityscapes and CamVid and compared to the most advanced model, further demonstrating the semantic segmentation capabilities of DSMONet.
Table 1 comparison with the most advanced real-time method on the CamVid test set
Table 1 shows the comparison with other methods; the training and inference input resolution is 960×720, similar to other works. DSMONet reaches 76.1% mIoU at 94.3 FPS, a state-of-the-art tradeoff between performance and speed, which further demonstrates the superiority of the method of the application.
With the training and inference setup described above, DSMONet is compared with state-of-the-art models on the Cityscapes dataset. The present application proposes DSMONet-T and DSMONet-B, based on the two backbone versions STDC1 and STDC2. As shown in Table 2, the model information, resolution, segmentation accuracy and inference speed of the various methods are given.
Table 2 comparison with the most advanced real-time method on Cityscapes
Fig. 4 provides a visual comparison of segmentation accuracy and inference speed. The training and validation sets are used to train the model of the present application before the results are uploaded to the official benchmark server. The experimental evaluation shows that, compared with other methods, the proposed DSMONet achieves a state-of-the-art balance between accuracy and speed. DSMONet-T reaches 78.2% mIoU at 78.1 FPS, achieving higher accuracy at a similar inference speed. In addition, DSMONet-B achieves 80.5% mIoU, the best test-set accuracy in Table 2. Compared with DDRNet-23, DSMONet-B is 8.3 FPS faster and 1% higher in mIoU. In the visual segmentation results of DSMONet-B on the Cityscapes validation set, DSMONet captures details better than PP-LiteSeg and STDCNet.
4. Ablation experiments
This section introduces ablation experiments to verify the effectiveness of each component of the method of the present application. They cover the mutual optimization module, the additional losses and the additional training strategies. All experiments in this section evaluate DSMONet-B on the Cityscapes validation set. The baseline model is DSMONet-B without the proposed modules.
1) Efficient mutual optimization module
The mutual optimization module is mainly divided into the SADM and the DGSM. The deep semantic features are decoupled into a body feature F_body and an edge feature F_edge, and D_lap denotes the feature map after optimization with the Laplacian operator. The core component of the SADM is the fusion of F_edge and D_lap to form the optimized shallow feature. An attention mechanism Att is also added before the optimized detail features guide the deep features, to enhance the feature representation. To verify the effectiveness of the module, the application performs a split verification. With the proposed mutual optimization module, DSMONet-B achieves 80.5% mIoU and 44.4 FPS, an increase of 4.4% mIoU over the baseline model. The qualitative comparisons in Table 3 show that as F_edge, D_lap, F_body and Att are added in turn, the results become more consistent with reality, especially for small objects. After the proposed modules are added step by step, the model shows a significant improvement in its ability to capture details, and the vehicle information in the frame is also more complete. In summary, the proposed module is effective for semantic segmentation.
Table 3 ablation experiments of mutual optimization modules
Here, F_edge is the edge feature, D_lap is the edge information extracted with the Laplacian operator, F_body is the body feature, and Att is the attention mechanism.
2) Effective additional loss
According to the structure of DSMONet, additional losses are introduced to facilitate the optimization of the whole network. As can be seen from Table 4, the additional losses are necessary for DSMONet to achieve better performance; in particular, after the additional loss is added, the mIoU increases by 0.7%, which fully justifies the need for the additional loss, while online hard example mining (OHEM) further improves the accuracy.
Table 4 ablation experiments with additional losses and OHEM in DSMONet
3) Effective additional strategy
From the above analysis, to further balance speed and accuracy, the present application uses additional strategies: different upsampling modes and an additional attention mechanism. Ablation experiments are performed on whether bilinear interpolation or transposed convolution is selected for upsampling and on whether the additional attention mechanism is used. The results are shown in Table 5: bilinear interpolation upsampling achieves 79.3% mIoU and 50.7 FPS, while transposed-convolution upsampling achieves 80.5% mIoU and 44.4 FPS. For the lighter model DSMONet-T, the application selects transposed convolution, which improves accuracy effectively, raising the mIoU by 1.2%. To make full use of the feature fusion module, the application combines channel attention and spatial attention, reaching 80.5% mIoU and 44.4 FPS and further balancing speed and accuracy.
Table 5 ablation experiments with additional attention and upsampling methods in DSMONet
As shown in fig. 5, in one embodiment, there is provided an image processing system based on spatial and semantic information, the image processing system comprising:
the backbone network 100 is configured to obtain a first image, and perform semantic extraction processing on the first image to obtain a second image;
the semantic adjustment detail module 200 is configured to extract semantic information from the second image, and perform semantic information detail adjustment on the first image by using the semantic information to obtain a third image;
the detail guiding semantic module 300 is configured to extract spatial information from the third image, and perform semantic information guiding processing on the second image by using the spatial information to obtain a fourth image.
FIG. 6 illustrates an internal block diagram of a computer device in one embodiment. As shown in fig. 6, the computer device includes a processor, a memory, a network interface, an input device and a display screen connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the image processing method based on spatial and semantic information. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the image processing method based on spatial and semantic information. The display screen of the computer device may be a liquid crystal display or an electronic ink display; the input device may be a touch layer covering the display screen, keys, a trackball or a touchpad arranged on the housing of the computer device, or an external keyboard, touchpad, mouse, or the like.
It will be appreciated by those skilled in the art that the structure shown in FIG. 6 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is presented, the computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
step S102, a first image is obtained, semantic extraction processing is carried out on the first image, and a second image is obtained;
step S104, extracting semantic information from the second image, and carrying out semantic information detail adjustment on the first image by utilizing the semantic information to obtain a third image;
and S106, extracting space information from the third image, and carrying out semantic information guiding processing on the second image by utilizing the space information to obtain a fourth image.
In one embodiment, a computer readable storage medium is provided, having a computer program stored thereon, which when executed by a processor causes the processor to perform the steps of:
step S102, a first image is obtained, semantic extraction processing is carried out on the first image, and a second image is obtained;
step S104, extracting semantic information from the second image, and carrying out semantic information detail adjustment on the first image by utilizing the semantic information to obtain a third image;
and S106, extracting space information from the third image, and carrying out semantic information guiding processing on the second image by utilizing the space information to obtain a fourth image.
It should be understood that, although the steps in the flowcharts of the embodiments of the present application are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in various embodiments may include multiple sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, nor do the order in which the sub-steps or stages are performed necessarily performed in sequence, but may be performed alternately or alternately with at least a portion of the sub-steps or stages of other steps or other steps.
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware; the program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination contains no contradiction, it should be considered within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.
The foregoing description of the preferred embodiments of the application is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the application.

Claims (5)

1. An image processing method based on spatial and semantic information, the image processing method comprising:
acquiring a first image, and performing semantic extraction processing on the first image to obtain a second image;
extracting semantic information from the second image, and carrying out semantic information detail adjustment on the first image by utilizing the semantic information to obtain a third image;
extracting space information from the third image, and carrying out semantic information guiding processing on the second image by utilizing the space information to obtain a fourth image;
the method for extracting semantic information from the second image, and carrying out semantic information detail adjustment on the first image by utilizing the semantic information to obtain a third image comprises the following steps:
decoupling the second image and extracting a first edge feature from the second image;
extracting a second edge feature from the first image, and fusing the first edge feature and the second edge feature to obtain the third image;
the decoupling the second image and extracting a first edge feature from the second image comprises the following steps:
decoupling the second image according to the decoupleSegNet, and acquiring main body characteristics of the second image;
subtracting the main body characteristic from the second image to obtain the first edge characteristic;
the step of extracting a second edge feature from the first image, merging the first edge feature and the second edge feature to obtain the third image comprises the following steps:
optimizing the first image by using a Laplace operator, and sampling the first image by using transpose convolution to obtain the second edge feature;
performing feature fusion on the first edge feature and the second edge feature to obtain the third image;
the step of extracting spatial information from the third image, and performing semantic guidance processing on the second image by using the spatial information comprises the following steps:
performing attention operation on the second image, acquiring decoupled main body characteristics from the second image, and performing addition processing on the main body characteristics and the second image subjected to channel attention and spatial attention processing to obtain a fifth image;
and extracting the spatial information from the third image, and carrying out multi-scale fusion on the spatial information and the fifth image to obtain the fourth image.
2. The method according to claim 1, wherein the semantic extraction process is performed on the first image to obtain a second image, and the method comprises the following steps:
acquiring an original image to be processed, and reducing the resolution of the original image by increasing the number of channels of a feature map to obtain the first image;
performing feature extraction on the first image in a backbone network, and performing context aggregation to obtain the second image; the resolution of the second image is lower than the resolution of the first image.
3. An image processing system based on spatial and semantic information, the image processing system comprising:
the main network is used for acquiring a first image, and carrying out semantic extraction processing on the first image to obtain a second image;
the semantic adjustment detail module is used for extracting semantic information from the second image, and carrying out semantic information detail adjustment on the first image by utilizing the semantic information to obtain a third image;
the detail guiding semantic module is used for extracting space information from the third image, and conducting semantic information guiding processing on the second image by utilizing the space information to obtain a fourth image.
4. A computer device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to perform the steps of the spatial and semantic information based image processing method according to any of claims 1 or 2.
5. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, causes the processor to carry out the steps of the spatial and semantic information based image processing method according to any one of claims 1 or 2.
CN202310698749.7A 2023-06-14 2023-06-14 Image processing method, system, equipment and medium based on space and semantic information Active CN116452813B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310698749.7A CN116452813B (en) 2023-06-14 2023-06-14 Image processing method, system, equipment and medium based on space and semantic information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310698749.7A CN116452813B (en) 2023-06-14 2023-06-14 Image processing method, system, equipment and medium based on space and semantic information

Publications (2)

Publication Number Publication Date
CN116452813A CN116452813A (en) 2023-07-18
CN116452813B (en) 2023-08-22

Family

ID=87122244

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310698749.7A Active CN116452813B (en) 2023-06-14 2023-06-14 Image processing method, system, equipment and medium based on space and semantic information

Country Status (1)

Country Link
CN (1) CN116452813B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114549555A (en) * 2022-02-25 2022-05-27 北京科技大学 Human ear image planning and division method based on semantic division network
CN115359372A (en) * 2022-07-25 2022-11-18 成都信息工程大学 Unmanned aerial vehicle video moving object detection method based on optical flow network
CN115546485A (en) * 2022-10-17 2022-12-30 华中科技大学 Construction method of layered self-attention field Jing Yuyi segmentation model
CN116229461A (en) * 2023-01-31 2023-06-06 西南大学 Indoor scene image real-time semantic segmentation method based on multi-scale refinement

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11188799B2 (en) * 2018-11-12 2021-11-30 Sony Corporation Semantic segmentation with soft cross-entropy loss

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114549555A (en) * 2022-02-25 2022-05-27 北京科技大学 Human ear image planning and division method based on semantic division network
CN115359372A (en) * 2022-07-25 2022-11-18 成都信息工程大学 Unmanned aerial vehicle video moving object detection method based on optical flow network
CN115546485A (en) * 2022-10-17 2022-12-30 华中科技大学 Construction method of layered self-attention field Jing Yuyi segmentation model
CN116229461A (en) * 2023-01-31 2023-06-06 西南大学 Indoor scene image real-time semantic segmentation method based on multi-scale refinement

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Improved DeepLabV3+ algorithm for scene segmentation; Sang Yonglong, Han Jun; Electronics Optics & Control (电光与控制); Vol. 29, No. 3; 47-52 *

Also Published As

Publication number Publication date
CN116452813A (en) 2023-07-18

Similar Documents

Publication Publication Date Title
CN111047516B (en) Image processing method, image processing device, computer equipment and storage medium
Zhang et al. Swinfir: Revisiting the swinir with fast fourier convolution and improved training for image super-resolution
US20200234447A1 (en) Computer vision system and method
CN101477684B (en) Process for reconstructing human face image super-resolution by position image block
CN110490082B (en) Road scene semantic segmentation method capable of effectively fusing neural network features
US9865036B1 (en) Image super resolution via spare representation of multi-class sequential and joint dictionaries
CN110555433B (en) Image processing method, device, electronic equipment and computer readable storage medium
CN109523470B (en) Depth image super-resolution reconstruction method and system
CN103279933B (en) A kind of single image super resolution ratio reconstruction method based on bilayer model
CN113159143B (en) Infrared and visible light image fusion method and device based on jump connection convolution layer
CN113111835B (en) Semantic segmentation method and device for satellite remote sensing image, electronic equipment and storage medium
CN112651979A (en) Lung X-ray image segmentation method, system, computer equipment and storage medium
CN110826609B (en) Double-current feature fusion image identification method based on reinforcement learning
CN111275034A (en) Method, device, equipment and storage medium for extracting text region from image
CN113674191B (en) Weak light image enhancement method and device based on conditional countermeasure network
CN115731505B (en) Video salient region detection method and device, electronic equipment and storage medium
CN109543685A (en) Image, semantic dividing method, device and computer equipment
CN110689509A (en) Video super-resolution reconstruction method based on cyclic multi-column 3D convolutional network
CN115526777A (en) Blind over-separation network establishing method, blind over-separation method and storage medium
Wu et al. Cross-view panorama image synthesis with progressive attention GANs
Xu et al. Image enhancement algorithm based on generative adversarial network in combination of improved game adversarial loss mechanism
Li et al. NDNet: Spacewise multiscale representation learning via neighbor decoupling for real-time driving scene parsing
CN116452813B (en) Image processing method, system, equipment and medium based on space and semantic information
CN112614108B (en) Method and device for detecting nodules in thyroid ultrasound image based on deep learning
CN113743346A (en) Image recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant