CN116682014B - Method, device, equipment and storage medium for dividing lamp curtain building image


Info

Publication number
CN116682014B
CN116682014B (application CN202310672676.4A)
Authority
CN
China
Prior art keywords
sample
feature map
training
training model
feature
Legal status
Active
Application number
CN202310672676.4A
Other languages
Chinese (zh)
Other versions
CN116682014A (en)
Inventor
邓攀
徐威
华军
冷晓宏
刘广平
朱岳清
钱汇
Current Assignee
Wuxi Lighting Co ltd
Original Assignee
Wuxi Lighting Co ltd
Application filed by Wuxi Lighting Co ltd
Priority to CN202310672676.4A
Publication of CN116682014A
Application granted
Publication of CN116682014B


Abstract

The application relates to a method, a device, equipment and a storage medium for dividing a lamp curtain building image, in particular to the technical field of image detection. The method comprises the following steps: acquiring a sample building image; obtaining a pre-training model; keeping the parameters of the backbone network and the neck feature pyramid of the pre-training model unchanged, and training the attention module and the decoupling head of the pre-training model through the sample building image to obtain a first training model; and performing global training on the first training model through the sample building image to obtain a target detection model. The target detection model is used for processing a target building image to obtain a building segmentation result. Based on this scheme, the segmentation accuracy for lamp curtain building images is improved.

Description

Method, device, equipment and storage medium for dividing lamp curtain building image
Technical Field
The application relates to the field of image detection, in particular to a method, a device, equipment and a storage medium for dividing a lamp curtain building image.
Background
Modern building technology not only enables the design of various distinctive building structures, but also provides high-rise buildings with larger, more extensible facade areas, which offers greater scope for laying lamp curtains on building surfaces to display various light effects.
In practical application scenarios, the lamp curtain building needs to be identified and separated from the background in order to meet certain usage requirements on lamp curtain building images. In the prior art, sample building images are shot manually, the lamp curtain building is manually labeled in each sample building image to produce a building image dataset, and a building recognition model is then trained on this dataset by machine learning. The target building picture can subsequently be identified by the trained building recognition model.
However, the building image dataset in the above scheme is small, so the building recognition accuracy is low.
Disclosure of Invention
The application provides a method, a device, equipment and a storage medium for dividing a lamp curtain building image, which achieve high recognition accuracy when segmenting lamp curtain building images.
In one aspect, a method for dividing a lamp curtain building image is provided, the method comprising:
acquiring a sample building image; the sample building image comprises a building with a lamp curtain paved on the surface; the sample building image also comprises a sample label, and the sample label is used for labeling the building;
obtaining a pre-training model; the pre-training model comprises a backbone network, a neck feature pyramid, an attention module and a decoupling head;
maintaining parameters of the backbone network and the neck feature pyramid of the pre-training model unchanged, and training the attention module and the decoupling head of the pre-training model through the sample building image to obtain a first training model;
Performing global training on the first training model through a sample building image to obtain a target detection model; the target detection model is used for processing the target building image to obtain a building segmentation result.
In yet another aspect, there is provided a light curtain building image segmentation apparatus, the apparatus comprising:
The data acquisition module is used for acquiring a sample building image; the sample building image comprises a building with a lamp curtain paved on the surface; the sample building image also comprises a sample label, and the sample label is used for labeling the building;
The pre-training model acquisition module is used for acquiring a pre-training model; the pre-training model comprises a backbone network, a neck feature pyramid, an attention module and a decoupling head;
the first training module is used for keeping parameters of the backbone network and the neck feature pyramid of the pre-training model unchanged, and training the attention module and the decoupling head of the pre-training model through the sample building image to obtain a first training model;
The second training module is used for carrying out global training on the first training model through the sample building image to obtain a target detection model; the target detection model is used for processing the target building image to obtain a building segmentation result.
In one possible implementation manner, the step of keeping parameters of the backbone network and the neck feature pyramid of the pre-training model unchanged, and training the attention module and the decoupling head of the pre-training model through the sample building image to obtain a first training model includes:
Performing feature extraction on the sample building image through a backbone network of the pre-training model to obtain at least two layers of target sample feature images;
Fusing the at least two levels of target sample feature maps with one another through the neck feature pyramid of the pre-training model to obtain at least two levels of sample fusion feature maps;
processing the sample fusion feature images of at least two layers through an attention module of the pre-training model to obtain target receptive field feature images of at least two layers;
Processing the target receptive field feature images of at least two layers respectively through a decoupling head of a pre-training model to obtain a classification result;
obtaining a first loss function according to the classification result and the sample label;
And maintaining parameters of the backbone network and the neck feature pyramid unchanged, and carrying out back-propagation updating on the attention module and the decoupling head of the pre-training model according to the first loss function to obtain a first training model.
In one possible implementation manner, the performing global training on the first training model through the sample building image to obtain a target detection model includes:
Processing the sample building image through a first training model to obtain a second loss function;
And training the first training model according to the second loss function to obtain a target detection model after global training.
In one possible implementation manner, the backbone network of the pre-training model comprises a first feature extraction module, a second feature extraction module, a third feature extraction module and a fourth feature extraction module; the at least two levels of target sample feature maps comprise a first sample feature map, a second sample feature map, a third sample feature map and a fourth sample feature map;
The feature extraction is performed on the sample building image through the backbone network of the pre-training model to obtain at least two layers of target sample feature images, including:
processing the sample building image through the first feature extraction module to obtain a first sample feature map;
Processing the first sample feature map through the second feature extraction module to obtain a second sample feature map;
processing the second sample feature map through the third feature extraction module to obtain a third sample feature map;
and processing the third sample feature map through the fourth feature extraction module to obtain a fourth sample feature map.
In one possible implementation manner, the at least two levels of sample fusion feature maps include a first sample fusion feature map, a second sample fusion feature map, and a third sample fusion feature map;
Fusing the at least two levels of target sample feature maps with one another through the neck feature pyramid of the pre-training model to obtain at least two levels of sample fusion feature maps, which comprises the following steps:
Performing first convolution processing on the fourth sample feature map to obtain a first intermediate feature map;
Upsampling the first intermediate feature map; splicing the up-sampled first intermediate feature map and the third sample feature map, and sequentially carrying out second convolution processing and first convolution processing after splicing to obtain a second intermediate feature map;
Upsampling the second intermediate feature map; splicing the up-sampled second intermediate feature map with the second sample feature map, and performing second convolution processing after splicing to obtain a first sample fusion feature map;
performing first convolution processing on the first sample fusion feature map, splicing the processed first sample fusion feature map with the second intermediate feature map, and performing second convolution processing after splicing to obtain a second sample fusion feature map;
And sequentially carrying out first convolution processing on the second sample fusion feature map, splicing with the first intermediate feature map and second convolution processing to obtain a third sample fusion feature map.
In one possible implementation manner, the processing, by the attention module of the pre-training model, the sample fusion feature map of at least two layers respectively to obtain a target receptive field feature map of at least two layers includes:
And respectively carrying out segmentation, fusion and selection on the first sample fusion feature map, the second sample fusion feature map and the third sample fusion feature map in sequence to obtain a first target receptive field feature map, a second target receptive field feature map and a third target receptive field feature map.
In one possible implementation manner, the processing, by the decoupling head of the pre-training model, the target receptive field feature map of the at least two layers respectively, to obtain a classification result includes:
processing the first target receptive field feature map, the second target receptive field feature map and the third target receptive field feature map respectively to obtain a first classification result, a second classification result and a third classification result;
splicing and transposing the first classification result, the second classification result and the third classification result to obtain the classification result.
In yet another aspect, a computer device is provided that includes a processor and a memory having at least one instruction stored therein that is loaded and executed by the processor to implement the above-described method of light curtain building image segmentation.
In yet another aspect, a computer readable storage medium having stored therein at least one instruction loaded and executed by a processor to implement the above-described method of light curtain building image segmentation is provided.
In yet another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium and executes the computer instructions to cause the computer device to perform the above-described method of dividing a light curtain building image.
The technical scheme provided by the application can comprise the following beneficial effects:
The method first acquires a sample building image, which comprises a building with a lamp curtain laid on its surface and a sample label used for labeling the building; then obtains a pre-training model comprising a backbone network, a neck feature pyramid, an attention module and a decoupling head; keeps the parameters of the backbone network and the neck feature pyramid of the pre-training model unchanged while training the attention module and the decoupling head through the sample building image to obtain a first training model; and finally performs global training on the first training model through the sample building image to obtain a target detection model, which is used for processing a target building image to obtain a building segmentation result. According to this scheme, on the basis of the pre-training model, the parameters of the backbone network and the neck feature pyramid are kept unchanged while the attention module and the decoupling head are partially trained to obtain the first training model, and the first training model is then globally trained to obtain the target detection model; through this combination of partial training and global training, the segmentation accuracy of the target detection model on lamp curtain building images can be improved even when sample building images are few.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present application, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram illustrating a construction of a lamp curtain building image segmentation system according to an exemplary embodiment.
Fig. 2 is a flow chart illustrating a method of dividing a light curtain building image according to an exemplary embodiment.
Fig. 3 is a flow chart illustrating a method of dividing a light curtain building image according to an exemplary embodiment.
Fig. 4 is a diagram showing an example of the structure of an object detection model according to an embodiment of the present application.
Fig. 5 is a block diagram illustrating a construction of a lamp curtain building image segmentation apparatus according to an exemplary embodiment.
Fig. 6 is a block diagram of a computer device, according to an example embodiment.
Detailed Description
The following describes the embodiments of the present application clearly and completely with reference to the accompanying drawings, in which some, but not all, embodiments of the application are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort shall fall within the scope of the application.
It should be understood that the "indication" mentioned in the embodiments of the present application may be a direct indication, an indirect indication, or an indication having an association relationship. For example, a indicates B, which may mean that a indicates B directly, e.g., B may be obtained by a; it may also indicate that a indicates B indirectly, e.g. a indicates C, B may be obtained by C; it may also be indicated that there is an association between a and B.
In the description of the embodiments of the present application, the term "corresponding" may indicate that there is a direct correspondence or an indirect correspondence between the two, or may indicate that there is an association between the two, or may indicate a relationship between the two and the indicated, configured, etc.
In the embodiment of the present application, the "predefining" may be implemented by pre-storing corresponding codes, tables or other manners that may be used to indicate relevant information in devices (including, for example, terminal devices and network devices), and the present application is not limited to the specific implementation manner thereof.
Fig. 1 is a schematic diagram illustrating a construction of a lamp curtain building image segmentation system according to an exemplary embodiment. The lamp curtain building image segmentation system comprises a server 110 and a terminal device 120. The terminal device 120 may include a data processing device and a data storage module.
Optionally, the terminal device 120 is in communication connection with the server 110 through a transmission network (such as a wireless communication network), and the terminal device 120 may upload each data (such as image data) stored in the data storage module to the server 110 through the wireless communication network, so that the server 110 processes the acquired image data, for example, trains a convolutional neural network model applied to the aspects of dividing the building image of the lamp curtain through the uploaded image data.
Optionally, the terminal device 120 further includes an instruction input component (not shown in fig. 1), for example, a mouse, a keyboard, a touch screen, etc., where after receiving a specified instruction input by a user, the instruction input component may input corresponding data on the terminal device. For example, when the lamp curtain building image segmentation software is installed on the terminal equipment, a user can input a corresponding instruction to the terminal equipment through the instruction input component so as to control the lamp curtain building image segmentation software to output a corresponding lamp curtain building image segmentation result.
Optionally, the terminal device may upload the image file to the server 110, so that the server 110 trains the convolutional neural network model applied to the aspects of the lamp curtain building image segmentation and the like.
Optionally, the terminal device further includes a data processing device, where the data processing device may divide the image file through a convolutional neural network model issued by the server when the terminal device 120 opens the image file.
Optionally, the server 110 may obtain the image files uploaded by each terminal device, and label the image files by means of manual methods, so as to train the convolutional neural network model through the uploaded image files and labeling information, and after the training is completed, the trained convolutional neural network model may be transmitted to the terminal device, so that the terminal device performs lamp curtain building image segmentation on the image files.
Optionally, after the terminal device receives and opens the image file, the terminal device may upload the image file to the server 110, so that the trained convolutional neural network model in the server 110 segments the image file, and a segmentation result is obtained and returned to the terminal device, so as to realize online segmentation of the image file.
Optionally, the server may be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and technical computing services such as big data and artificial intelligence platforms.
Optionally, the system may further include a management device, where the management device is configured to manage the system (e.g., manage a connection state between each module and the server, etc.), where the management device is connected to the server through a communication network. Optionally, the communication network is a wired network or a wireless network.
Alternatively, the wireless network or wired network described above uses standard communication techniques and/or protocols. The network is typically the internet, but may be any other network including, but not limited to, a local area network, a metropolitan area network, a wide area network, a mobile, wired or wireless network, a private network, or any combination of virtual private networks. In some embodiments, techniques and/or formats including hypertext markup language, extensible markup language, and the like are used to represent data exchanged over the network. All or some of the links may also be encrypted using conventional encryption techniques such as secure socket layer, transport layer security, virtual private network, internet protocol security, etc. In other embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the data communication techniques described above.
Fig. 2 is a flow chart illustrating a method of dividing a light curtain building image according to an exemplary embodiment. The method is performed by a computer device, which may be one of a terminal device and a server as shown in fig. 1. As shown in fig. 2, the method for dividing the image of the lamp curtain building may include the following steps:
Step 201, a sample building image is acquired.
The sample building image comprises a building with a lamp curtain paved on the surface; the sample building image also includes a sample label that is used to label the building.
Image segmentation refers to dividing an image into a plurality of mutually disjoint regions according to characteristics such as gray scale, color, spatial texture, geometric shape and the like, so that the characteristics show consistency or similarity in the same region, but show obvious differences among different regions, namely, in one image, objects are separated from the background.
In order to improve the accuracy of the target detection model for image segmentation, the target detection model needs to be trained. First, a sample building image for training needs to be acquired, where the sample building image includes a sample label obtained by manually contour labeling a building in the sample building image.
Step 202, a pre-training model is obtained.
A pre-trained model refers to a model that has already been trained by technicians on a large number of datasets. At present, there is no dataset for lamp curtain building images, so the sample building images need to be produced in-house and the number of samples is insufficient; moreover, training a model from scratch places very high demands on sample count and device computing power. A pre-training model that solves a similar problem can therefore be selected as the basis of the target detection model, for example, a pre-training model capable of solving the image segmentation problem, and the structure of the pre-training model is designed as required.
Optionally, the pre-training model includes a backbone network, a neck feature pyramid, an attention module, and a decoupling head.
Step 203, keeping parameters of the backbone network and the neck feature pyramid of the pre-training model unchanged, and training the attention module and the decoupling head of the pre-training model through the sample building image to obtain a first training model.
Considering that the sample building images are insufficient to provide adequate training samples, the present application employs a two-step fine-tuning training mode: step 203 keeps the parameters of the backbone network and the neck feature pyramid unchanged and partially trains the attention module and the decoupling head, and step 204 then performs global training; both the partial training and the global training adopt a fine-tuning mode.
Step 204, performing global training on the first training model through the sample building image to obtain a target detection model.
It should be noted that the model structures of the pre-training model, the first training model and the target detection model are the same; the three differ only in their degree of training. That is, the pre-training model is the initial model, the first training model is the model after the partial training of step 203, and the target detection model is the model after the global training of step 204.
The target detection model is used for processing target building images to obtain building segmentation results. That is, after the target detection model is obtained, the target building image that the user wants to divide may be input into the target detection model, and the target building image may be divided.
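By way of a non-limiting illustration, the following sketch shows how a trained target detection model might be invoked to segment a target building image. The helper name segment_building and the assumed model output format (one box as center coordinates plus size, with a confidence score) are hypothetical and not prescribed by the application.

```python
import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor

def segment_building(model: torch.nn.Module, image_path: str) -> Image.Image:
    """Hypothetical sketch: run the trained target detection model on a
    target building image and crop out the predicted building region."""
    image = Image.open(image_path).convert("RGB")  # sample images are RGB
    tensor = to_tensor(image).unsqueeze(0)         # 1 x 3 x H x W
    model.eval()
    with torch.no_grad():
        # Assumed output format: one box (cx, cy, w, h) plus a confidence score.
        (cx, cy, w, h), confidence = model(tensor)
    left, top = int(cx - w / 2), int(cy - h / 2)
    return image.crop((left, top, left + int(w), top + int(h)))
```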
In summary, the application first acquires a sample building image, which comprises a building with a lamp curtain laid on its surface and a sample label used for labeling the building; then obtains a pre-training model comprising a backbone network, a neck feature pyramid, an attention module and a decoupling head; keeps the parameters of the backbone network and the neck feature pyramid of the pre-training model unchanged while training the attention module and the decoupling head through the sample building image to obtain a first training model; and finally performs global training on the first training model through the sample building image to obtain a target detection model, which is used for processing a target building image to obtain a building segmentation result. On the basis of the pre-training model, the parameters of the backbone network and the neck feature pyramid are kept unchanged while the attention module and the decoupling head are partially trained to obtain the first training model, and the first training model is then globally trained to obtain the target detection model; through this combination of partial training and global training, the segmentation accuracy of the target detection model on lamp curtain building images can be improved even when sample building images are few.
Fig. 3 is a flow chart illustrating a method of dividing a light curtain building image according to an exemplary embodiment. The method is performed by a computer device, which may be one of a terminal device and a server as shown in fig. 1. As shown in fig. 3, the method for dividing the image of the lamp curtain building may include the following steps:
Step 301, a sample building image is acquired.
In practical application scenarios, since there is no building image dataset for a light curtain building at present, manual acquisition and network collection are required to acquire sample building images. The sample building image comprises a building with a lamp curtain paved on the surface; the sample building image also includes a sample label that is used to label the building.
Further, a target detection model is trained through the sample building image, and the target detection model is used for dividing the lamp curtain building image. The goal of the light curtain building image segmentation is to take a high-definition image (namely a target building image) obtained by shooting as an input, and the target detection model needs to predict the position and size information of a building in the high-definition image so as to cut out the building in the high-definition image.
First, building images are acquired in a manual shooting and web crawler manner.
Further, the building images are preprocessed, e.g., data-cleaned, to retain the valid image data (i.e., building images containing a lamp curtain) and form an initial building image dataset S0.
The main goal of preprocessing is to convert the initial building image dataset S0 into a standard building image dataset S suitable for model training and evaluation; the preprocessing comprises a data normalization operation and a data labeling operation, so that model training and effect verification can be conveniently executed.
Optionally, the building images in the initial building image dataset S0 are all converted to RGB format, i.e., the sample building images in the standard building image dataset S are all in RGB format.
Optionally, labeling the building in the building image by LabelImg labeling software to obtain a sample building image including the sample label.
Alternatively, only the building in the center position in the building image is labeled, i.e., only the middle building is labeled in one building image.
Further, the standard building image dataset S is divided in a set ratio into a training set S_train and a test set S_test. Optionally, the division ratio of the training set S_train to the test set S_test is 8:2.
By way of example, building images are acquired by manual shooting, i.e., different buildings are shot from different angles in a surrounding manner; 29 building video clips are acquired in total, and 75 building images from different angles are extracted from the clips. Building images are also acquired by web crawler: using web resources as the database and a crawler program with keywords such as urban night scene buildings, night light bands of buildings, neon light buildings and night scene building lights, 5762 neon light building images are collected, from which 300 building images meeting the requirements are obtained through preliminary screening. Through the two approaches, 375 sample building images are obtained in total, and sample labels are added to form the standard building image dataset S. To facilitate model training and effect assessment, the standard building image dataset S is divided into a training set S_train and a test set S_test at a ratio of 8:2, where the training set S_train contains 300 building images and the test set S_test contains 75 building images.
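Illustratively, the 8:2 division described above can be sketched as follows; the function name, the use of file paths and the fixed random seed are assumptions for illustration only.

```python
import random

def split_dataset(image_paths, train_ratio=0.8, seed=0):
    """Divide the standard building image dataset S into S_train and S_test at 8:2."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    cut = int(len(paths) * train_ratio)
    return paths[:cut], paths[cut:]

# With the 375 sample building images of the example, this yields 300 training
# images (S_train) and 75 test images (S_test).
```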
Step 302, a pre-training model is obtained.
A pre-trained model refers to a model that has been trained by a technician through a large number of data sets.
Since model training from scratch requires a very large sample count and high device computing power, a pre-training model that solves a similar problem can be chosen as the basis of the target detection model; for example, the multi-target detection model YOLOX, which solves the image segmentation problem, can be chosen as the reference structure of the pre-training model.
Furthermore, considering the detection speed, the YOLOX-s model, which has a smaller parameter volume, is adopted as the reference structure of the pre-training model.
Further, a pre-training model is designed by taking YOLOX-s model as a reference structure, wherein the pre-training model comprises a main network, a neck feature pyramid, an attention module and a decoupling head.
Illustratively, the pre-training model is trained with a COCO dataset (a dataset for object detection, segmentation, and image description).
And 303, extracting features of the sample building image through a backbone network of the pre-training model to obtain at least two layers of target sample feature images.
After the sample building image and the pre-training model are obtained, the pre-training model can be trained through the sample building image. Firstly, feature extraction is carried out on the sample building image through a backbone network of a pre-training model, and at least two layers of target sample feature images are obtained.
Optionally, the backbone network of the pre-training model includes a first feature extraction module, a second feature extraction module, a third feature extraction module and a fourth feature extraction module; the at least two levels of target sample feature maps include a first sample feature map, a second sample feature map, a third sample feature map, and a fourth sample feature map.
First, the first feature extraction module is used for processing the sample building image to obtain a first sample feature map.
Further, the first sample feature map is processed through the second feature extraction module, and a second sample feature map is obtained.
Further, the second sample feature map is processed through the third feature extraction module, and a third sample feature map is obtained.
Further, the third sample feature map is processed by the fourth feature extraction module to obtain a fourth sample feature map.
Optionally, the first feature extraction module includes a first convolution processing module and a second convolution processing module. The first convolution processing module performs convolution, normalization and activation on the input feature map in sequence. The second convolution processing module comprises two branches: the first branch comprises a first convolution processing module, and the second branch comprises a first convolution processing module and n first processing modules. The two branches each process the input feature map, their results are spliced (concat), and the spliced result is processed through a first convolution processing module and output. In the first processing module, the input feature map is processed sequentially through two first convolution processing modules, and the processing result is then added (add) to the input feature map and output.
Optionally, the second feature extraction module and the third feature extraction module are the same as the first feature extraction module.
Optionally, the fourth feature extraction module sequentially comprises a first convolution processing module, an SPP (spatial pyramid pooling) module, and a second convolution processing module. In the SPP module, the input feature map is first processed by a first convolution processing module and then passed through four parallel branches: one branch applies no pooling and outputs the feature map directly, while the other three branches apply pooling with kernel sizes of 5×5, 9×9 and 13×13, respectively. The processing results of the four parallel branches are spliced and then processed by a first convolution processing module to obtain the output result.
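By way of a non-limiting sketch, the three building blocks just described (first convolution processing module, second convolution processing module with its first processing modules, and SPP module) might be written in PyTorch as follows; the SiLU activation and exact channel widths are assumptions rather than details fixed by the application.

```python
import torch
import torch.nn as nn

class BaseConv(nn.Module):
    """First convolution processing module: convolution -> normalization -> activation."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()  # activation choice is an assumption

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class FirstProcessing(nn.Module):
    """First processing module: two BaseConvs, then add (residual) with the input."""
    def __init__(self, c):
        super().__init__()
        self.conv1 = BaseConv(c, c, k=1)
        self.conv2 = BaseConv(c, c, k=3)

    def forward(self, x):
        return x + self.conv2(self.conv1(x))

class CSP(nn.Module):
    """Second convolution processing module: two branches, splice (concat), BaseConv."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_mid = c_out // 2
        self.branch1 = BaseConv(c_in, c_mid, k=1)   # first branch
        self.branch2 = nn.Sequential(               # second branch: BaseConv + n blocks
            BaseConv(c_in, c_mid, k=1),
            *[FirstProcessing(c_mid) for _ in range(n)])
        self.out = BaseConv(2 * c_mid, c_out, k=1)

    def forward(self, x):
        return self.out(torch.cat([self.branch1(x), self.branch2(x)], dim=1))

class SPP(nn.Module):
    """SPP module: BaseConv, four parallel branches (identity plus 5/9/13
    max pooling), splice, BaseConv."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_mid = c_in // 2
        self.pre = BaseConv(c_in, c_mid, k=1)
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in (5, 9, 13))
        self.post = BaseConv(4 * c_mid, c_out, k=1)

    def forward(self, x):
        x = self.pre(x)
        return self.post(torch.cat([x] + [p(x) for p in self.pools], dim=1))
```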
Step 304, fusing the at least two levels of target sample feature maps with one another through the neck feature pyramid of the pre-training model to obtain at least two levels of sample fusion feature maps.
Optionally, the at least two levels of sample fusion feature maps include a first sample fusion feature map, a second sample fusion feature map, and a third sample fusion feature map.
Firstly, performing first convolution processing on the fourth sample feature map to obtain a first intermediate feature map;
Further, upsampling the first intermediate feature map; and splicing the up-sampled first intermediate feature map and the third sample feature map, and sequentially carrying out second convolution processing and first convolution processing after splicing to obtain a second intermediate feature map.
Further, upsampling the second intermediate feature map; and splicing the up-sampled second intermediate feature map with the second sample feature map, and performing second convolution processing after splicing to obtain a first sample fusion feature map.
Further, a first convolution process is performed on the first sample fusion feature map; the processed result is spliced with the second intermediate feature map, and a second convolution process is performed after splicing to obtain the second sample fusion feature map.
Further, a first convolution process, splicing with the first intermediate feature map, and a second convolution process are performed in sequence on the second sample fusion feature map to obtain the third sample fusion feature map.
It should be noted that, the module for performing the first convolution process is the same as the first convolution process module in step 303, and the module for performing the second convolution process is the same as the second convolution process module in step 303.
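Illustratively, the fusion order of the steps described above might be sketched as follows, reusing the BaseConv and CSP classes from the backbone sketch; the stride-2 setting of the downsampling first convolutions and the channel widths are assumptions made so that the spliced feature maps align.

```python
import torch
import torch.nn as nn

class NeckFusion(nn.Module):
    """Neck feature pyramid: fuse the second, third and fourth sample feature
    maps (c2, c3, c4) into the first, second and third sample fusion maps."""
    def __init__(self, c2, c3, c4):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.conv4 = BaseConv(c4, c3, k=1)       # first conv on fourth feature map
        self.csp3 = CSP(2 * c3, c3)              # second conv processing
        self.conv3 = BaseConv(c3, c2, k=1)       # first conv processing
        self.csp2 = CSP(2 * c2, c2)
        self.down2 = BaseConv(c2, c2, k=3, s=2)  # stride 2 is an assumption
        self.csp_p2 = CSP(2 * c2, c3)
        self.down3 = BaseConv(c3, c3, k=3, s=2)
        self.csp_p3 = CSP(2 * c3, c4)

    def forward(self, c2, c3, c4):
        m1 = self.conv4(c4)                                           # first intermediate map
        m2 = self.conv3(self.csp3(torch.cat([self.up(m1), c3], 1)))   # second intermediate map
        p1 = self.csp2(torch.cat([self.up(m2), c2], 1))               # first sample fusion map
        p2 = self.csp_p2(torch.cat([self.down2(p1), m2], 1))          # second sample fusion map
        p3 = self.csp_p3(torch.cat([self.down3(p2), m1], 1))          # third sample fusion map
        return p1, p2, p3
```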
And 305, respectively processing the sample fusion feature graphs of at least two layers through an attention module of the pre-training model to obtain target receptive field feature graphs of at least two layers.
The attention module comprises sub-modules corresponding to the sample fusion feature graphs of each level.
Optionally, the at least two levels of target receptive field feature map comprise a first target receptive field feature map, a second target receptive field feature map, and a third target receptive field feature map.
Optionally, the first sample fusion feature map, the second sample fusion feature map, and the third sample fusion feature map are sequentially segmented (split), fused (fuse), and selected (select) to obtain a first target receptive field feature map, a second target receptive field feature map, and a third target receptive field feature map.
The split part of each sub-module of the attention module comprises a plurality of branches that perform convolution processing with different kernel sizes on the input sample fusion feature map; the number of branches can be set according to actual needs. The fuse part combines the information of the branches to obtain a global, comprehensive representation for the selection weights. The select part aggregates the processing results of the different branches according to the selection weights to obtain a target receptive field (convolution kernel) feature map. Through the attention module, more details of the sample building image can be captured, improving image segmentation accuracy.
Optionally, each sub-module of the attention module is an SK (Selective Kernel) attention module.
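By way of a non-limiting sketch, an SK attention sub-module with the split, fuse and select parts described above might look as follows; the branch kernel sizes and the reduction ratio are assumptions, since the application leaves the number of branches configurable.

```python
import torch
import torch.nn as nn

class SKAttention(nn.Module):
    """Selective Kernel attention: split / fuse / select."""
    def __init__(self, channels, kernels=(3, 5), reduction=16):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, k, padding=k // 2, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True))
            for k in kernels)
        d = max(channels // reduction, 32)
        self.fc = nn.Sequential(nn.Linear(channels, d), nn.ReLU(inplace=True))
        self.selectors = nn.ModuleList(nn.Linear(d, channels) for _ in kernels)

    def forward(self, x):
        feats = torch.stack([b(x) for b in self.branches], dim=0)  # split
        u = feats.sum(dim=0)                                       # fuse branches
        z = self.fc(u.mean(dim=(2, 3)))                            # global descriptor
        weights = torch.stack([sel(z) for sel in self.selectors], dim=0)
        weights = torch.softmax(weights, dim=0)[..., None, None]   # select weights
        return (weights * feats).sum(dim=0)  # target receptive field feature map
```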
And 306, respectively processing the target receptive field feature images of at least two layers through decoupling heads of the pre-training model to obtain a classification result.
Optionally, processing the first target receptive field feature map, the second target receptive field feature map and the third target receptive field feature map respectively to obtain a first classification result, a second classification result and a third classification result;
further, the first classification result, the second classification result and the third classification result are spliced and transposed (transpose) to obtain the classification result.
The lamp curtain building image segmentation task is essentially a single-target detection problem: only the building in the foreground is detected, and all other areas are regarded as background. Therefore, the multi-category decoupling head setting of the pre-training model is simplified into a single-category model structure suitable for the lamp curtain building image segmentation task. Specifically, the application edits the decoupling heads of the pre-training model into a binary classification mode suitable for this task: the number of categories in the decoupling head is adjusted to 1, i.e., the head classifies whether the target object is a lamp curtain building. If yes, the object is judged to be a lamp curtain building; if not, it is judged to be background.
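Illustratively, a decoupled head edited to this binary (single-category) mode might be sketched as follows, reusing the BaseConv class from the backbone sketch; the intermediate channel width and branch depth are assumptions.

```python
import torch
import torch.nn as nn

class SingleClassDecoupledHead(nn.Module):
    """Decoupled head edited for binary (building / background) detection:
    the class branch has a single output channel."""
    def __init__(self, c_in, c_mid=256):
        super().__init__()
        self.stem = BaseConv(c_in, c_mid, k=1)
        self.cls_branch = nn.Sequential(BaseConv(c_mid, c_mid), BaseConv(c_mid, c_mid))
        self.reg_branch = nn.Sequential(BaseConv(c_mid, c_mid), BaseConv(c_mid, c_mid))
        self.cls_pred = nn.Conv2d(c_mid, 1, 1)  # one category: lamp curtain building
        self.reg_pred = nn.Conv2d(c_mid, 4, 1)  # box (cx, cy, w, h)
        self.obj_pred = nn.Conv2d(c_mid, 1, 1)  # confidence

    def forward(self, x):
        x = self.stem(x)
        cls_feat, reg_feat = self.cls_branch(x), self.reg_branch(x)
        out = torch.cat([self.reg_pred(reg_feat),
                         self.obj_pred(reg_feat),
                         self.cls_pred(cls_feat)], dim=1)  # (B, 6, H, W)
        return out.flatten(2).transpose(1, 2)              # splice and transpose
```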
Step 307, obtaining a first loss function according to the classification result and the sample label.
Illustratively, the first loss function is formulated as follows:

$L = L_{Cls} + L_{Obj} + L_{Reg}$

wherein $L_{Cls}$ is a category loss term for the target building, $L_{Obj}$ is a confidence loss term for the target building, and $L_{Reg}$ is a regression loss term for the location information associated with the target building. Taking the training set $S_{train} = \{x_i\}$ of the standard building image dataset $S$ as the input sample set, with $x_i$ a sample in the training set, the three losses take standard binary cross-entropy and squared-error forms:

$L_{Cls} = -\sum_i \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$

with $L_{Obj}$ taking the same binary cross-entropy form over the confidence targets, and

$L_{Reg} = \sum_i \left\| \hat{b}_i - b_i \right\|^2$

In the category loss term $L_{Cls}$ and the confidence loss term $L_{Obj}$, $y_i$ indicates the actual label value of $x_i$ and $\hat{y}_i$ the label value (probability value) predicted by the model for $x_i$. In the regression loss term $L_{Reg}$, the vector $b_i = (x_i^c, y_i^c, w_i, h_i)$ collects the actual central position information and scale information (width and height) of the labeled target frame (the sample label) in $x_i$, i.e., the coordinates $(x_i^c, y_i^c)$ of the corresponding target building in the two-dimensional space formed by the sample building image and its size $(w_i, h_i)$. Likewise, $\hat{b}_i$ collects the coordinates and size of the box predicted by the model (i.e., the segmentation result) for $x_i$.
It should be noted that the binary classification refers specifically to the category term in $L_{Cls}$ taking two values: a category value of 1 denotes the lamp curtain building, and a category value of 0 denotes the background.
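By way of a non-limiting sketch, the first loss function might be computed as follows, under the binary cross-entropy and squared-error forms assumed above:

```python
import torch
import torch.nn.functional as F

def first_loss(cls_pred, obj_pred, box_pred, cls_gt, obj_gt, box_gt):
    """Sketch of L = L_Cls + L_Obj + L_Reg: binary cross-entropy class and
    confidence terms plus a squared-error regression term (forms assumed)."""
    l_cls = F.binary_cross_entropy_with_logits(cls_pred, cls_gt)
    l_obj = F.binary_cross_entropy_with_logits(obj_pred, obj_gt)
    l_reg = F.mse_loss(box_pred, box_gt)  # (cx, cy, w, h) against the labeled frame
    return l_cls + l_obj + l_reg
```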
And 308, maintaining parameters of the backbone network and the neck feature pyramid unchanged, and carrying out back propagation update on the attention module and the decoupling head of the pre-training model according to the first loss function to obtain a first training model.
It should be noted that, step 303 to step 308 are a complete training process, and the training process may be performed multiple times as required in actual training until the predicted result achieves the desired effect.
Optionally, prior to model training, conventional presetting is performed on the model, including iteration step number, learning rate and regularization.
Exemplary hardware conditions for training the models (including the pre-training model and the first training model) are an Intel(R) Core(TM) i7-12700KF processor, 64 GB of memory, and an NVIDIA GeForce RTX 3060 graphics card. Training is performed under the PyTorch deep learning framework.
Illustratively, a standard SGD optimizer is employed for model training, with the following settings: the total iteration step number is 20 and the learning rate is fixed at 0.001. To prevent overfitting, weight decay regularization is used, with the momentum term set to 0.9 and the weight decay factor set to 5e-4.
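Illustratively, the partial (back-end fine-tuning) training stage might be set up as follows; the attribute names model.backbone and model.neck and the helper compute_first_loss are assumptions, while the optimizer settings follow the text above.

```python
import torch

# Freeze backbone and neck: partial (back-end fine-tuning) training only
# updates the attention module and the decoupled head.
for p in model.backbone.parameters():
    p.requires_grad = False
for p in model.neck.parameters():
    p.requires_grad = False

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=1e-3, momentum=0.9, weight_decay=5e-4)  # settings from the text

for step in range(20):                                # total iteration steps: 20
    loss = compute_first_loss(model, images, labels)  # hypothetical helper
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```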
In step 309, the first training model is globally trained through the sample building image to obtain a target detection model.
The present application employs a two-step fine-tuning training mode, considering that the training set S_train is insufficient to provide adequate training samples. In this mode, partial training (corresponding to steps 303 to 308) is first performed on the attention module and the decoupling head while the parameters of the backbone network and the neck feature pyramid are kept unchanged, and global training is then performed in step 309. The partial training only trains the back end of the model at a low learning rate, i.e., back-end fine-tuning; the global training is also performed in a fine-tuning mode, with the learning rate set lower than that of the partial training.
When global training is performed, firstly, processing the sample building image through a first training model to obtain a second loss function;
further, training the first training model according to the second loss function to obtain a target detection model after global training.
It should be noted that the partial training of steps 303 to 308 differs from the global training of step 309 only in that the parameters of the backbone network and the neck feature pyramid are kept unchanged during the partial training.
The target detection model is used for processing target building images to obtain building segmentation results.
Exemplary global training settings are as follows: the total iteration step number is set to 40, and the learning rate is fixed at 0.0001. For weight decay regularization, the momentum term is set to 0.9 and the weight decay factor to 0.0005.
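Illustratively, the global training stage might then be set up as follows; the helper compute_second_loss is an assumption, while the optimizer settings follow the text above.

```python
# Global fine-tuning: unfreeze everything and train at a lower learning rate.
for p in model.parameters():
    p.requires_grad = True

optimizer = torch.optim.SGD(model.parameters(),
                            lr=1e-4, momentum=0.9, weight_decay=5e-4)

for step in range(40):                                 # total iteration steps: 40
    loss = compute_second_loss(model, images, labels)  # hypothetical helper
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```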
Fig. 4 is a diagram showing an example of the structure of an object detection model according to an embodiment of the present application. In fig. 4, baseConv corresponds to a first convolution processing module, CSP corresponds to a second convolution processing module, up-sample is upsampling, and SK Attention is the SK Attention module.
It should be noted that, the model structures of the pre-training model, the first training model and the target detection model in the present application are the same, and the difference between the three models is the difference of the training degrees.
The model is illustratively evaluated to further demonstrate the effectiveness of the method of the application. The experimental procedure is as follows:
The target detection model is evaluated and tested on the test set S_test, and the test results are compared with those of the traditional training method.
Optionally, following the current target detection literature, an average precision index measuring the degree of overlap between the prediction frame (i.e., the segmentation result) and the real frame (i.e., the sample label) is adopted as the evaluation index. Optionally, the mAP (IoU@50) index is used as the evaluation index.
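By way of a non-limiting sketch, the IoU overlap underlying the mAP (IoU@50) index can be computed as follows:

```python
def iou(box_a, box_b):
    """Intersection over union of two (left, top, right, bottom) boxes,
    as used by the mAP (IoU@50) evaluation index."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# A prediction counts as correct for mAP (IoU@50) when iou(...) >= 0.5.
```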
Optionally, the validity of the target detection model is evaluated by means of an ablative test. Specifically, by comparing test results under different training modes, the effectiveness of the target detection model is evaluated, and the evaluation results are shown in the following table:
table 1: performance comparison of the target detection model under different training modes.
Training mode                        Learning rate        Result accuracy
Standard training                    1.00E-2              65.41%
Standard training                    1.00E-3              82.73%
Partial training + global training   1.00E-3 / 1.00E-3    66.17%
Partial training + global training   1.00E-3 / 1.00E-4    84.44%
As can be seen from Table 1, a target detection model with higher accuracy can be obtained by combining the partial training of steps 303 to 308 with the global training of step 309, that is, by training the pre-training model in the two-step fine-tuning mode.
In summary, the application first acquires a sample building image, which comprises a building with a lamp curtain laid on its surface and a sample label used for labeling the building; then obtains a pre-training model comprising a backbone network, a neck feature pyramid, an attention module and a decoupling head; keeps the parameters of the backbone network and the neck feature pyramid of the pre-training model unchanged while training the attention module and the decoupling head through the sample building image to obtain a first training model; and finally performs global training on the first training model through the sample building image to obtain a target detection model, which is used for processing a target building image to obtain a building segmentation result. On the basis of the pre-training model, the parameters of the backbone network and the neck feature pyramid are kept unchanged while the attention module and the decoupling head are partially trained to obtain the first training model, and the first training model is then globally trained to obtain the target detection model; through this combination of partial training and global training, the segmentation accuracy of the target detection model on lamp curtain building images can be improved even when sample building images are few.
Fig. 5 is a block diagram illustrating a construction of a lamp curtain building image segmentation apparatus according to an exemplary embodiment. The lamp curtain building image segmentation device comprises:
A data acquisition module 501, configured to acquire a sample building image; the sample building image comprises a building with a lamp curtain paved on the surface; the sample building image also comprises a sample label, and the sample label is used for labeling the building;
a pre-training model acquisition module 502, configured to acquire a pre-training model; the pre-training model comprises a backbone network, a neck feature pyramid, an attention module and a decoupling head;
A first training module 503, configured to keep parameters of the backbone network and the neck feature pyramid of the pre-training model unchanged, and train the attention module and the decoupling head of the pre-training model through the sample building image to obtain a first training model;
A second training module 504, configured to globally train the first training model through a sample building image to obtain a target detection model; the target detection model is used for processing target building images to obtain building segmentation results.
In one possible implementation manner, the step of maintaining parameters of the backbone network and the neck feature pyramid of the pre-training model unchanged, and training the attention module and the decoupling head of the pre-training model through the sample building image to obtain a first training model includes:
performing feature extraction on the sample building image through a backbone network of the pre-training model to obtain at least two layers of target sample feature images;
Fusing the at least two levels of target sample feature maps with one another through the neck feature pyramid of the pre-training model to obtain at least two levels of sample fusion feature maps;
processing the sample fusion feature images of at least two layers through an attention module of the pre-training model to obtain target receptive field feature images of at least two layers;
Processing the target receptive field feature images of at least two layers respectively through a decoupling head of the pre-training model to obtain a classification result;
obtaining a first loss function according to the classification result and the sample label;
and maintaining the parameters of the backbone network and the neck feature pyramid unchanged, and carrying out back-propagation updating on the attention module and the decoupling head of the pre-training model according to the first loss function to obtain a first training model.
In one possible implementation manner, the global training of the first training model through the sample building image to obtain the target detection model includes:
processing the sample building image through a first training model to obtain a second loss function;
and training the first training model according to the second loss function to obtain a target detection model after global training.
In one possible implementation manner, the backbone network of the pre-training model comprises a first feature extraction module, a second feature extraction module, a third feature extraction module and a fourth feature extraction module; the at least two levels of target sample feature maps comprise a first sample feature map, a second sample feature map, a third sample feature map, and a fourth sample feature map;
the feature extraction is performed on the sample building image through a backbone network of a pre-training model to obtain at least two layers of target sample feature images, which comprises the following steps:
processing the sample building image through the first feature extraction module to obtain a first sample feature map;
Processing the first sample feature map through the second feature extraction module to obtain a second sample feature map;
processing the second sample feature map through the third feature extraction module to obtain a third sample feature map;
and processing the third sample feature map through the fourth feature extraction module to obtain a fourth sample feature map.
In one possible implementation, the at least two levels of sample fusion feature maps include a first sample fusion feature map, a second sample fusion feature map, and a third sample fusion feature map;
Fusing the at least two levels of target sample feature maps with one another through the neck feature pyramid of the pre-training model to obtain at least two levels of sample fusion feature maps, which comprises the following steps:
Performing first convolution processing on the fourth sample feature map to obtain a first intermediate feature map;
Upsampling the first intermediate feature map; splicing the up-sampled first intermediate feature map and the third sample feature map, and sequentially carrying out second convolution processing and first convolution processing after splicing to obtain a second intermediate feature map;
Upsampling the second intermediate feature map; splicing the up-sampled second intermediate feature map with the second sample feature map, and performing second convolution processing after splicing to obtain a first sample fusion feature map;
performing first convolution processing on the first sample fusion feature map, splicing the processed first sample fusion feature map with the second intermediate feature map, and performing second convolution processing after splicing to obtain a second sample fusion feature map;
and sequentially carrying out first convolution processing on the second sample fusion feature map, splicing with the first intermediate feature map and second convolution processing to obtain a third sample fusion feature map.
In one possible implementation manner, the processing, by the attention module of the pre-training model, the sample fusion feature map of at least two layers respectively to obtain a target receptive field feature map of at least two layers includes:
and respectively carrying out segmentation, fusion and selection on the first sample fusion feature map, the second sample fusion feature map and the third sample fusion feature map in sequence to obtain a first target receptive field feature map, a second target receptive field feature map and a third target receptive field feature map.
In one possible implementation, processing the at least two layers of target receptive field feature maps respectively through the decoupling head of the pre-training model to obtain the classification result comprises:
processing the first target receptive field feature map, the second target receptive field feature map and the third target receptive field feature map respectively to obtain a first classification result, a second classification result and a third classification result;
and splicing and transposing the first classification result, the second classification result and the third classification result to obtain the classification result, as sketched below.
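In the hedged sketch below, the shared stem, the single binary class and the shared weights across pyramid levels are assumptions; the patent specifies only that each receptive field feature map yields a per-level classification result, which is then spliced and transposed.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    # Classification branch of a decoupled head, configured for a single
    # "lamp curtain building or not" class. Layer shapes are assumptions.
    def __init__(self, c, num_classes=1):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(c, c, kernel_size=3, padding=1), nn.SiLU())
        self.cls = nn.Conv2d(c, num_classes, kernel_size=1)

    def forward(self, levels):
        # One classification map per receptive field feature map, flattened
        # over spatial locations.
        per_level = [self.cls(self.stem(f)).flatten(2) for f in levels]  # each (B, K, H*W)
        out = torch.cat(per_level, dim=2)   # splice along locations -> (B, K, sum H*W)
        return out.transpose(1, 2)          # transpose -> (B, sum H*W, K)
```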
In summary, the application first acquires a sample building image, where the sample building image comprises a building whose surface is paved with a lamp curtain and further carries a sample label used for annotating the building; acquires a pre-training model comprising a backbone network, a neck feature pyramid, an attention module and a decoupling head; keeps the parameters of the backbone network and the neck feature pyramid of the pre-training model unchanged while training the attention module and the decoupling head through the sample building image to obtain a first training model; and finally performs global training on the first training model through the sample building image to obtain a target detection model, which is used for processing a target building image to obtain a building segmentation result. By first training only part of the model and then training it globally, the scheme improves the segmentation accuracy of the target detection model on lamp curtain building images even when sample building images are scarce. A sketch of this two-stage schedule follows.
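In this sketch the submodule names (backbone, neck, attention, head), the binary cross-entropy loss standing in for both the first and second loss functions, and the learning rates and epoch counts are all assumptions for illustration, not patent specifics.

```python
import torch

def two_stage_training(model, loader, epochs_partial=10, epochs_global=30):
    # Assumes model exposes backbone / neck / attention / head submodules
    # and returns classification logits shaped like the labels.
    loss_fn = torch.nn.BCEWithLogitsLoss()

    # Stage 1: freeze backbone and neck, train attention module and head.
    for module in (model.backbone, model.neck):
        for p in module.parameters():
            p.requires_grad = False
    opt = torch.optim.SGD((p for p in model.parameters() if p.requires_grad), lr=1e-2)
    for _ in range(epochs_partial):
        for images, labels in loader:
            loss = loss_fn(model(images), labels)   # "first loss function"
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: unfreeze everything and train globally at a lower rate.
    for p in model.parameters():
        p.requires_grad = True
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    for _ in range(epochs_global):
        for images, labels in loader:
            loss = loss_fn(model(images), labels)   # "second loss function"
            opt.zero_grad(); loss.backward(); opt.step()
    return model
```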
Fig. 6 shows a block diagram of a computer device 600 according to an exemplary embodiment of the application. The computer device may be implemented as a server in the above-described aspects of the present application. The computer apparatus 600 includes a central processing unit (Central Processing Unit, CPU) 601, a system Memory 604 including a random access Memory (Random Access Memory, RAM) 602 and a Read-Only Memory (ROM) 603, and a system bus 605 connecting the system Memory 604 and the central processing unit 601. The computer device 600 also includes a mass storage device 606 for storing an operating system 609, application programs 610, and other program modules 611.
The mass storage device 606 is connected to the central processing unit 601 through a mass storage controller (not shown) connected to the system bus 605. The mass storage device 606 and its associated computer-readable media provide non-volatile storage for the computer device 600. That is, the mass storage device 606 may include a computer-readable medium (not shown) such as a hard disk or a compact disc read-only memory (CD-ROM) drive.
The computer readable medium may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid-state memory technology, CD-ROM, digital versatile discs (Digital Versatile Disc, DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage medium is not limited to those described above. The system memory 604 and mass storage device 606 described above may be collectively referred to as memory.
According to various embodiments of the present disclosure, the computer device 600 may also operate through a remote computer connected to a network, such as the Internet. That is, the computer device 600 may be connected to the network 608 through a network interface unit 607 connected to the system bus 605, or the network interface unit 607 may be used to connect to other types of networks or remote computer systems (not shown).
The memory further stores at least one computer program, and the central processing unit 601 implements all or part of the steps of the methods shown in the above embodiments by executing the at least one computer program.
In an exemplary embodiment, a computer readable storage medium is also provided for storing at least one computer program that is loaded and executed by a processor to implement all or part of the steps of the above method. For example, the computer readable storage medium may be Read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), compact disc Read-Only Memory (CD-ROM), magnetic tape, floppy disk, optical data storage device, and the like.
In an exemplary embodiment, a computer program product or a computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium and executes the computer instructions to cause the computer device to perform all or part of the steps of the method shown in any of the embodiments of fig. 2 or 3 described above.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (9)

1. A method for dividing a lamp curtain building image, the method comprising:
acquiring a sample building image; the sample building image comprises a building with a lamp curtain paved on the surface; the sample building image also comprises a sample label, and the sample label is used for labeling the building;
obtaining a pre-training model; the pre-training model comprises a backbone network, a neck feature pyramid, an attention module and a decoupling head; the decoupling head of the pre-training model is edited into a classification mode to classify whether a target object is a lamp curtain building;
maintaining parameters of the backbone network and the neck feature pyramid of the pre-training model unchanged, and training the attention module and the decoupling head of the pre-training model through the sample building image to obtain a first training model;
performing global training on the first training model through the sample building image to obtain a target detection model; the target detection model is used for processing a target building image to obtain a building segmentation result;
wherein the training of the attention module and the decoupling head of the pre-training model through the sample building image comprises:
performing feature extraction on the sample building image through the backbone network of the pre-training model to obtain at least two layers of target sample feature maps;
fusing the at least two layers of target sample feature maps with the target sample feature maps of each layer through the neck feature pyramid of the pre-training model to obtain at least two layers of sample fusion feature maps;
processing the at least two layers of sample fusion feature maps respectively through the attention module of the pre-training model to obtain at least two layers of target receptive field feature maps;
processing the at least two layers of target receptive field feature maps respectively through the decoupling head of the pre-training model to obtain a classification result;
obtaining a first loss function according to the classification result and the sample label;
and maintaining the parameters of the backbone network and the neck feature pyramid unchanged, and performing a back-propagation update on the attention module and the decoupling head of the pre-training model according to the first loss function to obtain the first training model.
2. The method of claim 1, wherein the performing global training on the first training model through the sample building image to obtain the target detection model comprises:
processing the sample building image through the first training model to obtain a second loss function;
and training the first training model according to the second loss function to obtain the target detection model after global training.
3. The method of claim 1, wherein the backbone network of the pre-training model comprises a first feature extraction module, a second feature extraction module, a third feature extraction module and a fourth feature extraction module; the at least two layers of target sample feature maps comprise a first sample feature map, a second sample feature map, a third sample feature map and a fourth sample feature map;
wherein the performing feature extraction on the sample building image through the backbone network of the pre-training model to obtain the at least two layers of target sample feature maps comprises:
processing the sample building image through the first feature extraction module to obtain the first sample feature map;
processing the first sample feature map through the second feature extraction module to obtain the second sample feature map;
processing the second sample feature map through the third feature extraction module to obtain the third sample feature map;
and processing the third sample feature map through the fourth feature extraction module to obtain the fourth sample feature map.
4. The method of claim 3, wherein the at least two layers of sample fusion feature maps comprise a first sample fusion feature map, a second sample fusion feature map and a third sample fusion feature map;
wherein the fusing the at least two layers of target sample feature maps with the target sample feature maps of each layer through the neck feature pyramid of the pre-training model to obtain the at least two layers of sample fusion feature maps comprises:
performing first convolution processing on the fourth sample feature map to obtain a first intermediate feature map;
upsampling the first intermediate feature map; splicing the upsampled first intermediate feature map and the third sample feature map, and sequentially performing second convolution processing and first convolution processing after splicing to obtain a second intermediate feature map;
upsampling the second intermediate feature map; splicing the upsampled second intermediate feature map with the second sample feature map, and performing second convolution processing after splicing to obtain the first sample fusion feature map;
performing first convolution processing on the first sample fusion feature map, splicing the processed first sample fusion feature map with the second intermediate feature map, and performing second convolution processing after splicing to obtain the second sample fusion feature map;
and sequentially performing first convolution processing on the second sample fusion feature map, splicing with the first intermediate feature map and second convolution processing to obtain the third sample fusion feature map.
5. The method according to claim 4, wherein the processing the at least two layers of sample fusion feature maps respectively through the attention module of the pre-training model to obtain the at least two layers of target receptive field feature maps comprises:
sequentially performing splitting, fusing and selecting on each of the first sample fusion feature map, the second sample fusion feature map and the third sample fusion feature map, to obtain a first target receptive field feature map, a second target receptive field feature map and a third target receptive field feature map.
6. The method according to claim 5, wherein the processing the at least two layers of target receptive field feature maps respectively through the decoupling head of the pre-training model to obtain the classification result comprises:
processing the first target receptive field feature map, the second target receptive field feature map and the third target receptive field feature map respectively to obtain a first classification result, a second classification result and a third classification result;
and splicing and transposing the first classification result, the second classification result and the third classification result to obtain the classification result.
7. A lamp curtain building image segmentation apparatus, the apparatus comprising:
The data acquisition module is used for acquiring a sample building image; the sample building image comprises a building with a lamp curtain paved on the surface; the sample building image also comprises a sample label, and the sample label is used for labeling the building;
the pre-training model acquisition module is used for acquiring a pre-training model; the pre-training model comprises a backbone network, a neck feature pyramid, an attention module and a decoupling head; the decoupling head of the pre-training model is edited into a classification mode to classify whether a target object is a lamp curtain building;
the first training module is used for maintaining parameters of the backbone network and the neck feature pyramid of the pre-training model unchanged, and training the attention module and the decoupling head of the pre-training model through the sample building image to obtain a first training model;
the second training module is used for carrying out global training on the first training model through the sample building image to obtain a target detection model; the target detection model is used for processing the target building image to obtain a building segmentation result;
The first training module is further configured to:
performing feature extraction on the sample building image through the backbone network of the pre-training model to obtain at least two layers of target sample feature maps;
fusing the at least two layers of target sample feature maps with the target sample feature maps of each layer through the neck feature pyramid of the pre-training model to obtain at least two layers of sample fusion feature maps;
processing the at least two layers of sample fusion feature maps respectively through the attention module of the pre-training model to obtain at least two layers of target receptive field feature maps;
processing the at least two layers of target receptive field feature maps respectively through the decoupling head of the pre-training model to obtain a classification result;
obtaining a first loss function according to the classification result and the sample label;
and maintaining the parameters of the backbone network and the neck feature pyramid unchanged, and performing a back-propagation update on the attention module and the decoupling head of the pre-training model according to the first loss function to obtain the first training model.
8. A computer device comprising a processor and a memory, wherein at least one instruction is stored in the memory, and the at least one instruction is loaded and executed by the processor to implement the method for dividing a lamp curtain building image according to any one of claims 1 to 6.
9. A computer readable storage medium, wherein at least one instruction is stored in the storage medium, and the at least one instruction is loaded and executed by a processor to implement the method for dividing a lamp curtain building image according to any one of claims 1 to 6.