WO2022252558A1 - Methods for neural network training and image processing, apparatus, device and storage medium - Google Patents


Info

Publication number
WO2022252558A1
Authority
WO
WIPO (PCT)
Prior art keywords
area, offset, image, roof, base
Application number
PCT/CN2021/137544
Other languages
French (fr)
Chinese (zh)
Inventor
王金旺 (Wang Jinwang)
Original Assignee
上海商汤智能科技有限公司 (Shanghai SenseTime Intelligent Technology Co., Ltd.)
Application filed by 上海商汤智能科技有限公司 (Shanghai SenseTime Intelligent Technology Co., Ltd.)
Publication of WO2022252558A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent

Definitions

  • the present disclosure relates to the field of computer technology, and in particular to methods for neural network training and image processing, together with a corresponding apparatus, device, and storage medium.
  • a neural-network-based building base extraction network is mainly used to extract building bases from remote sensing images; the extracted bases are then used for building statistics.
  • the present disclosure at least discloses a neural network training method.
  • the method may include: for each of multiple regions, acquiring one or more frames of captured images corresponding to the region, where, if a region corresponds to multiple frames of captured images, at least two of those frames have different capture angles; taking one frame of the captured images corresponding to the region as the target captured image of the region and annotating it with base-area ground-truth information; determining the base-area ground-truth information annotated on the region's target captured image as the base-area ground-truth information of every frame of captured image corresponding to that region; and obtaining a training sample set based on the captured images and target captured images corresponding to the multiple regions, so that neural network training can be performed based on the training sample set.
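As an illustration of the sample-expansion step above, the following sketch propagates the single annotated target capture of each region to every capture of that region. The data layout and all names are hypothetical, not taken from the disclosure:

```python
# Sketch of the sample-expansion idea: the one annotated "target" capture
# per region donates its base-area ground truth to every other capture of
# the same region. All names and the data layout are illustrative.

def expand_samples(regions):
    """regions maps region_id -> list of image records; exactly one record
    per region (the target capture) carries a 'base_mask' annotation."""
    training_set = []
    for region_id, images in regions.items():
        # Locate the single manually annotated target capture.
        target = next(img for img in images if 'base_mask' in img)
        for img in images:
            # Every capture inherits the same ground truth, since a
            # building's base does not move between captures.
            training_set.append({'image': img['image'],
                                 'base_mask': target['base_mask']})
    return training_set

regions = {
    'region_0': [{'image': 'r0_t0.png', 'base_mask': 'mask_r0.png'},
                 {'image': 'r0_t1.png'},
                 {'image': 'r0_t2.png'}],
}
samples = expand_samples(regions)
print(len(samples))  # 3 training samples from a single annotation
```

One annotation per region thus yields as many training samples as there are captures of that region.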
  • in some examples, the method further includes: acquiring the training sample set; using the building base extraction network to obtain the roof area and offset corresponding to each captured image in the training sample set, where the offset represents the offset between the roof area and the base area; for each captured image, translating the roof area corresponding to the image by the obtained offset to obtain the base area corresponding to the image; and adjusting the network parameters of the building base extraction network based on the base-area ground-truth information corresponding to each captured image and the base area obtained for each captured image.
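The indirect extraction just described (translating the predicted roof area by the predicted offset to obtain the base area) can be sketched for a binary mask as follows. The mask and offset are toy values, and the network that would produce them is omitted:

```python
import numpy as np

# Toy sketch of the indirect extraction: shift a predicted binary roof
# mask by a predicted integer offset to obtain the base mask.

def translate_mask(mask, dx, dy):
    """Translate a binary mask by (dx, dy) pixels, zero-filling the border."""
    h, w = mask.shape
    out = np.zeros_like(mask)
    ys, xs = np.nonzero(mask)
    ys2, xs2 = ys + dy, xs + dx
    keep = (ys2 >= 0) & (ys2 < h) & (xs2 >= 0) & (xs2 < w)
    out[ys2[keep], xs2[keep]] = 1
    return out

roof = np.zeros((8, 8), dtype=np.uint8)
roof[1:4, 1:4] = 1                       # predicted 3x3 roof region
base = translate_mask(roof, dx=2, dy=3)  # apply the predicted offset
print(int(base.sum()))  # 9: the whole block moved intact
```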
  • in some examples, obtaining the training sample set further includes: annotating base-position ground-truth information on the target captured image corresponding to each region; and, for each region, determining the base-position ground-truth information annotated on the region's target captured image as the base-position ground-truth information of every frame of captured image corresponding to that region.
  • in some examples, the method further includes: acquiring the training sample set; using the roof area extraction network, the offset extraction network, and the roof position extraction network included in the building base extraction network to obtain the roof area, offset, and roof position corresponding to each captured image in the training sample set, where the offset represents the offset between the roof area and the base area; adjusting the network parameters of the roof area extraction network based on the base-area ground-truth information corresponding to each captured image and the roof area and offset obtained for each captured image; and adjusting the network parameters of the roof position extraction network and the offset extraction network based on the base-position ground-truth information corresponding to each captured image and the roof position and offset obtained for each captured image.
  • in some examples, adjusting the network parameters of the roof area extraction network based on the base-area ground-truth information corresponding to each captured image, and on the roof area and offset obtained for each captured image, includes: for each frame of the captured images, translating the base-area ground-truth information corresponding to the image by the offset corresponding to the image to obtain first roof-area ground-truth information corresponding to the image; obtaining roof-area loss information corresponding to the image based on the first roof-area ground-truth information and the roof area obtained for the image; and adjusting the network parameters of the roof area extraction network through backpropagation based on the roof-area loss information corresponding to each captured image.
  • in some examples, adjusting the network parameters of the roof position extraction network and the offset extraction network based on the base-position ground-truth information corresponding to each captured image, and on the roof position and offset obtained for each captured image, includes: for each frame of the captured images, translating the roof position corresponding to the image by the offset corresponding to the image to obtain the base position corresponding to the image; obtaining base-position loss information corresponding to the image based on the base-position ground-truth information and the base position obtained for the image; and adjusting the network parameters of the roof position extraction network and the offset extraction network through backpropagation based on the base-position loss information corresponding to each captured image.
  • the roof area extraction network, the offset extraction network and the roof position extraction network share a feature extraction network.
  • in some examples, at least part of the captured images in the training sample set are also annotated with second roof-area ground-truth information, a real offset, and roof-position ground-truth information; the method further includes at least one of the following: adjusting the network parameters of the roof area extraction network based on the second roof-area ground-truth information annotated on the at least part of the captured images and the roof areas obtained for them; adjusting the network parameters of the offset extraction network based on the real offsets annotated on the at least part of the captured images and the offsets obtained for them; and adjusting the network parameters of the roof position extraction network based on the roof-position ground-truth information annotated on the at least part of the captured images and the roof positions obtained for them.
  • in some examples, the at least part of the captured images are also annotated with building bounding-box ground-truth information; the method further includes: using the building bounding-box extraction network included in the building base extraction network to extract the building bounding boxes corresponding to the at least part of the captured images, where the building bounding-box extraction network includes the feature extraction network; and adjusting the network parameters of the building bounding-box extraction network based on the building bounding-box ground-truth information annotated on the at least part of the captured images and the building bounding boxes obtained for them.
  • in some examples, the method further includes: pretraining the building base extraction network using the captured images in the training sample set that are annotated with the second roof-area ground-truth information, the real offset, and the roof-position ground-truth information.
  • in some examples, the captured images in the training sample set are annotated with a first real offset; the method further includes: using the offset extraction network to obtain, from multiple rotated images, second predicted offsets corresponding to various preset angles, where a second predicted offset indicates the offset between the roof and the base in the rotated image obtained by rotating the captured image by the corresponding preset angle; rotating the first real offset by each of the preset angles to obtain second real offsets corresponding to the preset angles; and adjusting the network parameters of the offset extraction network based on the second real offsets and the second predicted offsets corresponding to the preset angles.
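Rotating the first real offset by a preset angle can be sketched with a standard 2D rotation. The counter-clockwise, degree-based convention used here is an assumption for illustration, not specified by the disclosure:

```python
import math

# Sketch of rotating the annotated (first real) offset by a preset angle
# so it can supervise the offset predicted from the rotated image.

def rotate_offset(dx, dy, angle_deg):
    """Rotate the 2D offset vector (dx, dy) counter-clockwise by angle_deg."""
    theta = math.radians(angle_deg)
    return (dx * math.cos(theta) - dy * math.sin(theta),
            dx * math.sin(theta) + dy * math.cos(theta))

# Second real offsets for some example preset angles.
for angle in (90, 180, 270):
    rx, ry = rotate_offset(3.0, 4.0, angle)
    print(angle, round(rx, 6), round(ry, 6))
```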
  • in some examples, using the offset extraction network to obtain the second predicted offsets corresponding to the various preset angles from the multiple rotated images includes: for each of the preset angles, using the offset extraction network to rotate the first image feature corresponding to the captured image by the preset angle to obtain a second image feature corresponding to the preset angle, and obtaining the second predicted offset corresponding to the preset angle based on the second image feature.
  • the present disclosure also proposes an image processing method, including: receiving a remote sensing image to be processed; using a building base extraction network to extract the roof area and offset of a building in the remote sensing image, where the building base extraction network is trained by the neural network training method of any of the foregoing implementations and the offset represents the offset between the roof area and the base area; and translating the roof area by the offset to obtain the building base area corresponding to the remote sensing image to be processed.
  • the present disclosure also proposes a neural network training apparatus, including: an acquisition module configured to acquire, for each of multiple regions, one or more frames of captured images corresponding to the region, where, if a region corresponds to multiple frames of captured images, at least two of those frames have different capture angles; a first annotation module configured to take one frame of the captured images corresponding to the region as the target captured image of the region and annotate it with base-area ground-truth information; and a first determination module configured to determine the base-area ground-truth information annotated on the region's target captured image as the base-area ground-truth information of every frame of captured image corresponding to that region, and to obtain a training sample set based on the captured images and target captured images corresponding to the multiple regions, so that neural network training can be performed based on the training sample set.
  • the present disclosure also proposes an image processing apparatus, including: a receiving module configured to receive a remote sensing image to be processed; an extraction module configured to use a building base extraction network to extract the roof area and offset of a building, where the building base extraction network is trained by the neural network training method of any of the foregoing implementations and the offset represents the offset between the roof area and the base area; and a translation module configured to translate the roof area by the offset to obtain the building base area corresponding to the remote sensing image to be processed.
  • the present disclosure also proposes an electronic device, including: a processor; and a memory for storing processor-executable instructions, where the processor executes the executable instructions to implement any of the neural network training methods and/or the image processing method described above.
  • the present disclosure also proposes a computer-readable storage medium storing a computer program, where the computer program is used to cause a processor to execute any of the above neural network training methods and/or the above image processing method.
  • the area and position of a building's base are identical in every image captured of the same region. Annotating base-area ground-truth information on the target captured image of a region can therefore be regarded as annotating it for every frame captured of that region, which expands the samples: a large number of training samples is obtained through a small amount of annotation work.
  • the training sample set, expanded by exploiting the fact that the same building's base area does not change, can be used to train the building base extraction network, which helps to train a high-precision building base extraction network from a small number of labeled samples.
  • FIG. 1 is a method flowchart of a neural network training method shown in the present disclosure.
  • FIG. 2 is a schematic flowchart of a neural network training method shown in the present disclosure.
  • FIG. 3 is a schematic diagram of a building base area extraction process shown in the present disclosure.
  • FIG. 4 is a schematic diagram of a building base area extraction process shown in the present disclosure.
  • FIG. 5 is a schematic flow chart of a neural network training method shown in the present disclosure.
  • FIG. 6 is a method flowchart of a neural network training method shown in the present disclosure.
  • FIG. 7 is a schematic flowchart of a neural network training method shown in the present disclosure.
  • FIG. 8 is a schematic diagram of a building base extraction network training process shown in the present disclosure.
  • FIG. 9 is a schematic diagram of a building base extraction network training process shown in the present disclosure.
  • FIG. 10 is a schematic structural diagram of a neural network training device shown in the present disclosure.
  • FIG. 11 is a schematic diagram of a hardware structure of an electronic device shown in the present disclosure.
  • the present disclosure aims to propose a neural network training method.
  • this method exploits the fact that the base area of a building does not change, sharing base-area ground-truth information among the multiple frames of captured images corresponding to the same region so as to expand the training samples, which in turn helps to train a high-precision building base extraction network from a small number of labeled samples.
  • FIG. 1 is a method flowchart of a neural network training method shown in the present disclosure.
  • the neural network training method can be applied to electronic equipment.
  • the electronic device may implement the method by carrying a software device corresponding to the neural network training method.
  • the type of the electronic device may be a notebook computer, a computer, a server, a mobile phone, a PAD terminal and the like.
  • the type of the electronic device is not particularly limited in the present disclosure.
  • the electronic device may be a client device or a server device.
  • the server device may be a cloud.
  • an electronic device (hereinafter referred to as device) is taken as an example for description.
  • the method may include:
  • the captured image may be captured by any image capturing device capable of capturing images of the multiple regions.
  • among the multiple frames of images captured for the same region, at least two frames have different capture angles, which enriches the information contained in the training samples and improves the adaptability of the neural network.
  • the collected images may be classified and stored in the storage medium according to regions.
  • the device can acquire the collected images from the storage medium.
  • the acquired images may include multi-temporal images acquired for the plurality of regions.
  • the multi-temporal image may refer to multiple frames of remote sensing images collected for the same area at different times.
  • the target captured image may be selected arbitrarily, from the one or more frames of captured images corresponding to the region, among those whose resolution meets the standard.
  • at least one frame may be selected from the captured images corresponding to each region as the target captured image, and the base-area ground-truth information is then annotated on it in advance.
  • the ground truth information of the base area may be pixel level ground truth information.
  • for example, the base-area ground-truth information may set the value of pixels inside the building base area of the remote sensing image to 1 and the value of pixels outside the base area to 0.
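A minimal illustration of such a pixel-level mask, with an arbitrary toy shape and coordinates:

```python
import numpy as np

# Pixel-level base-area ground truth: 1 inside the base region, 0 outside.
gt = np.zeros((6, 6), dtype=np.uint8)
gt[2:5, 1:4] = 1      # a 3x3 building base
print(int(gt.sum()))  # 9 foreground pixels
```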
  • the base region ground truth information marked for the target acquisition image corresponding to each region in S104 can be used as the truth value information corresponding to each acquisition image in each region, thereby achieving the purpose of expanding training samples.
  • the area and position of a building's base are the same in every image captured of the same region. Annotating base-area ground-truth information on any one frame of a region's images and using that frame as the region's target captured image can therefore be regarded as annotating base-area ground truth for every frame captured of that region, which expands the samples: a large number of training samples is obtained through a small amount of annotation work.
  • neural network training can be performed based on the obtained training sample set.
  • FIG. 2 is a schematic flowchart of a neural network training method shown in the present disclosure.
  • the method includes:
  • the device may execute S202 in response to the network training request.
  • the training sample set may be stored in a storage medium, so that the device can obtain the stored training sample set from the storage medium. Afterwards, the device may perform S204-S206.
  • the building base extraction network (hereinafter referred to as the base extraction network) can be used to extract the building base directly;
  • alternatively, the base extraction network can first extract the building roof together with the offset indicating the displacement between the roof and the base, and then translate the roof by the offset to obtain the base indirectly.
  • the training methods of the base extraction network corresponding to different methods are different.
  • FIG. 3 is a schematic diagram of a process of extracting a building base area shown in the present disclosure.
  • the base area can be obtained directly after the remote sensing image is input into the base extraction network.
  • the base extraction network shown in FIG. 3 may be a network constructed based on a target detection network.
  • the target detection network can be based on any of RCNN (Region Convolutional Neural Network), FAST-RCNN (Fast Region Convolutional Neural Network), FASTER-RCNN (Faster Region Convolutional Neural Network), or MASK-RCNN (Mask Region Convolutional Neural Network).
  • a MASK-RCNN with higher accuracy for region representation can be used.
  • the MASK-RCNN may include an RPN (Region Proposal Network), an RoI Align (Region of Interest Align) unit, and so on.
  • the RPN is used to generate candidate boxes for each building in the captured image; after the candidate boxes are obtained, regression and classification can be performed on them to obtain the bounding box corresponding to each building.
  • the RoI Align unit is used to extract, according to a building's bounding box, the visual features corresponding to the building from the captured image; these visual features can then be used to extract the base area, roof area, offset, and roof position according to the functional requirements of the target detection network.
  • when executing S204, the device may input each captured image in the training sample set into the base extraction network for base extraction, and obtain the base area corresponding to each captured image.
  • the preset loss function can be used to obtain the base-area loss information corresponding to each captured image from the base-area ground-truth information annotated on each captured image and the base area obtained for each captured image.
  • after the gradient is computed, backpropagation can be used to adjust the network parameters of the base extraction network.
  • the network training is completed, and the trained building base extraction network is obtained.
  • the training sample set, expanded by exploiting the fact that the same building's base area does not change, can be used to train the building base extraction network, which helps to train a high-precision building base extraction network from a small number of labeled samples.
  • FIG. 4 is a schematic diagram of a process for extracting building base regions shown in the present disclosure.
  • the roof area of the building and the offset indicating the offset between the roof and the base can be obtained first.
  • the offset can then be used to transform (for example, translate) the roof area to obtain the base area.
  • the base extraction network shown in FIG. 4 may include a roof area extraction network and an offset extraction network.
  • the roof area extraction network and the offset extraction network may be networks constructed based on a target detection network.
  • the target detection network can be any one of RCNN, FAST-RCNN, FASTER-RCNN or MASK-RCNN.
  • a MASK-RCNN with higher accuracy for region representation can be used.
  • the roof area extraction network and the offset extraction network may share a feature extraction network.
  • the shared feature extraction network can include a backbone network, regional feature extraction units, etc. This can simplify the network structure and facilitate network training.
  • the two networks can also share RPN, RoI Align units, and the like.
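A minimal architectural sketch of such sharing, with one backbone feeding a roof-mask head and an offset head. This is a toy illustration of the shared-feature idea, not the disclosure's actual MASK-RCNN-based network; all layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

# Toy sketch: a shared convolutional feature extractor feeds two heads,
# one predicting per-pixel roof logits and one regressing a global offset.

class SharedFeatureNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(            # shared feature extractor
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        )
        self.roof_head = nn.Conv2d(16, 1, 1)      # per-pixel roof logits
        self.offset_head = nn.Sequential(         # global (dx, dy) regression
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2),
        )

    def forward(self, x):
        feats = self.backbone(x)                  # computed once, used twice
        return self.roof_head(feats), self.offset_head(feats)

net = SharedFeatureNet()
roof_logits, offset = net(torch.randn(1, 3, 32, 32))
print(tuple(roof_logits.shape), tuple(offset.shape))  # (1, 1, 32, 32) (1, 2)
```

Because the backbone runs once per image, sharing it simplifies the structure and lets gradients from both heads train the same features.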
  • FIG. 5 is a schematic flowchart of a neural network training method shown in the present disclosure.
  • the neural network training method may include:
  • the roof area extraction network and the offset extraction network included in the building base extraction network may be used to extract the roof area and offset corresponding to the collected images respectively.
  • a translation operation may be performed on each pixel contained in the roof area to obtain the base area.
  • the preset loss function can be used to obtain the base-area loss information corresponding to each captured image from the base-area ground-truth information annotated for each captured image and the base area obtained for each captured image. Afterwards, once the gradient is computed, backpropagation can be used to adjust the network parameters of the base extraction network.
  • the network training is completed, and the trained building base extraction network is obtained.
  • by obtaining the building base indirectly, the method exploits the fact that the roof area and the offset are salient features in the captured image, which improves the accuracy of base extraction; even when the building base is occluded, a higher-precision building base can still be obtained.
  • the training sample set, expanded by exploiting the fact that the same building's base area does not change, can be used to train the building base extraction network, which helps to train a high-precision building base extraction network from a small number of labeled samples.
  • the fact that the shape and position of the same building's base area do not change can be used to share base-area ground-truth information and base-position ground-truth information among the multiple frames of captured images corresponding to the same region, so as to expand the training samples, which in turn helps to train a high-precision building base extraction network from a small number of labeled samples.
  • FIG. 6 is a method flowchart of a neural network training method shown in the present disclosure. As shown in Figure 6, the method may include:
  • the true value information of the base position may be marked in advance.
  • the base-position ground-truth information may include the coordinates of the center pixel of the base area together with the width and height of the base area.
  • cx, cy represent the horizontal and vertical coordinates of the center pixel of the base area, respectively
  • w, h represent the width and height of the base area, respectively.
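For illustration, a (cx, cy, w, h) position can be converted to corner coordinates with a small helper; the helper and its values are hypothetical, not part of the disclosure:

```python
# Convert the (cx, cy, w, h) center encoding to corner coordinates
# (x_min, y_min, x_max, y_max).

def center_to_corners(cx, cy, w, h):
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

print(center_to_corners(10, 20, 4, 6))  # (8.0, 17.0, 12.0, 23.0)
```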
  • the base position truth information marked for the target acquisition image corresponding to each area in S604 can be used as the truth information corresponding to each acquisition image in each area, thereby achieving the purpose of expanding training samples.
  • each acquired image in the obtained training sample set is marked with the ground truth information of the base area and the ground truth information of the base position.
  • neural network training can be performed based on the obtained training sample set.
  • FIG. 7 is a schematic flowchart of a neural network training method shown in the present disclosure.
  • the method may include S702-S708.
  • S706 and S708 do not have a strict execution sequence.
  • S706 and S708 may be executed in parallel.
  • the present disclosure does not specifically limit the execution sequence of S706 and S708.
  • the neural network training method can be applied to electronic equipment.
  • the device may execute S702 to acquire the training sample set from a storage medium in response to the network training request.
  • the device may execute S704-S708.
  • the building base extraction network (hereinafter referred to as the base extraction network) may be a network constructed based on a target detection network.
  • for the base extraction network, in order to improve the accuracy of base-area extraction, MASK-RCNN, which represents regions with higher accuracy, can be used as the target detection network.
  • the base extraction network may include a roof area extraction network, an offset extraction network, and a roof position extraction network.
  • the roof area extraction network can be used to extract building roof areas.
  • the offset extraction network can be used to extract the offset between the roof and the base.
  • the roof position extraction network may be used to extract roof positions.
  • the offset can then be used to transform (for example, translate) the roof area to obtain the base area.
  • the position of the roof can be translated to obtain the position of the base through the offset.
  • FIG. 8 is a schematic diagram of a network training process for building base extraction shown in the present disclosure.
  • the base extraction network shown in FIG. 8 includes a roof area extraction network, an offset extraction network and a roof position extraction network. Among them, the roof area and offset extracted by the roof area extraction network and the offset extraction network can be translated and transformed to obtain the base area.
  • the network can be modified to add a base-area loss determination branch and a base-position loss determination branch, so that the network parameters are updated according to the determined loss information.
  • the base area loss information may represent an error between the obtained base area and the true value information of the base area.
  • the base position loss information may represent an error between the obtained base position and the base position true value information.
  • S7062 may be executed: for each frame of the captured images, translate the base-area ground-truth information corresponding to the image by the offset corresponding to the image to obtain the first roof-area ground-truth information corresponding to the image. Then S7064 may be executed: obtain the roof-area loss information corresponding to the image based on the first roof-area ground-truth information and the roof area obtained for the image. Afterwards, S7066 may be executed: adjust the network parameters of the roof area extraction network through backpropagation based on the roof-area loss information corresponding to each captured image.
  • the size of the extracted roof area is a preset size.
  • for example, the size of the roof area may be 14×14.
  • if the predicted offset is too large, translating the roof area may move pixels of the roof area out of the matrix of the preset size, causing information loss; an accurate base area then cannot be obtained and the network may fail to converge.
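The information-loss problem can be demonstrated on a toy 14x14 mask: pixels pushed past the mask border by an overly large offset are simply dropped. All values here are illustrative:

```python
import numpy as np

# Roof pixels shifted past the border of the fixed-size mask are lost.

def shift_mask(mask, dx, dy):
    h, w = mask.shape
    out = np.zeros_like(mask)
    ys, xs = np.nonzero(mask)
    ys2, xs2 = ys + dy, xs + dx
    keep = (ys2 >= 0) & (ys2 < h) & (xs2 >= 0) & (xs2 < w)
    out[ys2[keep], xs2[keep]] = 1
    return out

roof = np.zeros((14, 14), dtype=np.uint8)
roof[9:13, 9:13] = 1                     # 16 roof pixels near the border
shifted = shift_mask(roof, dx=3, dy=3)   # overly large predicted offset
print(int(roof.sum()), int(shifted.sum()))  # 16 4: 12 pixels were lost
```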
  • the truth information of the base area is pixel-level truth information, that is, 0 or 1 is marked for each pixel in the captured image.
  • the pixels marked as 1 can be considered as the pixels in the base area; the pixels marked as 0 can be considered as the pixels outside the base area.
  • the base-area ground-truth information, by contrast, is translated within the corresponding captured image, so with high probability no ground-truth information is lost; that is, the first roof-area ground-truth information obtained in S7062 will not be missing any actual roof-area ground truth.
  • accurate roof area loss information can be obtained based on the first roof area true value information and the roof area, so as to ensure smooth convergence of the network.
  • the network parameters of the roof area extraction network can be adjusted by calculating the descent gradient and using back propagation. This enables the training of the network for roof region extraction.
  • S7082 may be executed: for each frame of image in the collected images, the offset corresponding to the image is used to translate the roof position corresponding to the image, so as to obtain the base position corresponding to the image. Then S7084 may be executed to obtain base position loss information corresponding to the image based on the base position ground truth information corresponding to the image and the base position obtained for the image. Afterwards, S7086 may be executed to adjust the network parameters of the roof position extraction network and the offset extraction network through backpropagation based on the base position loss information respectively corresponding to the collected images.
  • R0(cx0, cy0, w0, h0) may be used to represent the extracted roof position, where cx0 and cy0 represent the horizontal and vertical coordinates of the center pixel of the roof area, respectively, and w0 and h0 represent the width and height of the roof area, respectively.
  • ⁇ x and ⁇ y represent the offset of the pixel point in the X-axis and Y-axis directions, respectively.
  • a preset loss function such as a cross-entropy loss function
  • the descent gradient can be calculated, and the network parameters can be updated through backpropagation. Since the roof position and offset obtained by the roof position extraction network and the offset extraction network need to be used when extracting the base position, both of these networks can be updated during the backpropagation process.
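The translation of the roof position by the offset described in S7082 can be sketched as follows (illustrative values; in practice the roof position R0 and the offset come from the roof position extraction network and the offset extraction network):

```python
def translate_roof_position(roof, offset):
    """Shift the roof box (cx, cy, w, h) by the predicted offset
    (dx, dy); the width and height stay the same."""
    cx0, cy0, w0, h0 = roof
    dx, dy = offset
    return (cx0 + dx, cy0 + dy, w0, h0)

# hypothetical extracted roof position and predicted offset
base_position = translate_roof_position((40.0, 52.0, 14.0, 10.0), (-3.5, 6.0))
```

Only the center coordinates move; the box dimensions are preserved, since the roof and base of a building have the same extent in the image.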
  • the expanded training sample set can be used to train the roof area extraction network, the roof position extraction network and the offset extraction network, so as to complete the training of the base extraction network and obtain a high-precision base extraction network.
  • the roof area extraction network, the offset extraction network and the roof position extraction network may share a feature extraction network such as a backbone network and an area feature extraction unit. This can simplify the network structure and facilitate network training.
  • the roof region extraction network and the offset extraction network are MASK-RCNN.
  • the roof area extraction network, the offset extraction network and the roof position extraction network may also share an RPN (Region Proposal Network) and an RoI Align (Region of Interest Align) unit, etc.
  • the shared feature extraction network can be adjusted by all branches during training, so that the training processes mutually constrain and mutually promote one another. On the one hand, this improves network training efficiency; on the other hand, it encourages the shared feature extraction network to extract features that are more beneficial to base area extraction, thereby improving the accuracy of base area extraction.
  • network training efficiency and network prediction accuracy can be improved through joint training.
  • At least part of the collected images of the training sample set may also be marked with at least one of the following information: ground truth information of the second roof area, real offset, and ground truth information of the roof position.
  • manual labeling may be used to label the real value information of the roof area, the real offset, and the real value information of the roof position.
  • a preset loss function (such as a cross-entropy loss function) may be used to obtain loss information according to the second roof area true value information and the obtained roof area. Gradients are then calculated based on the obtained loss information, and backpropagation is performed to adjust the network parameters of the roof region extraction network.
  • a preset loss function (such as an MSE (Mean Square Error, mean square error) loss function) may be used to obtain loss information according to the real offset and the obtained offset. Then calculate the gradient according to the obtained loss information, and perform backpropagation to update the network parameters of the offset extraction network.
  • a preset loss function (such as a Smooth L1 (smooth L1 norm) loss function) may be used to obtain loss information according to the roof position ground truth information and the obtained roof position. Then the gradient is calculated according to the obtained loss information, and the network parameters of the roof position extraction network are updated through backpropagation.
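A minimal sketch of the Smooth L1 loss mentioned above (numpy; the `beta` threshold of 1.0 is the common default and an assumption here):

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1: quadratic for small residuals, linear for large ones,
    so a few badly mispredicted boxes do not dominate the gradient."""
    diff = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return float(loss.mean())
```

This behavior (MSE-like near zero, L1-like far from zero) is why Smooth L1 is a common choice for box-regression targets such as the roof position.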
  • by jointly training the roof area extraction network, the roof position extraction network, and the offset extraction network that share the feature extraction network, various kinds of supervision information can be introduced, so that the training processes mutually constrain and mutually promote one another. On the one hand, this improves network training efficiency; on the other hand, it encourages the shared feature extraction network to extract features that are more beneficial to base area extraction, thereby improving the accuracy of base area extraction.
  • the collected images in the training sample set are also marked with a first real offset; the first real offset indicates the real offset between the roof and the base in the collected images.
  • S402 may be executed: the offset extraction network is used to obtain, from multiple rotated images, second predicted offsets respectively corresponding to various preset angles. The second predicted offset indicates the offset between the roof and the base in the rotated image; the multiple rotated images are obtained by rotating the collected image by the various preset angles respectively.
  • the collected image may refer to a remote sensing image marked with the first real offset.
  • the offset refers to the offset between the roof and the base in the remote sensing image.
  • for example, if the roof includes 10 pixels, the base can be obtained by translating those 10 pixels according to the offset.
  • the first real offset may be information indicating the real offset between the roof and the base of the building in the captured image.
  • the first real offset may be information in the form of (x, y) vector.
  • x and y represent the offsets of the pixel points in the roof region and the corresponding pixel points in the base region in the x-axis and y-axis directions, respectively.
  • the offset may be marked in advance according to the real offset between the roof and the base of the building in the collected image. The present disclosure does not specifically limit the labeling manner of the offset.
  • the preset angle can be set according to business requirements.
  • the number of preset angles can be determined according to the sample size that needs to be expanded. For example, if a large number of samples need to be expanded, a large number of preset angles can be set.
  • the present disclosure does not specifically limit the value and quantity of the preset angles.
  • the multiple preset angles are used to rotate the captured image or the image features corresponding to the captured image.
  • each preset angle may be used to generate a corresponding rotation matrix. Then, for each preset angle, the rotation matrix corresponding to that angle is used to shift each pixel included in the collected image, obtaining a rotated collected image, that is, a rotated image. Afterwards, the rotated collected image can be input into the offset extraction network to extract the second predicted offset corresponding to that preset angle, thereby obtaining the second predicted offsets corresponding to the various preset angles.
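The per-angle rotation step can be sketched as follows (a numpy illustration using nearest-neighbour sampling about the image centre; the rotation convention, the zero-filling of out-of-range pixels, and the helper name are assumptions, and practical implementations usually use library routines with interpolation):

```python
import numpy as np

def rotate_image(img, angle_deg):
    """Rotate a single-channel image about its centre using a 2x2
    rotation matrix, sampling sources with nearest-neighbour lookup."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    out = np.zeros_like(img)
    for y in range(h):
        for x in range(w):
            # inverse rotation: which source pixel lands on (x, y)?
            xs = cos_t * (x - cx) + sin_t * (y - cy) + cx
            ys = -sin_t * (x - cx) + cos_t * (y - cy) + cy
            xi, yi = int(round(xs)), int(round(ys))
            if 0 <= xi < w and 0 <= yi < h:
                out[y, x] = img[yi, xi]
    return out
```

Calling `rotate_image` once per preset angle yields the multiple rotated images that are fed to the offset extraction network.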
  • in other examples, the feature extraction network included in the offset extraction network can be used to perform feature extraction on the collected image to obtain a first image feature, and the obtained first image feature is then rotated. This reduces the amount of computation in the rotation process and reduces the rotation error introduced when extracting features from a rotated image, which helps to improve the network training effect.
  • S404 may be executed to respectively rotate the first real offset by the various preset angles to obtain second real offsets respectively corresponding to the various preset angles.
  • the rotation matrices corresponding to the respective preset angles can be used to rotate the first real offset of the collected image, so as to obtain the second real offsets of the collected image respectively corresponding to the multiple preset angles.
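S404 amounts to a plain 2-D vector rotation (numpy sketch; the sign convention must match whichever convention is used to rotate the image in S402, which is an assumption here):

```python
import numpy as np

def rotate_offset(offset, angle_deg):
    """Rotate the (dx, dy) roof-to-base offset by the same preset
    angle that was applied to the collected image."""
    theta = np.deg2rad(angle_deg)
    c, s = np.cos(theta), np.sin(theta)
    dx, dy = offset
    return (c * dx - s * dy, s * dx + c * dy)

# a first real offset of (1, 0) rotated by 90 degrees becomes (0, 1)
dx2, dy2 = rotate_offset((1.0, 0.0), 90)
```

Because rotating the image rotates the roof-to-base displacement by exactly the same angle, this rotated vector serves directly as the second real offset supervising the rotated sample.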
  • S406 may be executed to adjust network parameters of the offset extraction network based on the second real offset corresponding to the various preset angles and the obtained second predicted offset.
  • a preset loss function (such as a cross-entropy loss function) may be used: for each preset angle, the offset loss information corresponding to the collected image rotated by that preset angle is obtained from the second real offset (obtained by rotating the first real offset of the collected image by that preset angle) and the obtained second predicted offset corresponding to that preset angle.
  • the total loss is determined by methods such as summation, product, and average, and the descent gradient is calculated using the determined total loss.
  • the network parameters of the offset extraction network are adjusted by backpropagation.
  • the offset extraction network can be used to obtain the second predicted offsets respectively corresponding to the various preset angles, and the first real offset can be rotated by the various preset angles to obtain the second real offsets respectively corresponding to the various preset angles; the network parameters of the offset extraction network are then adjusted using the second real offsets and the obtained second predicted offsets corresponding to the various preset angles.
  • when the collected image is rotated by an angle, the offset between the roof and the base also rotates by the same angle.
  • the effect of expanding the image sample with the real offset can be achieved. In this way, a small amount of labeled data with offsets can be used to train a high-precision offset extraction network.
  • the rotation process of the collected image can be placed inside the offset extraction network, so that the collected image is rotated within the offset extraction network without affecting the training of the other branches; that is, it will not affect the convergence speed of the other branches, thereby improving network training efficiency.
  • S4022 may be executed: for each preset angle among the various preset angles, the offset extraction network is used to rotate the first image feature corresponding to the collected image by that preset angle, so as to obtain the second image feature corresponding to that preset angle. Then S4024 may be executed to obtain the second predicted offset corresponding to that preset angle based on the second image feature.
  • the first image feature may refer to an image feature obtained after the collected image undergoes feature extraction processing such as several convolutional layers and pooling layers.
  • the offset extraction network can be a network constructed based on MASK-RCNN.
  • the offset extraction network can perform feature extraction on the collected image through the included backbone network and the RoI Align unit to obtain the first image feature.
  • the aforementioned image features may be characterized by a feature map.
  • the positions of the pixels in the first image feature can be transformed through the rotation matrices corresponding to the various preset angles, so as to obtain second image features respectively corresponding to the various preset angles.
  • the second image feature can then be processed by, for example, several convolutional layers, pooling layers, fully connected layers and a mapping unit (for example, softmax) to obtain the offset.
  • the collected image is only rotated within the offset extraction network, and the unrotated collected image is still used for training the roof area extraction network. In this way, the rotation of the collected image is confined to the offset extraction network, so that it does not affect the training of the other branches.
  • in order to facilitate the training of the offset extraction network, a spatial transformer network can be used to rotate the image, so that the rotation process becomes differentiable, gradients can be backpropagated normally, and the network can be trained directly.
  • building frame information can also be introduced during network training to form constraints on network training, thereby improving network training efficiency and helping the feature extraction network to extract features related to buildings.
  • the at least part of the collected images in the training sample set are also marked with true value information of the building frame.
  • the building frame information may be the coordinates of the central pixel point in the building area, and information such as the width and height of the building area.
  • the building frame extraction network included in the building base extraction network can be used to extract the building frames corresponding to the at least part of the collected images; wherein the building frame extraction network includes the feature extraction network. Then the network parameters of the building frame extraction network may be adjusted based on the building frame ground truth information marked on the at least part of the collected images and the building frames obtained for the at least part of the collected images.
  • the building frame information can be introduced during network training. Since the four extraction networks for the roof area, roof position, offset and building frame share the feature extraction network, on the one hand the four extraction networks become interrelated through the shared feature extraction network, so that the supervision information of each task can be shared and the convergence of the network can be accelerated; on the other hand, the three extraction networks for roof area, roof position and offset can perceive the features of the complete building region, which improves their extraction performance.
  • the network training efficiency can be improved through pre-training.
  • pre-training may be performed on the building base extraction network by using the collected images labeled with the ground truth information of the second roof area, the real offset and the truth information of the roof position in the training sample set.
  • the pre-training process may refer to the network training process shown in any of the foregoing implementation manners.
  • joint training may also be used in pre-training.
  • at least part of the collected images of the training sample set may include the true value information of six items of roof area, roof position, base area, base position, offset, and building frame.
  • the base extraction network may include six extraction networks, sharing a feature extraction network, for the roof area, roof position, offset, building frame, base area loss information, and base position loss information.
  • the six extraction networks can be used as six branches of the base extraction network.
  • the base area loss information may be equivalently represented as the roof area loss information.
  • At least part of the collected images of the training sample set may be input into the base extraction network to obtain the output results of the six branches. Then, the loss information can be obtained according to the aforementioned six items of true value information labeled with the at least part of the collected images, and the output results, and then the network parameters can be updated. In this way, the six branches can be jointly trained to improve the training efficiency and training effect of the base extraction network.
  • the labeled collected images and the unlabeled images in the training sample set can be mixed and randomly input into the network for training.
  • the marked captured image may refer to at least part of the captured image marked with the aforementioned six items of true value information.
  • a reasonable network training scheme can thus be formed: first, the network is systematically pre-trained through joint training using the labeled images with rich ground truth information, and then the labeled images are mixed with the unlabeled images to fine-tune the network parameters of the base extraction network. On the one hand, this helps to train a high-precision base extraction network using only a small number of labeled collected images; on the other hand, it improves the efficiency of network training.
  • Embodiments are described below in conjunction with specific training scenarios.
  • FIG. 9 is a schematic diagram of a network training process for building base extraction shown in the present disclosure.
  • the training method in this example can be deployed in any type of electronic device.
  • the base extraction network shown in FIG. 9 includes a network constructed based on MASK-RCNN.
  • the network may include six branches that respectively obtain the roof area, roof position, offset, building frame, base area loss information, and base position loss information.
  • the six branches share the backbone network, the RPN candidate frame generation network (hereinafter referred to as RPN), and the RoI Align region feature extraction unit (hereinafter referred to as RoI Align).
  • the backbone network can be a VGG (Visual Geometry Group) network, ResNet (Residual Network), HRNet (High-Resolution Network), etc., which is not specifically limited in the present disclosure.
  • the labeled image may include ground truth information of six items: roof area, roof position, base area, base position, offset, and building frame. It can be understood that, since the shape and position of the base area of the same building do not change across multi-temporal images, the unlabeled images among the multi-temporal images can share the ground truth information of the base area and base position with the labeled image.
  • the base extraction network may be pre-trained by using labeled images first through joint training.
  • the loss information corresponding to the four branches can be obtained from the ground truth information of the four items (roof area, roof position, offset, and building frame) corresponding to the labeled image, and the network parameters of the four branches can be updated through backpropagation.
  • the base area and base position loss information can be obtained through the base area loss information determination branch and the base position loss information determination branch, and the network parameters of the three branches that extract the roof area, roof position, and offset can be adjusted through backpropagation.
  • in this way, the training processes not only constrain each other but also promote each other, thereby improving network training efficiency, so that a network with a good extraction effect can be initially obtained with only a small number of labeled images.
  • labeled images and unlabeled images can be mixed, and randomly input to the base extraction network for training.
  • joint training similar to the pre-training process can be performed.
  • the base extraction network can be used to obtain the roof area, roof position, and offset corresponding to each unlabeled image. Then the base area loss information determination branch and the base position loss information determination branch, together with the shared base area and base position ground truth information, can be used to obtain the base area and base position loss information, and the network parameters of the three branches that extract the roof area, roof position, and offset can be updated through backpropagation.
  • fine-tuning the parameters of the pre-trained network can obtain a high-precision base extraction network.
  • the scheme of performing pre-training first and then performing mixed training through joint training can, first, improve network training efficiency, so that a network with a good extraction effect can be obtained with a small number of labeled images and the dependence on labeling work can be reduced; second, it can promote the shared feature extraction network (including the backbone network and the region feature extraction unit) to extract features that are more beneficial to base area extraction, thereby improving the accuracy of base area extraction; third, the three branches that extract the roof area, roof position, and offset can perceive the features of the complete building region, thereby improving their extraction performance.
  • the building base can be extracted from the remote sensing image to be processed through the network.
  • the specific implementation process may include:
  • using the building base extraction network to extract the building roof area and the offset in the remote sensing image to be processed; wherein the building base extraction network is obtained by training through the neural network training method shown in any of the aforementioned implementations, and the offset characterizes the offset between the roof area and the base area;
  • the translation transformation is performed on the roof area by using the offset to obtain the building base area corresponding to the remote sensing image to be processed.
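The inference-time translation can be sketched at pixel level (illustrative coordinates; in practice the roof pixels and the offset are produced by the trained building base extraction network):

```python
def translate_roof_to_base(roof_pixels, dx, dy):
    """Shift every roof pixel by the predicted offset (dx, dy)
    to obtain the corresponding base area pixels."""
    return [(x + dx, y + dy) for (x, y) in roof_pixels]

roof_pixels = [(10, 4), (11, 4), (10, 5), (11, 5)]  # a tiny 2x2 roof patch
base_pixels = translate_roof_to_base(roof_pixels, dx=-2, dy=3)
```

The base area is thus the roof area rigidly shifted by the predicted roof-to-base displacement; its shape is unchanged.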
  • the remote sensing image to be processed may be a remote sensing image collected by any collection device capable of collecting images of buildings.
  • the trained building base extraction network may be a network as shown in FIG. 9 .
  • a small number of labeled samples can be used to train a high-precision building base extraction network, which can reduce network training costs, improve network training efficiency, and further reduce base extraction costs.
  • the high-precision base extraction network can be used for base extraction to improve the accuracy of building base extraction, thereby improving the statistical accuracy of buildings.
  • the present disclosure also proposes a neural network training device 100 .
  • FIG. 10 is a schematic structural diagram of a neural network training device shown in the present disclosure.
  • the device 100 may include:
  • the acquisition module 101 is configured to acquire, for each of the multiple areas, one or more frames of collected images corresponding to the area; wherein, in the case where the area corresponds to multiple frames of collected images, at least two frames of the collected images have different collection angles;
  • the first labeling module 102 is configured to use a frame of the captured image corresponding to the area as a target captured image corresponding to the area to label the true value information of the base area;
  • the first determination module 103 is configured to determine the ground truth information of the base area marked in the target collected image corresponding to the area as the base area ground truth information of each frame of collected image corresponding to the area, and to obtain a training sample set based on the collected images and target collected images respectively corresponding to the multiple areas, so as to perform neural network training based on the training sample set.
  • the device 100 further includes:
  • a first training module 106 configured to acquire the training sample set, and to use the building base extraction network to obtain the roof area and offset corresponding to each collected image in the training sample set; wherein the offset represents the offset between the roof area and the base area;
  • the device 100 further includes:
  • the second labeling module 104 is configured to label the base position true value information on the target acquisition images respectively corresponding to each area;
  • the second determination module 105 is configured to, for each region, determine the true value information of the base position marked in the target captured image corresponding to the region as the true value information of the base position of each frame captured image corresponding to the region.
  • the device 100 further includes:
  • a second training module 107 configured to acquire the training sample set, and to use the roof area extraction network, offset extraction network, and roof position extraction network included in the building base extraction network to obtain the roof area, offset, and roof position corresponding to each collected image in the training sample set, wherein the offset characterizes the offset between the roof area and the base area;
  • the second training module 107 is used to:
  • the ground truth information of the base area corresponding to the image is translated to obtain the ground truth information of the first roof area corresponding to the image;
  • the network parameters of the roof area extraction network are adjusted through back propagation.
  • the second training module 107 is used to:
  • the position of the roof corresponding to the image is translated to obtain the position of the base corresponding to the image;
  • the network parameters of the roof position extraction network and the offset extraction network are adjusted through back propagation.
  • the roof area extraction network, the offset extraction network and the roof position extraction network share a feature extraction network.
  • At least part of the collected images of the training sample set are also marked with the second roof area ground truth information, the real offset and the roof position ground truth information;
  • the device 100 also includes at least one of the following:
  • a first adjustment module configured to adjust the network parameters of the roof area extraction network based on the ground truth information of the second roof area marked on the at least part of the captured image and the roof area obtained for the at least part of the captured image;
  • the second adjustment module is configured to adjust the network parameters of the offset extraction network based on the real offset marked by the at least part of the captured image and the offset obtained for the at least part of the captured image;
  • the third adjustment module is configured to adjust the network parameters of the roof position extraction network based on the roof position ground truth information marked on the at least part of the collected images and the roof position obtained for the at least part of the collected images.
  • the at least part of the collected images are also marked with the true value information of the building frame; the device 100 also includes:
  • the extraction module is configured to use the building frame extraction network included in the building base extraction network to extract the building frame corresponding to the at least part of the collected images; wherein the building frame extraction network includes the feature extraction network;
  • the fourth adjustment module is configured to adjust the network parameters of the building frame extraction network based on the true value information of the building frame marked on the at least part of the collected images and the building frame obtained for the at least part of the collected images.
  • the device 100 further includes:
  • a pre-training module configured to pre-train the building base extraction network by using the collected images in the training sample set that are marked with the second roof area ground truth information, the real offset, and the roof position ground truth information.
  • the collected images in the training sample set are marked with the first real offset; the device also includes:
  • An offset obtaining module configured to use the offset extraction network to obtain second predicted offsets respectively corresponding to various preset angles from a plurality of rotated images; the second predicted offset indicates the an offset between the roof and the base in the rotated image; the plurality of rotated images are obtained by rotating the collected images respectively through the various preset angles;
  • a selection module configured to rotate the first real offset by the multiple preset angles to obtain second real offsets respectively corresponding to the multiple preset angles;
  • the fourth adjustment module is configured to adjust the network parameters of the offset extraction network based on the second real offset corresponding to the various preset angles and the obtained second predicted offset.
  • the offset acquisition module is specifically configured to: for each preset angle of the plurality of preset angles, use the offset extraction network to rotate the first image feature corresponding to the collected image by that preset angle, so as to obtain the second image feature corresponding to that preset angle;
  • the present disclosure further proposes an image processing device.
  • the device can include:
  • a receiving module configured to receive remote sensing images to be processed
  • the extraction module is configured to use the building base extraction network to extract the building roof area and the offset in the remote sensing image to be processed; wherein the building base extraction network is trained by using the neural network training method shown in any of the foregoing implementations, and the offset characterizes the offset between the roof area and the base area;
  • a translation module configured to use the offset to perform translation transformation on the roof area to obtain the building base area corresponding to the remote sensing image to be processed.
  • an electronic device which may include: a processor.
  • Memory used to store processor-executable instructions.
  • the processor is configured to invoke the executable instructions stored in the memory to implement the aforementioned neural network training method and/or image processing method.
  • FIG. 11 is a schematic diagram of a hardware structure of an electronic device shown in the present disclosure.
  • the electronic device may include a processor for executing instructions, a network interface for network connection, a memory for storing operating data for the processor, and a non-volatile memory for storing instructions corresponding to the neural network training device and/or the image processing device.
  • the embodiment of the device may be implemented by software, or by hardware or a combination of software and hardware.
  • taking software implementation as an example, a device in the logical sense is formed by the processor of the electronic device where it is located reading the corresponding computer program instructions from the non-volatile memory into memory for execution. In terms of hardware, in addition to the processor, memory, network interface, and non-volatile memory, the electronic device where the device of the embodiment is located may also include other hardware according to the actual function of the electronic device, which will not be detailed here.
  • the device corresponding instructions may also be directly stored in the memory, which is not limited herein.
  • the present disclosure proposes a computer-readable storage medium storing a computer program, and the computer program can be used to cause a processor to execute the aforementioned neural network training method and/or image processing method.
  • one or more embodiments of the present disclosure may be provided as a method, a system or a computer program product. Accordingly, one or more embodiments of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present disclosure may take the form of a computer program product implemented on one or more computer-usable storage media (which may include, but are not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • Embodiments of the subject matter and functional operations described in this disclosure can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware that may include the structures disclosed in this disclosure and their structural equivalents, or in a combination of one or more of them.
  • Embodiments of the subject matter described in this disclosure can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus.
  • alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver apparatus for execution by data processing apparatus.
  • a computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
  • the processes and logic flows described in this disclosure can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output.
  • the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, such as an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit).
  • Computers suitable for the execution of a computer program may include, for example, general-purpose and/or special-purpose microprocessors, or any other type of central processing unit.
  • a central processing unit will receive instructions and data from a read-only memory and/or a random access memory.
  • the basic components of a computer may include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data.
  • a computer will also include, or be operatively coupled to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks or optical disks, to receive data from them, send data to them, or both.
  • a computer may be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
  • Computer-readable media suitable for storing computer program instructions and data may include all forms of non-volatile memory, media and memory devices, which may include, for example, semiconductor memory devices such as EPROM, EEPROM and flash memory devices, magnetic disks such as internal hard drives or removable disks, magneto-optical disks, and CD-ROM and DVD-ROM disks.
  • the processor and memory can be supplemented by, or incorporated in, special purpose logic circuitry.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

Methods for neural network training and image processing, an apparatus, a device, and a storage medium. The method for neural network training comprises: for each area among multiple areas, acquiring one or more captured images corresponding to the area, wherein, if the area corresponds to multiple captured images, at least two captured images have different capture angles (S102); using one captured image corresponding to the area as a target captured image corresponding to the area, and performing base area ground truth information tagging (S104); for each area, determining base area ground truth information tagged in the target captured image corresponding to the area to be base area ground truth information for each captured image corresponding to the area, and obtaining a training sample set on the basis of the captured images and the target captured images corresponding to each of the multiple areas, so as to perform neural network training on the basis of the training sample set (S106).

Description

Methods for Neural Network Training and Image Processing, Apparatus, Device and Storage Medium

Cross-Reference to Related Application

This disclosure claims priority to the Chinese patent application No. 202110602248.5, filed on May 31, 2021, the entire content of which is incorporated herein by reference.

Technical Field

The present disclosure relates to the field of computer technology, and in particular to methods for neural network training and image processing, an apparatus, a device and a storage medium.
Background

With the gradual increase of the urbanization rate, timely building statistics are required for tasks such as urban planning, map drawing, and building change monitoring.

At present, a building base extraction network generated based on a neural network is mainly used to extract building bases from remote sensing images, and the obtained building bases are then used for building statistics.

However, the cost of data annotation is high, so labeled samples cannot be obtained in large quantities, and it is difficult to train a high-precision building base extraction network with only a small number of labeled samples.
Summary

In view of this, the present disclosure discloses at least a neural network training method. The method may include: for each of multiple areas, acquiring one or more frames of captured images corresponding to the area, wherein, in the case that the area corresponds to multiple frames of captured images, at least two frames of the captured images have different capture angles; using one frame of the captured images corresponding to the area as the target captured image corresponding to the area and tagging it with base area ground truth information; and determining the base area ground truth information tagged in the target captured image corresponding to the area as the base area ground truth information of each frame of captured image corresponding to the area, and obtaining a training sample set based on the captured images and the target captured images respectively corresponding to the multiple areas, so as to perform neural network training based on the training sample set.
In some illustrated implementations, the method further includes: acquiring the training sample set; using a building base extraction network to obtain the roof area and the offset corresponding to each captured image in the training sample set, wherein the offset characterizes the offset between the roof area and the base area; for each captured image, performing a translation transformation on the roof area corresponding to the captured image based on the obtained offset corresponding to the captured image, to obtain the base area corresponding to the captured image; and adjusting the network parameters of the building base extraction network based on the base area ground truth information respectively corresponding to the captured images and the base areas respectively obtained for the captured images.
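As one way to realize the parameter adjustment described above, the base mask obtained by translating the predicted roof mask can be compared with the base-area ground truth via a segmentation loss. The sketch below uses a Dice loss over binary masks; this particular loss is an illustrative assumption, as the disclosure does not name one.

```python
import numpy as np

def dice_loss(pred_base: np.ndarray, gt_base: np.ndarray, eps: float = 1e-6) -> float:
    """1 - Dice coefficient between the predicted base mask (the roof mask
    translated by the predicted offset) and the base-area ground-truth mask."""
    inter = float((pred_base * gt_base).sum())
    total = float(pred_base.sum()) + float(gt_base.sum())
    return 1.0 - (2.0 * inter + eps) / (total + eps)
```

In an actual training loop the scalar returned here would feed backpropagation; gradient computation is omitted from this numpy sketch.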
In some illustrated implementations, the process of obtaining the training sample set further includes: tagging base position ground truth information in the target captured image corresponding to each area; and, for each area, determining the base position ground truth information tagged in the target captured image corresponding to the area as the base position ground truth information of each frame of captured image corresponding to the area.
In some illustrated implementations, the method further includes: acquiring the training sample set; using a roof area extraction network, an offset extraction network and a roof position extraction network included in the building base extraction network to obtain the roof area, the offset and the roof position corresponding to each captured image in the training sample set, wherein the offset characterizes the offset between the roof area and the base area; adjusting the network parameters of the roof area extraction network based on the base area ground truth information respectively corresponding to the captured images and the roof areas and offsets respectively obtained for the captured images; and adjusting the network parameters of the roof position extraction network and the offset extraction network based on the base position ground truth information respectively corresponding to the captured images and the roof positions and offsets respectively obtained for the captured images.
In some illustrated implementations, adjusting the network parameters of the roof area extraction network based on the base area ground truth information respectively corresponding to the captured images and the roof areas and offsets respectively obtained for the captured images includes: for each frame of the captured images, translating the base area ground truth information corresponding to the image by the offset corresponding to the image to obtain first roof area ground truth information corresponding to the image; obtaining roof area loss information corresponding to the image based on the first roof area ground truth information corresponding to the image and the roof area obtained for the image; and adjusting the network parameters of the roof area extraction network through backpropagation based on the roof area loss information respectively corresponding to the captured images.
In some illustrated implementations, adjusting the network parameters of the roof position extraction network and the offset extraction network based on the base position ground truth information respectively corresponding to the captured images and the roof positions and offsets respectively obtained for the captured images includes: for each frame of the captured images, translating the roof position corresponding to the image by the offset corresponding to the image to obtain the base position corresponding to the image; obtaining base position loss information corresponding to the image based on the base position ground truth information corresponding to the image and the base position obtained for the image; and adjusting the network parameters of the roof position extraction network and the offset extraction network through backpropagation based on the base position loss information respectively corresponding to the captured images.
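A minimal sketch of the position-branch loss just described: the predicted roof position (here assumed to be an axis-aligned box (x1, y1, x2, y2)) is translated by the predicted (dx, dy) offset and compared with the base-position ground truth using a smooth-L1 loss. Both the box parameterization and the smooth-L1 choice are illustrative assumptions, not specified by the disclosure.

```python
import numpy as np

def base_position_loss(roof_box: np.ndarray, offset: np.ndarray,
                       base_box_gt: np.ndarray) -> float:
    """Smooth-L1 loss between the roof box translated by the predicted offset
    and the ground-truth base box. Boxes are (x1, y1, x2, y2); offset is (dx, dy)."""
    pred_base = roof_box + np.tile(offset, 2)  # shift both corners by (dx, dy)
    diff = np.abs(pred_base - base_box_gt)
    return float(np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).mean())
```

Because the loss depends on the offset through the translation, backpropagating it updates both the roof position branch and the offset branch, matching the joint adjustment described above.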
In some illustrated implementations, the roof area extraction network, the offset extraction network and the roof position extraction network share a feature extraction network.
In some illustrated implementations, at least some of the captured images in the training sample set are further tagged with second roof area ground truth information, a real offset and roof position ground truth information, and the method further includes at least one of the following: adjusting the network parameters of the roof area extraction network based on the second roof area ground truth information tagged in the at least some captured images and the roof areas obtained for the at least some captured images; adjusting the network parameters of the offset extraction network based on the real offsets tagged in the at least some captured images and the offsets obtained for the at least some captured images; and adjusting the network parameters of the roof position extraction network based on the roof position ground truth information tagged in the at least some captured images and the roof positions obtained for the at least some captured images.
In some illustrated implementations, the at least some captured images are further tagged with building bounding box ground truth information, and the method further includes: using a building bounding box extraction network included in the building base extraction network to extract the building bounding boxes corresponding to the at least some captured images, wherein the building bounding box extraction network includes the feature extraction network; and adjusting the network parameters of the building bounding box extraction network based on the building bounding box ground truth information tagged in the at least some captured images and the building bounding boxes obtained for the at least some captured images.
In some illustrated implementations, the method further includes: pre-training the building base extraction network with the captured images in the training sample set that are tagged with second roof area ground truth information, a real offset and roof position ground truth information.
In some illustrated implementations, the captured images in the training sample set are tagged with a first real offset, and the method further includes: using the offset extraction network to obtain, from multiple rotated images, second predicted offsets respectively corresponding to multiple preset angles, where a second predicted offset indicates the offset between the roof and the base in a rotated image, and the multiple rotated images are obtained by rotating the captured image by the multiple preset angles respectively; rotating the first real offset by the multiple preset angles respectively to obtain second real offsets respectively corresponding to the multiple preset angles; and adjusting the network parameters of the offset extraction network based on the second real offsets respectively corresponding to the multiple preset angles and the obtained second predicted offsets.
In some illustrated implementations, using the offset extraction network to obtain, from the multiple rotated images, the second predicted offsets respectively corresponding to the multiple preset angles includes: for each of the multiple preset angles, using the offset extraction network to rotate the first image feature corresponding to the captured image by the preset angle to obtain a second image feature corresponding to the preset angle, and obtaining the second predicted offset corresponding to the preset angle based on the second image feature.
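Generating the rotation-consistency targets described above requires rotating the annotated (dx, dy) offset by each preset angle. A sketch with a standard 2-D rotation matrix follows; note that in image coordinates (y pointing down) a counter-clockwise rotation of the image corresponds to this matrix with the angle's sign flipped, so the sign convention must be kept consistent with the image rotation actually applied.

```python
import numpy as np

def rotate_offset(offset, angle_deg: float) -> np.ndarray:
    """Rotate a 2-D (dx, dy) offset counter-clockwise by angle_deg degrees."""
    t = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(t), -np.sin(t)],
                    [np.sin(t),  np.cos(t)]])
    return rot @ np.asarray(offset, dtype=float)
```

With this helper, `[rotate_offset(first_real_offset, a) for a in preset_angles]` yields the second real offsets that are paired with the second predicted offsets when adjusting the offset extraction network.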
The present disclosure further proposes an image processing method, including: receiving a remote sensing image to be processed; using a building base extraction network to extract the building roof area and the offset from the remote sensing image to be processed, wherein the building base extraction network is trained by the neural network training method shown in any of the foregoing implementations, and the offset characterizes the offset between the roof area and the base area; and performing a translation transformation on the roof area by the offset to obtain the building base area corresponding to the remote sensing image to be processed.
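The final translation step of this image processing method — shifting the extracted roof mask by the predicted offset to obtain the base mask — can be sketched as below. The integer-pixel shift with zero padding is an illustrative assumption; a real implementation might use sub-pixel (bilinear) interpolation instead.

```python
import numpy as np

def shift_mask(mask: np.ndarray, dx: int, dy: int) -> np.ndarray:
    """Translate a binary mask by (dx, dy) pixels, zero-filling uncovered areas."""
    h, w = mask.shape
    out = np.zeros_like(mask)
    src_y = slice(max(0, -dy), min(h, h - dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_y = slice(max(0, dy), min(h, h + dy))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = mask[src_y, src_x]
    return out
```

Given a predicted roof mask and offset, `base_mask = shift_mask(roof_mask, dx, dy)` yields the building base area for the remote sensing image.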
The present disclosure further proposes a neural network training apparatus, including: an acquisition module configured to acquire, for each of multiple areas, one or more frames of captured images corresponding to the area, wherein, in the case that the area corresponds to multiple frames of captured images, at least two frames of the captured images have different capture angles; a first tagging module configured to use one frame of the captured images corresponding to the area as the target captured image corresponding to the area and tag it with base area ground truth information; and a first determination module configured to determine the base area ground truth information tagged in the target captured image corresponding to the area as the base area ground truth information of each frame of captured image corresponding to the area, and obtain a training sample set based on the captured images and the target captured images respectively corresponding to the multiple areas, so as to perform neural network training based on the training sample set.

The present disclosure further proposes an image processing apparatus, including: a receiving module configured to receive a remote sensing image to be processed; an extraction module configured to use a building base extraction network to extract the building roof area and the offset from the remote sensing image to be processed, wherein the building base extraction network is trained by the neural network training method shown in any of the foregoing implementations, and the offset characterizes the offset between the roof area and the base area; and a translation module configured to perform a translation transformation on the roof area by the offset to obtain the building base area corresponding to the remote sensing image to be processed.

The present disclosure further proposes an electronic device, including: a processor; and a memory for storing processor-executable instructions, wherein the processor runs the executable instructions to implement any of the above neural network training methods and/or image processing methods.

The present disclosure further proposes a computer-readable storage medium storing a computer program, and the computer program is used to cause a processor to execute any of the above neural network training methods and/or image processing methods.
In the foregoing solutions, first, since the building bases in the same area do not change, after image registration is performed on the captured images collected for the same area, the base areas and positions of the buildings in the captured images are the same. That is, tagging base area ground truth information in the target captured image of an area can be regarded as tagging base area ground truth information in each frame of captured image of that area, thereby expanding the samples, i.e., obtaining a large number of training samples with a small amount of annotation work.

Second, the training sample set obtained by expanding samples based on the property that the base area of the same building does not change can be used to train the building base extraction network, which helps to train a high-precision building base extraction network with a small number of labeled samples.

It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present disclosure.
Brief Description of the Drawings

In order to more clearly illustrate the technical solutions in one or more embodiments of the present disclosure or in the related art, the drawings needed in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings in the following description are only some of the embodiments described in one or more embodiments of the present disclosure, and those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a flowchart of a neural network training method according to the present disclosure;

FIG. 2 is a schematic flowchart of a neural network training method according to the present disclosure;

FIG. 3 is a schematic diagram of a building base area extraction process according to the present disclosure;

FIG. 4 is a schematic diagram of a building base area extraction process according to the present disclosure;

FIG. 5 is a schematic flowchart of a neural network training method according to the present disclosure;

FIG. 6 is a flowchart of a neural network training method according to the present disclosure;

FIG. 7 is a schematic flowchart of a neural network training method according to the present disclosure;

FIG. 8 is a schematic diagram of a building base extraction network training process according to the present disclosure;

FIG. 9 is a schematic diagram of a building base extraction network training process according to the present disclosure;

FIG. 10 is a schematic structural diagram of a neural network training apparatus according to the present disclosure;

FIG. 11 is a schematic diagram of a hardware structure of an electronic device according to the present disclosure.
Detailed Description

Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.

The terms used in the present disclosure are for the purpose of describing particular embodiments only and are not intended to limit the present disclosure. The singular forms "a", "said" and "the" used in the present disclosure and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any or all possible combinations of one or more of the associated listed items. It should also be understood that the word "if" as used herein may be interpreted as "at the time of", "when" or "in response to determining", depending on the context.

The present disclosure aims to propose a neural network training method. The method takes advantage of the property that the base area of the same building does not change, and shares base area ground truth information among multiple frames of captured images corresponding to the same area, thereby expanding the training samples, which in turn helps to train a high-precision building base extraction network with a small number of labeled samples.

Please refer to FIG. 1, which is a flowchart of a neural network training method according to the present disclosure. The neural network training method can be applied to an electronic device, and the electronic device can execute the method by carrying a software apparatus corresponding to the neural network training method. The electronic device may be a notebook computer, a computer, a server, a mobile phone, a PAD terminal, or the like; the type of the electronic device is not particularly limited in the present disclosure. The electronic device may be a client device or a server device, and the server device may be a cloud. In the following, the execution subject is taken to be the electronic device (hereinafter referred to as the device) for description.
As shown in FIG. 1, the method may include:

S102: for each of multiple areas, acquiring one or more frames of captured images corresponding to the area, wherein, in the case that the area corresponds to multiple frames of captured images, at least two frames of the captured images have different capture angles.

The captured images may be collected by any image capture device capable of capturing images of the multiple areas. Among the multiple frames of captured images collected for the same area, there are at least two frames with different capture angles, which enriches the information contained in the training samples and improves the adaptability of the neural network.

The captured images may be stored in a storage medium classified by area, and the device may acquire the captured images from the storage medium.

In some implementations, the captured images may include multi-temporal images collected for the multiple areas. The multi-temporal images may refer to multiple frames of remote sensing images collected for the same area at different times.
S104: Use one frame of the captured images corresponding to the region as the target captured image for that region, and annotate it with base-area ground-truth information.
The target captured image may be any image of sufficient clarity selected from the one or more frames of captured images corresponding to the region.
In some implementations, at least one frame may be selected from the captured images corresponding to each region as the target captured image, and the base-area ground-truth information is then annotated on it in advance.
The base-area ground-truth information may be pixel-level ground truth: for example, pixels inside a building's base area in the remote-sensing image are set to 1, and pixels outside the base area are set to 0.
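As a minimal sketch of such pixel-level ground truth (NumPy; the 8×8 image size and rectangular base region are illustrative assumptions — real annotations would typically be rasterized polygons), a base-area mask can be built as:

```python
import numpy as np

def make_base_mask(image_shape, base_box):
    """Build a pixel-level ground-truth mask: 1 inside the base area, 0 outside.

    base_box is a hypothetical (row_min, row_max, col_min, col_max) rectangle
    standing in for an annotated building base.
    """
    mask = np.zeros(image_shape, dtype=np.uint8)
    r0, r1, c0, c1 = base_box
    mask[r0:r1, c0:c1] = 1
    return mask

mask = make_base_mask((8, 8), (2, 5, 3, 6))  # 3x3 base region inside an 8x8 image
```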
S106: For each region, take the base-area ground-truth information annotated on the region's target captured image as the base-area ground-truth information of every frame of captured images corresponding to the region; based on the captured images and target captured images corresponding to the multiple regions, obtain a training sample set and train the neural network on it.
In some implementations, the base-area ground-truth information annotated in S104 for each region's target captured image can be used as the ground-truth information of every captured image of that region, thereby expanding the training samples.
Because the base of a building in a given region does not change, once the captured images of that region are registered, the base area and position of each building are identical across those images. Annotating base-area ground truth on any one frame (the region's target captured image) is therefore equivalent to annotating every frame captured for that region, so a large number of training samples can be obtained from a small number of annotation operations.
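The label-sharing expansion above can be sketched as follows (a toy illustration; the region identifiers, frame names, and dictionary layout are assumptions, not part of the disclosure):

```python
def expand_samples(region_frames, region_annotations):
    """Share one annotation across all registered frames of the same region.

    region_frames: {region_id: [frame, ...]} - registered captured images.
    region_annotations: {region_id: base-area ground truth annotated on one
    target frame of that region}.
    Returns one (frame, ground_truth) training pair per frame.
    """
    samples = []
    for region_id, frames in region_frames.items():
        gt = region_annotations[region_id]  # annotated only once per region
        for frame in frames:
            samples.append((frame, gt))     # shared by every frame of the region
    return samples

# One annotation yields three training pairs for a region with three frames.
pairs = expand_samples({"r1": ["t0", "t1", "t2"]}, {"r1": "mask_r1"})
```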
In some implementations, neural network training may be performed on the resulting training sample set.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of a neural network training method disclosed herein.
As shown in FIG. 2, the method includes:
S202: Acquire the training sample set.
S204: Use the building-base extraction network to extract the base area corresponding to each captured image in the training sample set.
S206: Adjust the network parameters of the building-base extraction network based on the base-area ground-truth information of each captured image and the base area obtained for each captured image.
In some implementations, the device may execute S202 in response to a network-training request.
In some implementations, the training sample set may be stored in a storage medium, from which the device retrieves it. The device may then execute S204-S206.
The present disclosure includes at least two ways of extracting building bases. First, a building-base extraction network (hereinafter, the base extraction network) can extract the building base directly. Second, the base extraction network can first extract the building roof and an offset indicating the displacement between roof and base, and then obtain the base indirectly by transforming the roof with the offset.
The base extraction network is trained differently for the two approaches. Embodiments of each are described below.
(1) Direct extraction of the building base.
Referring to FIG. 3, FIG. 3 is a schematic diagram of a building-base-area extraction process shown in the present disclosure.
As shown in FIG. 3, the base area can be obtained directly after the remote-sensing image is input into the base extraction network.
The base extraction network shown in FIG. 3 may be built on a target-detection network. In some implementations, the target-detection network may be built on any of RCNN (Region Convolutional Neural Network), Fast-RCNN (Fast Region Convolutional Neural Network), Faster-RCNN (Faster Region Convolutional Neural Network), or Mask-RCNN (Mask Region Convolutional Neural Network).
In some implementations, to improve base-area extraction accuracy, Mask-RCNN, which represents regions more precisely, may be used. Mask-RCNN may include an RPN (Region Proposal Network), an RoI Align (Region of Interest Align) unit, and other components.
The RPN generates candidate boxes corresponding to the buildings in a captured image. After regression and classification of the candidate boxes, a bounding box is obtained for each building. The RoI Align unit extracts the visual features corresponding to a building from the captured image according to its bounding box. These features can then be used to extract the base area, roof area, offset, and roof position, as the functional requirements of the target-detection network dictate.
After acquiring the training sample set, for the direct-extraction approach, the neural network training method may include: when executing S204, the device inputs each captured image in the training sample set into the base extraction network for base extraction, obtaining the base area corresponding to each captured image.
Then, when executing S206, a preset loss function can be applied to the base-area ground-truth information annotated on each captured image and the base area obtained for it, yielding base-area loss information for each image. Backpropagation can then be used to compute descent gradients and adjust the network parameters of the base extraction network.
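The disclosure does not fix the "preset loss function"; a common choice for a pixel-level base mask is per-pixel binary cross-entropy, sketched below (NumPy; the 2×2 maps are illustrative):

```python
import numpy as np

def base_area_bce_loss(pred_prob, gt_mask, eps=1e-7):
    """Per-pixel binary cross-entropy between a predicted base-probability map
    and the pixel-level base-area ground truth (one possible instantiation of
    the preset loss function; not mandated by the disclosure)."""
    p = np.clip(pred_prob, eps, 1 - eps)  # avoid log(0)
    return float(np.mean(-(gt_mask * np.log(p) + (1 - gt_mask) * np.log(1 - p))))

gt = np.array([[1.0, 0.0], [0.0, 1.0]])
perfect = base_area_bce_loss(gt, gt)               # near-zero loss
uniform = base_area_bce_loss(np.full((2, 2), 0.5), gt)  # ln(2) for a 0.5 map
```

In a real training loop this scalar would be backpropagated through the network to update its parameters.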
After multiple rounds of training are performed in this way, network training is complete and the trained building-base extraction network is obtained.
In this scheme, the training sample set, expanded by exploiting the fact that the base area of a given building does not change, is used to train the building-base extraction network, which helps train a high-precision network from a small number of annotated samples.
(2) Indirect extraction of the building base.
Referring to FIG. 4, FIG. 4 is a schematic diagram of a building-base-area extraction process shown in the present disclosure.
As shown in FIG. 4, after the remote-sensing image is input into the base extraction network, the building's roof area and an offset indicating the displacement between roof and base are obtained first. The offset can then be used to transform the roof area (for example, by translation) to obtain the base area.
The base extraction network shown in FIG. 4 may include a roof-area extraction network and an offset extraction network, both of which may be built on a target-detection network. The target-detection network may be any of RCNN, Fast-RCNN, Faster-RCNN, or Mask-RCNN. In some implementations, to improve base-area extraction accuracy, Mask-RCNN, which represents regions more precisely, may be used.
In some implementations, the roof-area extraction network and the offset extraction network may share a feature extraction network, which may include a backbone network, a region-feature extraction unit, and so on; this simplifies the network structure and facilitates training. When the roof-area extraction network and the offset extraction network are Mask-RCNN, the two networks may also share the RPN, the RoI Align unit, and so on.
Referring to FIG. 5, FIG. 5 is a schematic flowchart of a neural network training method shown in the present disclosure.
As shown in FIG. 5, after acquiring the training sample set, for the indirect-extraction approach, the neural network training method may include:
S502: Use the building-base extraction network to obtain the roof area and offset corresponding to each captured image in the training sample set, where the offset represents the displacement between the roof area and the base area.
In some implementations, the roof-area extraction network and offset extraction network included in the building-base extraction network may be used to extract, respectively, the roof area and offset corresponding to each captured image.
S504: For each captured image, translate the roof area corresponding to the captured image by the obtained offset, obtaining the base area corresponding to the captured image.
In some implementations, a translation operation may be performed on each pixel contained in the roof area to obtain the base area.
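The per-pixel translation of S504 can be sketched as follows (NumPy; the 6×6 mask size, the 2×2 roof, and the (dx, dy) = (columns, rows) convention are illustrative assumptions):

```python
import numpy as np

def translate_mask(roof_mask, offset):
    """Translate every pixel of a binary roof mask by (dx, dy) to obtain the
    base mask. Pixels shifted outside the array bounds are dropped."""
    dx, dy = offset
    h, w = roof_mask.shape
    base = np.zeros_like(roof_mask)
    ys, xs = np.nonzero(roof_mask)          # coordinates of roof pixels
    ys2, xs2 = ys + dy, xs + dx             # translate each pixel
    keep = (ys2 >= 0) & (ys2 < h) & (xs2 >= 0) & (xs2 < w)
    base[ys2[keep], xs2[keep]] = 1
    return base

roof = np.zeros((6, 6), dtype=np.uint8)
roof[1:3, 1:3] = 1                          # a 2x2 roof region
base = translate_mask(roof, (2, 1))         # shift right by 2, down by 1
```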
S506: Adjust the network parameters of the building-base extraction network based on the base-area ground-truth information of each captured image and the base area obtained for each captured image.
In some implementations, a preset loss function can be applied to the base-area ground-truth information annotated on each captured image and the base area obtained for it, yielding base-area loss information for each image. Backpropagation can then be used to compute descent gradients and adjust the network parameters of the base extraction network.
After multiple rounds of training are performed in this way, network training is complete and the trained building-base extraction network is obtained.
In this scheme, on the one hand, the base area is obtained indirectly by first extracting the building's roof area and offset and then transforming the roof area by the offset; because roof-area and offset features are salient in captured images, this improves base-extraction accuracy, and a high-precision base can be obtained even when the building base is occluded. On the other hand, the training sample set expanded by exploiting the invariance of a building's base area helps train a high-precision building-base extraction network from a small number of annotated samples.
In some implementations, because the shape and position of a building's base do not change, base-area ground-truth information and base-position ground-truth information can be shared among the multiple frames of captured images corresponding to the same region, expanding the training samples and thus helping train a high-precision building-base extraction network from a small number of annotated samples.
Referring to FIG. 6, FIG. 6 is a flowchart of a neural network training method shown in the present disclosure. As shown in FIG. 6, the method may include:
S604: Annotate the target captured image corresponding to each region with base-position ground-truth information.
In some implementations, the base-position ground-truth information may be annotated in advance. It may include the coordinates of the center pixel of the base area and the width and height of the base area, and may be represented as R = (cx, cy, w, h), where cx and cy are the horizontal and vertical coordinates of the base area's center pixel and w and h are the base area's width and height.
S606: For each region, take the base-position ground-truth information annotated on the region's target captured image as the base-position ground-truth information of every frame of captured images corresponding to the region.
In some implementations, the base-position ground-truth information annotated in S604 for each region's target captured image can be used as the ground-truth information of every captured image of that region, thereby expanding the training samples. Each captured image in the resulting training sample set is thus annotated with both base-area ground truth and base-position ground truth.
In some implementations, neural network training may be performed on the resulting training sample set.
Referring to FIG. 7, FIG. 7 is a schematic flowchart of a neural network training method disclosed herein.
As shown in FIG. 7, the method may include S702-S708.
S702: Acquire the training sample set.
S704: Use the roof-area extraction network, offset extraction network, and roof-position extraction network included in the building-base extraction network to obtain the roof area, offset, and roof position corresponding to each captured image in the training sample set, where the offset represents the displacement between the roof area and the base area.
S706: Adjust the network parameters of the roof-area extraction network based on the base-area ground-truth information of each captured image and the roof area and offset obtained for each captured image.
S708: Adjust the network parameters of the roof-position extraction network and the offset extraction network based on the base-position ground-truth information of each captured image and the roof position and offset obtained for each captured image.
S706 and S708 need not be executed in a strict order; for example, they may be executed in parallel. The present disclosure does not specifically limit their execution order.
The neural network training method can be applied to an electronic device.
In some implementations, the device may execute S702 in response to a network-training request, acquiring the training sample set from a storage medium.
The device may then execute S704-S708.
The building-base extraction network (hereinafter, the base extraction network) may be built on a target-detection network. In some implementations, to improve base-area extraction accuracy, Mask-RCNN, which represents regions more precisely, may be used as the target-detection network.
The base extraction network may include a roof-area extraction network, used to extract building roof areas; an offset extraction network, used to extract the offset between roof and base; and a roof-position extraction network, used to extract roof positions. The offset can then be used to transform (for example, translate) the roof area to obtain the base area, and likewise to translate the roof position to obtain the base position.
Referring to FIG. 8, FIG. 8 is a schematic diagram of a building-base extraction network training process shown in the present disclosure.
The base extraction network shown in FIG. 8 includes a roof-area extraction network, an offset extraction network, and a roof-position extraction network. The roof area and offset extracted by the first two networks can be combined by translation to obtain the base area.
For training, the network can be extended with a base-area loss-determination branch and a base-position loss-determination branch, so that network parameters are updated according to the determined loss information. The base-area loss information represents the error between the obtained base area and the base-area ground truth; the base-position loss information represents the error between the obtained base position and the base-position ground truth.
In some implementations, executing S706 may comprise: S7062, for each frame among the captured images, translating the image's base-area ground-truth information by the image's obtained offset to obtain first roof-area ground-truth information for the image; S7064, obtaining roof-area loss information for the image from the first roof-area ground-truth information and the roof area obtained for the image; and S7066, adjusting the network parameters of the roof-area extraction network by backpropagation based on the roof-area loss information of the captured images.
In the approach of S502-S506 above, the extracted roof area must be translated by the offset to obtain the base area, and the base-area loss information is then computed from the base-area ground truth.
However, the extracted roof area usually has a preset size, for example 14×14. If the predicted offset is too large, translating the roof area may move pixels of the roof region outside the preset-size matrix, losing information; accurate base-area loss information then cannot be obtained and the network cannot converge.
In the scheme of S7062-S7066, by contrast, the base-area ground truth is pixel-level: each pixel of the captured image is labeled 0 or 1, where pixels labeled 1 are considered inside the base area and pixels labeled 0 outside it. When the base-area ground truth is translated, however large the extracted offset, the ground truth will with high probability remain inside the corresponding captured image, so no ground-truth information is lost; that is, the first roof-area ground truth obtained in S7062 will not lack any actual roof-area ground truth. S7064 can then compute accurate roof-area loss information from the first roof-area ground truth and the roof area, ensuring that the network converges smoothly.
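The contrast between the two schemes can be illustrated with a toy example (NumPy; the 4×4 crop standing in for the 14×14 roof prediction and the 16×16 full image are illustrative sizes):

```python
import numpy as np

def shift_mask(mask, dx, dy):
    """Shift a binary mask, dropping pixels that leave the array bounds."""
    h, w = mask.shape
    out = np.zeros_like(mask)
    ys, xs = np.nonzero(mask)
    ys2, xs2 = ys + dy, xs + dx
    keep = (ys2 >= 0) & (ys2 < h) & (xs2 >= 0) & (xs2 < w)
    out[ys2[keep], xs2[keep]] = 1
    return out

# Fixed-size crop: a large predicted offset pushes the whole roof outside
# the crop, so the shifted mask is empty and the loss becomes meaningless.
crop = np.zeros((4, 4), dtype=np.uint8)
crop[1:3, 1:3] = 1
lost = shift_mask(crop, 3, 0)

# Pixel-level ground truth in the full image (S7062): the same shift keeps
# every labeled pixel inside the image, so no information is lost.
full = np.zeros((16, 16), dtype=np.uint8)
full[1:3, 1:3] = 1
kept = shift_mask(full, 3, 0)
```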
After the roof-area loss information is obtained, the network parameters of the roof-area extraction network can be adjusted by computing descent gradients and backpropagating, completing the training of the roof-area extraction network.
In some implementations, executing S708 may comprise: S7082, for each frame among the captured images, translating the image's roof position by the image's offset to obtain the image's base position; S7084, obtaining base-position loss information for the image from the base-position ground truth corresponding to the image and the obtained base position; and S7086, adjusting the network parameters of the roof-position extraction network and the offset extraction network by backpropagation based on the base-position loss information of the captured images.
In some implementations, the extracted roof position may be represented as R0 = (cx0, cy0, w0, h0), where cx0 and cy0 are the horizontal and vertical coordinates of the roof area's center pixel and w0 and h0 are the roof area's width and height, and the extracted offset as O0 = (Δx, Δy), where Δx and Δy are the pixel displacements along the X and Y axes. The base position is then obtained as F0 = (cx0 + Δx, cy0 + Δy, w0, h0). A preset loss function (for example, a cross-entropy loss) applied to the base-position ground truth and the base position yields the base-position loss information.
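The computation F0 = (cx0 + Δx, cy0 + Δy, w0, h0) is a one-line translation of the box center; a sketch (the numeric values are made up for illustration):

```python
def base_position(roof_pos, offset):
    """Translate the roof position R0 = (cx0, cy0, w0, h0) by the offset
    O0 = (dx, dy) to obtain the base position F0; the box width and height
    are unchanged, since only the center moves."""
    cx0, cy0, w0, h0 = roof_pos
    dx, dy = offset
    return (cx0 + dx, cy0 + dy, w0, h0)

pos = base_position((40.0, 30.0, 14.0, 10.0), (-3.0, 5.0))
```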
After the base-position loss information is obtained, descent gradients can be computed and the network parameters updated by backpropagation. Because extracting the base position uses the roof position and offset produced by the roof-position extraction network and the offset extraction network, both networks are updated during backpropagation.
In this embodiment, the expanded training sample set can be used to train the roof-area extraction network, roof-position extraction network, and offset extraction network, completing the training of the base extraction network and yielding a high-precision base extraction network.
In some implementations, the roof-area extraction network, offset extraction network, and roof-position extraction network may share feature extraction components such as a backbone network and a region-feature extraction unit, simplifying the network structure and facilitating training. In some implementations, when the roof-area extraction network and offset extraction network are Mask-RCNN, the three networks may also share the RPN (Region Proposal Network), the RoI Align (Region of Interest Align) unit, and so on.
Adjusting the parameters of the three extraction networks thus also adjusts the shared feature extraction network, so the training processes both constrain and reinforce one another. This improves training efficiency and, at the same time, drives the shared feature extraction network toward features that are more useful for base-area extraction, improving base-area extraction accuracy.
In some implementations, joint training can improve both training efficiency and prediction accuracy.
At least some captured images in the training sample set may additionally be annotated with at least one of the following: second roof-area ground-truth information, a real offset, and roof-position ground-truth information.
In some implementations, the roof-area ground truth, real offset, and roof-position ground truth may be annotated manually.
在通过训练样本集训练网络时,还可以包括如下至少一项:When training the network through the training sample set, at least one of the following may also be included:
S802,基于所述至少部分采集图像标注的第二屋顶区域真值信息以及针对所述至少部分采集图像获得的屋顶区域,调整所述屋顶区域提取网络的网络参数。S802. Adjust network parameters of the roof region extraction network based on the ground truth information of the second roof region marked on the at least part of the captured image and the roof region obtained for the at least part of the collected image.
S804,基于所述至少部分采集图像标注的真实偏移量以及针对所述至少部分采集图像获得的偏移量,调整所述偏移量提取网络的网络参数。S804. Adjust network parameters of the offset extraction network based on the real offset marked by the at least part of the captured image and the offset obtained for the at least part of the captured image.
S806,基于所述至少部分采集图像标注的屋顶位置真值信息以及针对所述至少部分采集图像获得的屋顶位置,调整所述屋顶位置提取网络的网络参数。S806. Adjust network parameters of the roof position extraction network based on the roof position ground truth information marked on the at least part of the collected images and the roof position obtained for the at least part of the collected images.
In some implementations, when performing S802, a preset loss function (for example, a cross-entropy loss function) may be used to obtain loss information from the second roof region ground-truth information and the obtained roof region. A gradient is then computed from the loss information and back-propagated to adjust the network parameters of the roof region extraction network.
When performing S804, a preset loss function (for example, an MSE (Mean Square Error) loss function) may be used to obtain loss information from the real offset and the obtained offset. A gradient is then computed from the loss information and back-propagated to update the network parameters of the offset extraction network.
When performing S806, a preset loss function (for example, a Smooth L1 loss function) may be used to obtain loss information from the roof position ground-truth information and the obtained roof position. A gradient is then computed from the loss information and back-propagated to update the network parameters of the roof position extraction network.
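As an illustrative sketch only (not part of the claimed method), the three exemplified losses can be written as follows. The function names and toy inputs are hypothetical; a real implementation would operate on tensors with automatic differentiation rather than on Python lists.

```python
import math

def mse_loss(pred, true):
    """Mean squared error, as exemplified for the offset branch (S804)."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)

def smooth_l1_loss(pred, true, beta=1.0):
    """Smooth L1, as exemplified for the roof position branch (S806)."""
    total = 0.0
    for p, t in zip(pred, true):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)

def bce_loss(pred_probs, true_mask, eps=1e-7):
    """Per-pixel binary cross-entropy, as exemplified for the roof region branch (S802)."""
    total = 0.0
    for p, t in zip(pred_probs, true_mask):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(pred_probs)
```

Each branch's loss would then be back-propagated through that branch's parameters, as described above.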
In this example, the roof region extraction network, the roof position extraction network, and the offset extraction network, which share a feature extraction network, are trained jointly. This introduces learning signals from multiple tasks, which both constrain and reinforce one another. On the one hand, this improves network training efficiency; on the other hand, it drives the shared feature extraction network to extract features that are more useful for base region extraction, thereby improving the accuracy of base region extraction.
Referring again to Figure 4: when training the base extraction network shown in Figure 4, sample annotation is expensive, so a large number of annotated samples carrying real offsets cannot be obtained, and a small number of annotated samples is not sufficient to train a high-precision base extraction network.
In some implementations, the captured images in the training sample set are further annotated with a first real offset; the first real offset indicates the real offset between the roof and the base in the captured image.
When training the offset extraction network with the training sample set, S402 may be performed: using the offset extraction network, obtain from multiple rotated images the second predicted offsets corresponding to multiple preset angles. A second predicted offset indicates the offset between the roof and the base in a rotated image; the rotated images are obtained by rotating the captured image by the respective preset angles.
The captured image may be a remote sensing image annotated with the first real offset. In the embodiments of the present disclosure, the offset refers to the offset between the roof and the base in the remote sensing image. For example, if the roof consists of 10 pixels, translating those 10 pixels by the offset yields the base.
The first real offset may be information indicating the real offset between the building roof and base in the captured image, for example a vector (x, y), where x and y represent the offsets along the x-axis and y-axis between a pixel in the roof region and the corresponding pixel in the base region. In some implementations, the offset may be annotated in advance according to the real offset between the building roof and base in the captured image. The present disclosure does not specifically limit the annotation method of the offset.
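For illustration (a sketch with invented pixel coordinates), applying an annotated (x, y) offset amounts to translating each roof pixel by that vector:

```python
def translate_region(pixels, offset):
    """Shift each (x, y) roof pixel by the offset to obtain the base region."""
    dx, dy = offset
    return [(x + dx, y + dy) for x, y in pixels]

roof = [(10, 10), (10, 11), (11, 10)]        # hypothetical roof pixels
base = translate_region(roof, (-3, 4))       # hypothetical annotated offset (x, y)
```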
The preset angles may be set according to business requirements, and their number may be determined by the number of samples to be generated; for example, if many additional samples are needed, many preset angles may be set. The present disclosure does not specifically limit the values or the number of the preset angles. The preset angles are used to rotate the captured image or the image features corresponding to the captured image.
In some implementations, when performing S402, a rotation matrix may first be generated for each preset angle. For each preset angle, the corresponding rotation matrix is used to shift every pixel of the captured image, producing the rotated captured image, i.e., the rotated image. The rotated image may then be fed into the offset extraction network to extract the second predicted offset corresponding to that preset angle, thereby obtaining the second predicted offsets corresponding to all preset angles. Note that, in some implementations, instead of rotating the image itself, the feature extraction network included in the offset extraction network may first extract features from the captured image to obtain a first image feature, and that feature may then be rotated. This reduces the computation of the rotation step and avoids the rotation error introduced when extracting features from a rotated image, helping to improve the training result.
S404 may then be performed: rotating the first real offset by each of the preset angles to obtain the second real offsets corresponding to the preset angles.
In some implementations, when performing S404, the rotation matrix corresponding to each preset angle may be applied to the first real offset of the captured image, yielding the second real offset that corresponds to rotating the captured image by that preset angle.
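A minimal sketch of S404 follows, assuming a standard 2D rotation matrix and a counter-clockwise angle convention (the text does not fix the convention, so these are illustrative assumptions):

```python
import math

def rotate_offset(offset, angle_deg):
    """Rotate an (x, y) offset vector by a preset angle (counter-clockwise)."""
    theta = math.radians(angle_deg)
    x, y = offset
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

# Second real offsets corresponding to several hypothetical preset angles.
first_real = (3.0, 0.0)
second_real = {a: rotate_offset(first_real, a) for a in (90, 180, 270)}
```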
S406 may then be performed: adjusting the network parameters of the offset extraction network based on the second real offsets corresponding to the preset angles and the obtained second predicted offsets.
In some implementations, when performing S406, a preset loss function (for example, a cross-entropy loss function) may be used: for each preset angle, the offset loss information for that angle is obtained from the second real offset (the first real offset rotated by that angle) and the second predicted offset obtained for that angle. Then, based on the per-angle offset loss information, the total loss is determined by, for example, summation, multiplication, or averaging; the descent gradient is computed from the total loss, and the network parameters of the offset extraction network are adjusted by back-propagation.
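A minimal sketch of the S406 aggregation follows. The function names are hypothetical, an MSE per-angle loss is assumed for illustration (the loss choice is only exemplified in the text), and summation is used as the reduction, which is one of the options named above:

```python
def angle_loss(pred, true):
    """Per-angle offset loss (MSE between predicted and rotated real offset)."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)

def total_offset_loss(second_pred, second_real, reduce=sum):
    """Aggregate the per-angle offset losses into a total loss."""
    per_angle = [angle_loss(second_pred[a], second_real[a]) for a in second_pred]
    return reduce(per_angle)

# Hypothetical predictions and rotated ground truths for two preset angles.
pred = {90: (0.1, 2.9), 180: (-3.1, 0.0)}
real = {90: (0.0, 3.0), 180: (-3.0, 0.0)}
loss = total_offset_loss(pred, real)
```

The gradient of this total loss would then be back-propagated through the offset extraction network.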
In this scheme, the offset extraction network is used to obtain the second predicted offsets corresponding to the preset angles, the first real offset is rotated by each preset angle to obtain the corresponding second real offsets, and the network parameters of the offset extraction network are then adjusted using the second real offsets and the obtained second predicted offsets.
This exploits the property that when an image is rotated by some angle, its offset rotates by the same angle. Rotating the image (or its image features) together with the real offset effectively enlarges the set of image samples carrying a real offset, so a high-precision offset extraction network can be trained from a small amount of offset-annotated data.
However, rotating the captured image and the real offset also rotates all other information contained in the captured image. If the rotated captured images were used to train the base extraction network, its other branches would have to fit this rotated information, increasing training time and reducing training efficiency.
In some implementations, the image rotation step can therefore be placed inside the offset extraction network. The captured image is then rotated only within the offset extraction network, so the training, and hence the convergence speed, of the other branches is unaffected, improving network training efficiency.
When performing S402, for each of the preset angles, S4022 may be performed: using the offset extraction network, rotate the first image feature corresponding to the captured image by the preset angle to obtain the second image feature corresponding to that angle. S4024 may then be performed: obtain the second predicted offset corresponding to that angle based on the second image feature.
The first image feature may refer to the image feature obtained after the captured image passes through several convolutional layers, pooling layers, and other feature extraction stages. In some implementations, the offset extraction network may be a network built on Mask R-CNN, which can extract the first image feature from the captured image through its backbone network and RoI Align unit. In some implementations, the aforementioned image features may be represented as feature maps.
In some implementations, when performing S4022, the rotation matrices corresponding to the preset angles may be used to transform the position of each point in the first image feature, yielding the second image features corresponding to the preset angles. When performing S4024, the second image feature may then be processed by, for example, several convolutional layers, pooling layers, fully connected layers, and a mapping unit (for example, softmax) to obtain the offset extraction result, namely the second predicted offset.
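The position transform in S4022 can be sketched as follows, assuming rotation about the feature-map centre and nearest-neighbour sampling (the text does not fix these details, and the single-channel list-of-lists representation is a simplification of a real feature map):

```python
import math

def rotate_feature_map(fmap, angle_deg):
    """Rotate a 2D feature map about its centre (nearest-neighbour sampling)."""
    h, w = len(fmap), len(fmap[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Inverse rotation: find the source position of this output cell.
            sx = cos_t * (j - cx) + sin_t * (i - cy) + cx
            sy = -sin_t * (j - cx) + cos_t * (i - cy) + cy
            si, sj = round(sy), round(sx)
            if 0 <= si < h and 0 <= sj < w:
                out[i][j] = fmap[si][sj]
    return out
```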
When training the network shown in Figure 4, the captured image is rotated only inside the offset extraction network, while the roof region extraction network is still trained on the unrotated captured image. The rotation of the captured image thus stays internal to the offset extraction network and does not affect the training of the other branches.
In some implementations, to facilitate training of the offset extraction network, a spatial transformer network may be used to perform the image rotation. This makes the rotation step differentiable, so gradients can back-propagate normally and the network can be trained directly.
In some implementations, building bounding-box information may also be introduced during network training as a constraint, which improves training efficiency and helps the feature extraction network extract building-related features.
The at least some captured images in the training sample set are also annotated with building bounding-box ground-truth information. The building bounding-box information may include the coordinates of the center pixel of the building region and the width and height of the building region.
When training the base extraction network, a building bounding-box extraction network included in the building base extraction network may be used to extract the building bounding boxes corresponding to the at least some captured images, where the building bounding-box extraction network includes the feature extraction network. The network parameters of the building bounding-box extraction network may then be adjusted based on the building bounding-box ground-truth information annotated on the at least some captured images and the building bounding boxes obtained for them.
Building bounding-box information is thereby introduced into network training. Because the four extraction networks for the roof region, roof position, offset, and building bounding box share the feature extraction network, on the one hand the four extraction networks become interrelated and can share the supervision signal of each task, accelerating convergence of the network; on the other hand, the three extraction networks for roof region, roof position, and offset are exposed to complete building region features, improving extraction performance.
In some implementations, network training efficiency may be improved through pre-training.
In some implementations, the building base extraction network may be pre-trained using the captured images in the training sample set that are annotated with second roof region ground-truth information, real offsets, and roof position ground-truth information.
The pre-training process may follow the network training process of any of the foregoing implementations. In some implementations, joint training may also be adopted during pre-training for the best pre-training result. At least some captured images in the training sample set may carry ground-truth information for six items: roof region, roof position, base region, base position, offset, and building bounding box. The base extraction network may include six extraction networks, for the roof region, roof position, offset, building bounding box, base region loss information, and base position loss information respectively, that share the feature extraction network; the six extraction networks serve as six branches of the base extraction network. In some implementations, since the shapes of a building's roof and base are essentially the same, the base region loss information may be equivalently expressed as roof region loss information.
During pre-training, the at least some captured images of the training sample set may be fed into the base extraction network to obtain the outputs of the six branches. Loss information is then obtained from those outputs and the aforementioned six items of ground-truth information annotated on the images, and the network parameters are updated accordingly. The six branches are thereby trained jointly, improving the training efficiency and training result of the base extraction network.
In some implementations, after pre-training is completed, the annotated captured images and the unannotated images in the training sample set may be fed into the network in random order for training. Here, an annotated captured image may refer to one of the at least some captured images annotated with the aforementioned six items of ground-truth information.
A reasonable network training scheme can thus be proposed: first systematically pre-train the network by joint training on annotated captured images rich in ground-truth information, then fine-tune the network parameters of the base extraction network on a mixture of annotated captured images and unannotated images. On the one hand, this helps train a high-precision base extraction network from a small number of annotated captured images; on the other hand, it improves network training efficiency.
An embodiment is described below with reference to a specific training scenario.
Referring to Figure 9, which is a schematic diagram of a training process of a building base extraction network according to the present disclosure. The training method in this example can be deployed on any type of electronic device.
The base extraction network shown in Figure 9 is built on Mask R-CNN. The network may include six branches that respectively extract the roof region, roof position, offset, building bounding box, base region loss information, and base position loss information. The six branches share a backbone network, an RPN candidate-box generation network (hereinafter RPN), and an RoI Align region feature extraction unit (hereinafter RoI Align). The backbone network may be a VGG (Visual Geometry Group) network, a ResNet (Residual Network), an HRNet (High-Resolution Network), or the like, which is not specifically limited in the present disclosure.
Before training the network, multiple groups of multi-temporal images (already registered) covering multiple regions may be acquired. At least one frame may then be selected from each group of multi-temporal images for manual annotation, yielding a small number of annotated images, namely the annotated captured images described above. An annotated image may carry ground-truth information for the six items: roof region, roof position, base region, base position, offset, and building bounding box. It can be understood that, since the shape and position of the base region of a given building do not change across the multi-temporal images, the unannotated images in a group can share the base region and base position ground truth of the annotated image.
During network training, the annotated images may first be used to pre-train the base extraction network by joint training.
In the pre-training process, multiple rounds of the following steps may be performed according to the number of pre-training iterations:
Feed each annotated image into the base extraction network to obtain the roof region, roof position, offset, and building bounding box corresponding to that image.
Then, from the ground-truth information of the four items (roof region, roof position, offset, and building bounding box) of each annotated image, obtain the loss information of the four corresponding branches, and update the network parameters of the four branches by back-propagation.
In addition, based on the ground-truth information of five items (roof region, roof position, offset, base region, and base position) of each annotated image, use the base region and base position loss determination branches to obtain the base region and base position loss information, and adjust, by back-propagation, the network parameters of the three branches of the roof region, roof position, and offset extraction networks.
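The pre-training update above combines losses from several branches. A minimal sketch of such a combination follows; the branch names, the numeric values, and the equal default weighting are illustrative assumptions, since the text does not specify how the per-branch losses are combined:

```python
def joint_loss(branch_losses, weights=None):
    """Weighted sum of per-branch losses for joint training."""
    if weights is None:
        weights = {name: 1.0 for name in branch_losses}  # assumed equal weights
    return sum(weights[name] * loss for name, loss in branch_losses.items())

# Hypothetical per-branch losses for one annotated image.
losses = {"roof_region": 0.7, "roof_position": 0.3, "offset": 0.2,
          "building_bbox": 0.4, "base_region": 0.5, "base_position": 0.1}
total = joint_loss(losses)
```

The gradient of the combined loss would then be back-propagated through the shared feature extraction network and the relevant branches.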
In the pre-training process, because joint training is adopted, learning signals from multiple tasks are introduced; the tasks both constrain and reinforce one another, which improves network training efficiency, so that a network with fairly good extraction performance can be obtained from only a small number of annotated images.
After pre-training is completed, the annotated images and unannotated images may be mixed and fed into the base extraction network in random order for training.
If an annotated image is fed into the network, joint training as in the pre-training process may be performed.
If an unannotated image is fed into the network, the base extraction network may be used to obtain the roof region, roof position, and offset corresponding to that image. Then, using the base region and base position loss determination branches together with the shared base region and base position ground truth, the base region and base position loss information is obtained, and the network parameters of the three branches of the roof region, roof position, and offset extraction networks are updated by back-propagation.
By fine-tuning the parameters of the pre-trained network with annotated and unannotated images in this way, a high-precision base extraction network can be obtained.
This scheme of joint training, with pre-training followed by mixed training: first, improves network training efficiency, so that a network with good extraction performance can be obtained from a small number of annotated images, reducing dependence on annotation work; second, drives the shared feature extraction network (including the backbone network and the region feature extraction unit) to extract features more useful for base region extraction, improving the accuracy of base region extraction; third, exposes the three branches of the roof region, roof position, and offset extraction networks to complete building region features, improving branch extraction performance.
After the trained building base extraction network is obtained through the above implementations, building bases can be extracted from a remote sensing image to be processed through the network. A specific implementation process may include:
receiving a remote sensing image to be processed;
using a building base extraction network to extract the building roof region and the offset in the remote sensing image to be processed, where the building base extraction network is trained by the neural network training method of any of the foregoing implementations, and the offset characterizes the offset between the roof region and the base region; and
performing a translation transformation on the roof region using the offset to obtain the building base region corresponding to the remote sensing image to be processed.
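A minimal sketch of this final step follows, assuming a binary roof mask and an integer (dx, dy) pixel offset (both simplifications; a real pipeline would likely use sub-pixel offsets and polygon representations):

```python
def shift_mask(mask, offset):
    """Translate a binary roof mask by an integer (dx, dy) offset to get the base mask."""
    dx, dy = offset
    h, w = len(mask), len(mask[0])
    base = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                nx, ny = x + dx, y + dy
                if 0 <= nx < w and 0 <= ny < h:  # drop pixels shifted out of frame
                    base[ny][nx] = 1
    return base

roof_mask = [
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 0],
]
base_mask = shift_mask(roof_mask, (-1, 1))  # hypothetical predicted offset
```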
Here, the remote sensing image to be processed may be a remote sensing image collected by any collection device capable of capturing images of buildings. In some implementations, the trained building base extraction network may be the network shown in Figure 9.
Thus, on the one hand, a high-precision building base extraction network can be trained from a small number of annotated samples, reducing network training cost, improving network training efficiency, and in turn reducing the cost of base extraction. On the other hand, the high-precision base extraction network can be used for base extraction, improving the accuracy of building base extraction and hence the accuracy of building statistics.
Corresponding to any of the foregoing implementations, the present disclosure further proposes a neural network training apparatus 100.
Referring to Figure 10, which is a schematic structural diagram of a neural network training apparatus according to the present disclosure.
As shown in Figure 10, the apparatus 100 may include:
an acquisition module 101, configured to acquire, for each of multiple regions, one or more frames of captured images corresponding to the region, where, in the case that the region corresponds to multiple frames of captured images, at least two of those frames have different capture angles;
a first annotation module 102, configured to take one frame of the captured images corresponding to the region as the target captured image corresponding to the region and annotate it with base region ground-truth information; and
a first determination module 103, configured to determine the base region ground-truth information annotated on the target captured image corresponding to the region as the base region ground-truth information of each frame of captured image corresponding to the region, and to obtain, based on the captured images and the target captured images corresponding to the multiple regions, a training sample set for neural network training based on the training sample set.
In some illustrated implementations, the apparatus 100 further includes:
a first training module 106, configured to: acquire the training sample set;
obtain, using a building base extraction network, the roof region and offset corresponding to each captured image in the training sample set, where the offset characterizes the offset between the roof region and the base region;
for each captured image, perform a translation transformation on the roof region corresponding to the captured image based on the obtained offset corresponding to the captured image, to obtain the base region corresponding to the captured image; and
adjust the network parameters of the building base extraction network based on the base region ground-truth information corresponding to each captured image and the base region obtained for each captured image.
In some illustrated implementations, the apparatus 100 further includes:
a second annotation module 104, configured to annotate the target captured image corresponding to each region with base position ground-truth information; and
a second determination module 105, configured to determine, for each region, the base position ground-truth information annotated on the target captured image corresponding to the region as the base position ground-truth information of each frame of captured image corresponding to the region.
In some illustrated implementations, the apparatus 100 further includes:
a second training module 107, configured to: acquire the training sample set;
obtain, using the roof region extraction network, offset extraction network, and roof position extraction network included in a building base extraction network, the roof region, offset, and roof position corresponding to each captured image in the training sample set, where the offset characterizes the offset between the roof region and the base region;
adjust the network parameters of the roof region extraction network based on the base region ground-truth information corresponding to each captured image and the roof region and offset obtained for each captured image; and
adjust the network parameters of the roof position extraction network and the offset extraction network based on the base position ground-truth information corresponding to each captured image and the roof position and offset obtained for each captured image.
在示出的一些实现方式中,所述第二训练模块107,用于:In some implementations shown, the second training module 107 is used to:
针对所述各采集图像中的每帧图像,利用所述图像对应的偏移量,对所述图像对应的底座区域真值信息进行平移,得到所述图像对应的第一屋顶区域真值信息;For each frame of image in the collected images, using the offset corresponding to the image, the ground truth information of the base area corresponding to the image is translated to obtain the ground truth information of the first roof area corresponding to the image;
基于所述图像对应的所述第一屋顶区域真值信息与针对所述图像获得的屋顶区域,得到所述图像对应的屋顶区域损失信息;Obtaining roof area loss information corresponding to the image based on the ground truth information of the first roof area corresponding to the image and the roof area obtained for the image;
基于所述各采集图像分别对应的屋顶区域损失信息,通过反向传播调整所述屋顶区域提取网络的网络参数。Based on the roof area loss information corresponding to each of the collected images, the network parameters of the roof area extraction network are adjusted through back propagation.
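The loss construction above, which translates the base-area ground truth by the offset to obtain the first roof-area ground truth and compares it with the predicted roof area, can be sketched as follows. The sketch assumes binary masks and an integer pixel offset, and uses binary cross-entropy purely as an example loss; none of these choices are mandated by the disclosure.

```python
import numpy as np

def shift_mask(mask, offset):
    """Translate a binary mask by integer (dy, dx); shifted-in pixels are zero."""
    dy, dx = offset
    shifted = np.roll(mask, shift=(dy, dx), axis=(0, 1))
    # zero out rows/columns that wrapped around during the roll
    if dy > 0: shifted[:dy, :] = 0
    elif dy < 0: shifted[dy:, :] = 0
    if dx > 0: shifted[:, :dx] = 0
    elif dx < 0: shifted[:, dx:] = 0
    return shifted

def roof_area_loss(base_truth, pred_roof, offset, eps=1e-7):
    """BCE between the translated base ground truth (serving as the first
    roof-area ground truth) and the predicted roof probability map."""
    roof_truth = shift_mask(base_truth, offset)
    p = np.clip(pred_roof, eps, 1 - eps)
    return float(-np.mean(roof_truth * np.log(p) + (1 - roof_truth) * np.log(1 - p)))

base = np.zeros((8, 8)); base[4:6, 4:6] = 1    # annotated base footprint
pred = np.zeros((8, 8)); pred[2:4, 3:5] = 0.9  # predicted roof probabilities
loss = roof_area_loss(base, pred, offset=(-2, -1))
```

A correct offset aligns the translated base truth with the predicted roof, so the loss is small; a wrong offset leaves them misaligned and the loss large, which is what drives the adjustment by back propagation.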
在示出的一些实现方式中,所述第二训练模块107,用于:In some implementations shown, the second training module 107 is used to:
针对所述各采集图像中的每帧图像,利用所述图像对应的偏移量,对所述图像对应的屋顶位置进行平移,获得所述图像对应的底座位置;For each frame of image in each of the collected images, using the offset corresponding to the image, the position of the roof corresponding to the image is translated to obtain the position of the base corresponding to the image;
基于所述图像对应的底座位置真值信息以及针对所述图像获得的底座位置,得到所述图像对应的底座位置损失信息;Obtaining base position loss information corresponding to the image based on the base position truth information corresponding to the image and the base position obtained for the image;
基于所述各采集图像分别对应的底座位置损失信息,通过反向传播调整所述屋顶位置提取网络和所述偏移量提取网络的网络参数。Based on the base position loss information corresponding to each of the collected images, the network parameters of the roof position extraction network and the offset extraction network are adjusted through back propagation.
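The base-position supervision above can be illustrated with a minimal sketch: the roof position is translated by the offset, and the result is compared with the base-position ground truth. The coordinate representation and the L1 loss are illustrative assumptions, not the disclosure's specification.

```python
import numpy as np

def base_position_loss(roof_pos, offset, base_truth):
    """roof_pos, offset, base_truth: arrays of shape (N, 2) of (x, y) points.

    The predicted base position is the roof position translated by the
    predicted offset; an L1 loss (example choice) measures the gap to the
    annotated base position ground truth.
    """
    pred_base = roof_pos + offset
    return float(np.mean(np.abs(pred_base - base_truth)))

roof = np.array([[12.0, 30.0]])   # predicted roof position
off = np.array([[3.0, -4.0]])     # predicted roof-to-base offset
truth = np.array([[15.0, 26.0]])  # annotated base position
loss = base_position_loss(roof, off, truth)
```

Because the predicted base depends on both the roof position and the offset, this single loss term supervises the roof position extraction network and the offset extraction network jointly.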
在示出的一些实现方式中,所述屋顶区域提取网络、偏移量提取网络与所述屋顶位置提取网络共享特征提取网络。In some illustrated implementations, the roof area extraction network, the offset extraction network and the roof position extraction network share a feature extraction network.
在示出的一些实现方式中,所述训练样本集的至少部分采集图像还标注了第二屋顶区域真值信息,真实偏移量以及屋顶位置真值信息;In some implementations shown, at least part of the collected images of the training sample set are also marked with the second roof area ground truth information, the real offset and the roof position ground truth information;
所述装置100还包括如下至少一项:The device 100 also includes at least one of the following:
第一调整模块,用于基于所述至少部分采集图像标注的第二屋顶区域真值信息以及针对所述至少部分采集图像获得的屋顶区域,调整所述屋顶区域提取网络的网络参数;A first adjustment module, configured to adjust the network parameters of the roof area extraction network based on the ground truth information of the second roof area marked on the at least part of the captured image and the roof area obtained for the at least part of the captured image;
第二调整模块,用于基于所述至少部分采集图像标注的真实偏移量以及针对所述至少部分采集图像获得的偏移量,调整所述偏移量提取网络的网络参数;The second adjustment module is configured to adjust the network parameters of the offset extraction network based on the real offset marked by the at least part of the captured image and the offset obtained for the at least part of the captured image;
第三调整模块,用于基于所述至少部分采集图像标注的屋顶位置真值信息以及针对所述至少部分采集图像获得的屋顶位置,调整所述屋顶位置提取网络的网络参数。The third adjustment module is configured to adjust the network parameters of the roof position extraction network based on the roof position ground truth information marked on the at least part of the collected images and the roof position obtained for the at least part of the collected images.
在示出的一些实现方式中,所述至少部分采集图像还标注了建筑物边框真值信息;所述装置100还包括:In some implementations shown, the at least part of the collected images are also marked with the true value information of the building frame; the device 100 also includes:
提取模块,用于利用所述建筑物底座提取网络包括的建筑物边框提取网络,提取所述至少部分采集图像对应的建筑物边框;其中,所述建筑物边框提取网络包括所述特征提取网络;The extraction module is configured to use the building frame extraction network included in the building base extraction network to extract the building frame corresponding to the at least part of the collected images; wherein the building frame extraction network includes the feature extraction network;
第四调整模块,用于基于所述至少部分采集图像标注的建筑物边框真值信息与针对所述至少部分采集图像获得的所述建筑物边框,调整所述建筑物边框提取网络的网络参数。The fourth adjustment module is configured to adjust the network parameters of the building frame extraction network based on the true value information of the building frame marked on the at least part of the collected images and the building frame obtained for the at least part of the collected images.
在示出的一些实现方式中,所述装置100还包括:In some implementations shown, the device 100 further includes:
预训练模块，用于利用所述训练样本集中标注了第二屋顶区域真值信息，真实偏移量以及屋顶位置真值信息的采集图像，对所述建筑物底座提取网络进行预训练。The pre-training module is configured to pre-train the building base extraction network using the collected images in the training sample set that are annotated with second roof area ground truth information, real offsets, and roof position ground truth information.
在示出的一些实现方式中,所述训练样本集中的采集图像标注有第一真实偏移量;所述装置还包括:In some implementations shown, the collected images in the training sample set are marked with the first real offset; the device also includes:
偏移量获得模块，用于利用所述偏移量提取网络从多个旋转图像，获得与多种预设角度分别对应的第二预测偏移量；所述第二预测偏移量指示所述旋转图像中屋顶与底座之间的偏移量；所述多个旋转图像通过将所述采集图像分别旋转所述多种预设角度而得到；An offset obtaining module, configured to use the offset extraction network to obtain, from a plurality of rotated images, second predicted offsets respectively corresponding to a plurality of preset angles; the second predicted offset indicates the offset between the roof and the base in the rotated image; the plurality of rotated images are obtained by rotating the collected image by the respective preset angles;
选择模块,用于将所述第一真实偏移量分别旋转所述多种预设角度,得到与所述多种预设角度分别对应的第二真实偏移量;A selection module, configured to rotate the first real offset by the multiple preset angles to obtain second real offsets respectively corresponding to the multiple preset angles;
第四调整模块,用于基于与所述多种预设角度分别对应的所述第二真实偏移量和获得的第二预测偏移量,调整所述偏移量提取网络的网络参数。The fourth adjustment module is configured to adjust the network parameters of the offset extraction network based on the second real offset corresponding to the various preset angles and the obtained second predicted offset.
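The construction of the second real offsets in the selection module above amounts to rotating the annotated first real offset by each preset angle. A minimal sketch follows, assuming a standard counter-clockwise 2-D rotation; the disclosure does not fix the angle convention, so the sign convention here is an assumption.

```python
import numpy as np

def rotate_offset(offset, angle_deg):
    """Rotate a 2-D offset vector (dx, dy) counter-clockwise by angle_deg,
    using the standard 2-D rotation matrix."""
    t = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(t), -np.sin(t)],
                    [np.sin(t),  np.cos(t)]])
    return rot @ np.asarray(offset, dtype=float)

first_offset = np.array([3.0, 0.0])  # annotated first real offset
# second real offsets, one per preset angle
targets = {a: rotate_offset(first_offset, a) for a in (90, 180, 270)}
```

Each second real offset then serves as the supervision target for the second predicted offset obtained from the correspondingly rotated image, encouraging the offset extraction network to be consistent under rotation.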
在示出的一些实现方式中，所述偏移量获得模块，具体用于：针对所述多种预设角度中的每一预设角度，利用偏移量提取网络，将所述采集图像对应的第一图像特征旋转所述预设角度，得到与所述预设角度对应的第二图像特征；In some of the illustrated implementations, the offset obtaining module is specifically configured to: for each of the plurality of preset angles, use the offset extraction network to rotate the first image feature corresponding to the collected image by the preset angle, obtaining a second image feature corresponding to the preset angle;
基于所述第二图像特征,得到与所述预设角度对应的第二预测偏移量。Based on the second image feature, a second predicted offset corresponding to the preset angle is obtained.
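The feature-rotation step above can be illustrated as follows, under the assumption that the preset angles are multiples of 90 degrees so that a lossless `np.rot90` can stand in for the rotation applied to the first image feature; a real implementation would typically rotate feature maps with an interpolating warp, which the disclosure does not specify.

```python
import numpy as np

def second_feature(first_feature, angle_deg):
    """Rotate an (H, W, C) feature map counter-clockwise by a multiple of 90
    degrees, producing the second image feature for that preset angle."""
    assert angle_deg % 90 == 0, "sketch assumes 90-degree preset angles"
    return np.rot90(first_feature, k=(angle_deg // 90) % 4, axes=(0, 1))

feat = np.arange(2 * 3 * 1).reshape(2, 3, 1)  # toy 2x3 single-channel feature
rotated = second_feature(feat, 90)
```

Rotating the shared features instead of re-running the backbone on each rotated image avoids repeated feature extraction; the second predicted offset is then regressed from each rotated feature map.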
与所述任一实现方式相对应的,本公开还提出一种图像处理装置。该装置可以包括:Corresponding to any of the above implementation manners, the present disclosure further proposes an image processing device. The device can include:
接收模块,用于接收待处理遥感图像;A receiving module, configured to receive remote sensing images to be processed;
提取模块，用于利用建筑物底座提取网络，提取所述待处理遥感图像中的建筑物屋顶区域以及偏移量；其中，所述建筑物底座提取网络通过如前述任一实现方式示出的神经网络训练方法训练得到，所述偏移量表征屋顶区域与底座区域之间的偏移量；The extraction module is configured to use a building base extraction network to extract the building roof area and the offset from the remote sensing image to be processed; wherein the building base extraction network is trained by the neural network training method shown in any of the foregoing implementations, and the offset characterizes the offset between the roof area and the base area;
平移模块,用于利用所述偏移量对所述屋顶区域进行平移变换,得到所述待处理遥感图像对应的建筑物底座区域。A translation module, configured to use the offset to perform translation transformation on the roof area to obtain the building base area corresponding to the remote sensing image to be processed.
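The translation module's inference step, shifting the extracted roof region by the predicted offset to obtain the building base region, can be sketched as follows. The binary-mask representation and integer offset are illustrative assumptions; the trained network that would produce them is not shown.

```python
import numpy as np

def roof_to_base(roof_mask, offset):
    """Translate a binary roof mask by integer (dy, dx) to the base region;
    pixels shifted in from outside the image are zero."""
    dy, dx = offset
    base = np.roll(roof_mask, shift=(dy, dx), axis=(0, 1))
    if dy > 0: base[:dy, :] = 0
    elif dy < 0: base[dy:, :] = 0
    if dx > 0: base[:, :dx] = 0
    elif dx < 0: base[:, dx:] = 0
    return base

# toy roof mask; in practice both mask and offset come from the network
roof = np.zeros((6, 6), dtype=int); roof[1:3, 1:3] = 1
base = roof_to_base(roof, (2, 1))  # shift the roof down-right onto the base
```

This mirrors training: the network learns roof regions (which are usually fully visible in off-nadir imagery) plus a roof-to-base offset, and the base region, often occluded by the building itself, is recovered by this translation.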
本公开示出的神经网络训练装置和/或图像处理装置的实施例可以应用于电子设备上。相应地,本公开公开了一种电子设备,该设备可以包括:处理器。The embodiments of the neural network training device and/or image processing device shown in the present disclosure can be applied to electronic equipment. Accordingly, the present disclosure discloses an electronic device, which may include: a processor.
用于存储处理器可执行指令的存储器。Memory used to store processor-executable instructions.
其中,所述处理器被配置为调用所述存储器中存储的可执行指令,实现前述神经网络训练方法和/或图像处理方法。Wherein, the processor is configured to invoke the executable instructions stored in the memory to implement the aforementioned neural network training method and/or image processing method.
请参见图11,图11为本公开示出的一种电子设备的硬件结构示意图。Please refer to FIG. 11 , which is a schematic diagram of a hardware structure of an electronic device shown in the present disclosure.
如图11所示，该电子设备可以包括用于执行指令的处理器，用于进行网络连接的网络接口，用于为处理器存储运行数据的内存，以及用于存储神经网络训练装置和/或图像处理装置对应指令的非易失性存储器。As shown in Figure 11, the electronic device may include a processor for executing instructions, a network interface for network connection, a memory for storing operating data for the processor, and a non-volatile memory for storing instructions corresponding to the neural network training apparatus and/or the image processing apparatus.
其中，装置的实施例可以通过软件实现，也可以通过硬件或者软硬件结合的方式实现。以软件实现为例，作为一个逻辑意义上的装置，是通过其所在电子设备的处理器将非易失性存储器中对应的计算机程序指令读取到内存中运行形成的。从硬件层面而言，除了图11所示的处理器、内存、网络接口、以及非易失性存储器之外，实施例中装置所在的电子设备通常根据该电子设备的实际功能，还可以包括其他硬件，对此不再赘述。The apparatus embodiments may be implemented by software, by hardware, or by a combination of software and hardware. Taking a software implementation as an example, the apparatus, as a logical device, is formed by the processor of the electronic device in which it is located reading the corresponding computer program instructions from the non-volatile memory into the memory for execution. At the hardware level, in addition to the processor, memory, network interface, and non-volatile memory shown in Figure 11, the electronic device in which the apparatus of an embodiment is located may also include other hardware according to its actual functions, which will not be detailed here.
可以理解的是,为了提升处理速度,装置对应指令也可以直接存储于内存中,在此不作限定。It can be understood that, in order to increase the processing speed, the device corresponding instructions may also be directly stored in the memory, which is not limited herein.
本公开提出一种计算机可读存储介质,所述存储介质存储有计算机程序,所述计算机程序可以用于使处理器执行前述神经网络训练方法和/或图像处理方法。The present disclosure proposes a computer-readable storage medium, the storage medium stores a computer program, and the computer program can be used to cause a processor to execute the aforementioned neural network training method and/or image processing method.
本领域技术人员应明白，本公开一个或多个实施例可提供为方法、系统或计算机程序产品。因此，本公开一个或多实施例可采用完全硬件实施例、完全软件实施例或结合软件和硬件方面的实施例的形式。而且，本公开一个或多个实施例可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质（可以包括但不限于磁盘存储器、CD-ROM、光学存储器等）上实施的计算机程序产品的形式。Those skilled in the art will appreciate that one or more embodiments of the present disclosure may be provided as a method, a system, or a computer program product. Accordingly, one or more embodiments of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present disclosure may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
本公开中的“和/或”表示至少具有两者中的其中一个,例如,“A和/或B”可以包括三种方案:A、B、以及“A和B”。"And/or" in the present disclosure means at least one of the two, for example, "A and/or B" may include three options: A, B, and "A and B".
本公开中的各个实施例均采用递进的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于数据处理设备实施例而言,由于其基本相似于方法实施例,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。Each embodiment in the present disclosure is described in a progressive manner, the same and similar parts of the various embodiments can be referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for the data processing device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for relevant parts, please refer to part of the description of the method embodiment.
以上对本公开特定实施例进行了描述。其它实施例在所附权利要求书的范围内。在一些情况下,在权利要求书中记载的行为或步骤可以按照不同于实施例中的顺序来执行并且仍然可以实现期望的结果。另外,在附图中描绘的过程不一定要求示出的特定顺序或者连续顺序才能实现期望的结果。在某些实现方式中,多任务处理和并行处理也是可以的或者可能是有利的。The specific embodiments of the present disclosure have been described above. Other implementations are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. Multitasking and parallel processing are also possible, or may be advantageous, in certain implementations.
本公开中描述的主题及功能操作的实施例可以在以下中实现：数字电子电路、有形体现的计算机软件或固件、可以包括本公开中公开的结构及其结构性等同物的计算机硬件、或者它们中的一个或多个的组合。本公开中描述的主题的实施例可以实现为一个或多个计算机程序，即编码在有形非暂时性程序载体上以被数据处理装置执行或控制数据处理装置的操作的计算机程序指令中的一个或多个模块。可替代地或附加地，程序指令可以被编码在人工生成的传播信号上，例如机器生成的电、光或电磁信号，该信号被生成以将信息编码并传输到合适的接收机装置以由数据处理装置执行。计算机存储介质可以是机器可读存储设备、机器可读存储基板、随机或串行存取存储器设备、或它们中的一个或多个的组合。Embodiments of the subject matter and the functional operations described in this disclosure can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this disclosure and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this disclosure can be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus. Alternatively or additionally, the program instructions can be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver apparatus for execution by a data processing apparatus. A computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
本公开中描述的处理及逻辑流程可以由执行一个或多个计算机程序的一个或多个可编程计算机执行,以通过根据输入数据进行操作并生成输出来执行相应的功能。所述处理及逻辑流程还可以由专用逻辑电路—例如FPGA(现场可编程门阵列)或ASIC(专用集成电路)来执行,并且装置也可以实现为专用逻辑电路。The processes and logic flows described in this disclosure can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, such as an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit).
适合用于执行计算机程序的计算机可以包括，例如通用和/或专用微处理器，或任何其他类型的中央处理单元。通常，中央处理单元将从只读存储器和/或随机存取存储器接收指令和数据。计算机的基本组件可以包括用于实施或执行指令的中央处理单元以及用于存储指令和数据的一个或多个存储器设备。通常，计算机还将可以包括用于存储数据的一个或多个大容量存储设备，例如磁盘、磁光盘或光盘等，或者计算机将可操作地与此大容量存储设备耦接以从其接收数据或向其传送数据，抑或两种情况兼而有之。然而，计算机不是必须具有这样的设备。此外，计算机可以嵌入在另一设备中，例如移动电话、个人数字助理(PDA)、移动音频或视频播放器、游戏操纵台、全球定位系统(GPS)接收机、或例如通用串行总线(USB)闪存驱动器的便携式存储设备，仅举几例。Computers suitable for the execution of a computer program include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The essential components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, to receive data from them, transfer data to them, or both. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name just a few.
适合于存储计算机程序指令和数据的计算机可读介质可以包括所有形式的非易失性存储器、媒介和存储器设备，例如可以包括半导体存储器设备(例如EPROM、EEPROM和闪存设备)、磁盘(例如内部硬盘或可移动盘)、磁光盘以及CD ROM和DVD-ROM盘。处理器和存储器可由专用逻辑电路补充或并入专用逻辑电路中。Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
虽然本公开包含许多具体实施细节，但是这些不应被解释为限制任何公开的范围或所要求保护的范围，而是主要用于描述特定公开的具体实施例的特征。本公开内在多个实施例中描述的某些特征也可以在单个实施例中被组合实施。另一方面，在单个实施例中描述的各种特征也可以在多个实施例中分开实施或以任何合适的子组合来实施。此外，虽然特征可以如所述在某些组合中起作用并且甚至最初如此要求保护，但是来自所要求保护的组合中的一个或多个特征在一些情况下可以从该组合中去除，并且所要求保护的组合可以指向子组合或子组合的变型。While this disclosure contains many specific implementation details, these should not be construed as limiting the scope of any disclosure or of what may be claimed, but rather as describing features of particular embodiments of a particular disclosure. Certain features that are described in this disclosure in the context of multiple embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.
类似地,虽然在附图中以特定顺序描绘了操作,但是这不应被理解为要求这些操作以所示的特定顺序执行或顺次执行、或者要求所有例示的操作被执行,以实现期望的结果。在某些情况下,多任务和并行处理可能是有利的。此外,所述实施例中的各种系统模块和组件的分离不应被理解为在所有实施例中均需要这样的分离,并且应当理解,所描述的程序组件和系统通常可以一起集成在单个软件产品中,或者封装成多个软件产品。Similarly, while operations are depicted in the figures in a particular order, this should not be construed as requiring that those operations be performed in the particular order shown, or sequentially, or that all illustrated operations be performed, to achieve the desired result. In some cases, multitasking and parallel processing may be advantageous. Furthermore, the separation of the various system modules and components in the described embodiments should not be construed as requiring such separation in all embodiments, and it should be understood that the described program components and systems can often be integrated together in a single software product, or packaged into multiple software products.
由此,主题的特定实施例已被描述。其他实施例在所附权利要求书的范围以内。在某些情况下,权利要求书中记载的动作可以以不同的顺序执行并且仍实现期望的结果。此外,附图中描绘的处理并非必需所示的特定顺序或顺次顺序,以实现期望的结果。在某些实现中,多任务和并行处理可能是有利的。Thus, certain embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
以上仅为本公开一个或多个实施例的较佳实施例而已,并不用以限制本公开一个或多个实施例,凡在本公开一个或多个实施例的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本公开一个或多个实施例保护的范围之内。The above are only preferred embodiments of one or more embodiments of the present disclosure, and are not intended to limit one or more embodiments of the present disclosure. Any modification, equivalent replacement, improvement, etc. should be included in the protection scope of one or more embodiments of the present disclosure.

Claims (17)

  1. 一种神经网络训练方法,包括:A neural network training method, comprising:
    针对多个区域中的每个区域,For each of the multiple regions,
    获取与所述区域对应的一帧或多帧采集图像;其中,在所述区域对应多帧采集图像的情况下,存在至少两帧所述采集图像具有不同的采集角度;Acquiring one or more frames of captured images corresponding to the region; wherein, in the case where the region corresponds to multiple frames of captured images, there are at least two frames of the captured images with different capture angles;
    将所述区域对应的一帧所述采集图像作为所述区域对应的目标采集图像进行底座区域真值信息标注;Using one frame of the captured image corresponding to the area as the target captured image corresponding to the area to mark the true value information of the base area;
    将所述区域对应的所述目标采集图像所标注的底座区域真值信息,确定为所述区域对应的各帧所述采集图像的底座区域真值信息;determining the true value information of the base area marked in the target acquisition image corresponding to the area as the true value information of the base area of each frame of the acquisition image corresponding to the area;
    基于所述多个区域分别对应的所述采集图像和所述目标采集图像,得到训练样本集以基于所述训练样本集进行神经网络训练。A training sample set is obtained based on the collected images and the target collected images respectively corresponding to the multiple regions, so as to perform neural network training based on the training sample set.
  2. 根据权利要求1所述的方法,还包括:The method according to claim 1, further comprising:
    获取所述训练样本集;Obtain the training sample set;
    利用建筑物底座提取网络,Using the building base to extract the network,
    获得所述训练样本集中各所述采集图像分别对应的屋顶区域与偏移量;其中,所述偏移量表征屋顶区域与底座区域之间的偏移量;Obtain the roof area and offset corresponding to each of the collected images in the training sample set; wherein, the offset represents the offset between the roof area and the base area;
    针对各所述采集图像,基于获得的所述采集图像对应的所述偏移量,对与所述采集图像对应的所述屋顶区域进行平移变换,获得所述采集图像对应的底座区域;For each of the collected images, based on the obtained offset corresponding to the collected images, performing translation transformation on the roof area corresponding to the collected images to obtain a base area corresponding to the collected images;
    基于各所述采集图像分别对应的底座区域真值信息以及针对各所述采集图像分别获得的底座区域,调整所述建筑物底座提取网络的网络参数。Based on the ground truth information of the base area corresponding to each of the collected images and the base area obtained for each of the collected images, the network parameters of the building base extraction network are adjusted.
  3. 根据权利要求1所述的方法,所述训练样本集的获得过程还包括:According to the method according to claim 1, the obtaining process of the training sample set also includes:
    对所述每个区域分别对应的所述目标采集图像进行底座位置真值信息标注;Annotating the true value information of the base position on the target acquisition images corresponding to each of the regions;
    针对每个区域,将所述区域对应的目标采集图像所标注的底座位置真值信息,确定为所述区域对应的各帧所述采集图像的底座位置真值信息。For each area, the base position truth information marked in the target acquisition image corresponding to the area is determined as the base position truth information of each frame of the acquisition image corresponding to the area.
  4. 根据权利要求3所述的方法,还包括:The method according to claim 3, further comprising:
    获取所述训练样本集;Obtain the training sample set;
    利用建筑物底座提取网络包括的屋顶区域提取网络、偏移量提取网络，以及屋顶位置提取网络，获得所述训练样本集中各采集图像分别对应的屋顶区域、偏移量与屋顶位置，其中，所述偏移量表征屋顶区域与底座区域之间的偏移量；Using the roof area extraction network, the offset extraction network, and the roof position extraction network included in the building base extraction network, obtain the roof area, offset, and roof position corresponding to each collected image in the training sample set, wherein the offset characterizes the offset between the roof area and the base area;
    基于所述各采集图像分别对应的底座区域真值信息,以及针对所述各采集图像分别获得的所述屋顶区域与所述偏移量,调整所述屋顶区域提取网络的网络参数;Adjusting the network parameters of the roof area extraction network based on the ground truth information of the base area corresponding to the collected images, and the roof area and the offset respectively obtained for the collected images;
    基于所述各采集图像分别对应的底座位置真值信息,以及针对所述各采集图像分别获得的所述屋顶位置与所述偏移量,调整所述屋顶位置提取网络和所述偏移量提取网络的网络参数。Adjusting the roof position extraction network and the offset extraction based on the true value information of the base positions corresponding to the collected images, and the roof positions and the offsets respectively obtained for the collected images The network parameters of the network.
  5. 根据权利要求4所述的方法，所述基于所述各采集图像分别对应的底座区域真值信息，以及针对所述各采集图像分别获得的所述屋顶区域与所述偏移量，调整所述屋顶区域提取网络的网络参数，包括：The method according to claim 4, wherein the adjusting the network parameters of the roof area extraction network based on the base area ground truth information corresponding to each of the collected images, and the roof area and the offset respectively obtained for each of the collected images, comprises:
    针对所述各采集图像中的每帧图像，利用所述图像对应的所述偏移量，对所述图像对应的底座区域真值信息进行平移，得到所述图像对应的第一屋顶区域真值信息；For each frame of image in the collected images, using the offset corresponding to the image, the base area ground truth information corresponding to the image is translated to obtain the first roof area ground truth information corresponding to the image;
    基于所述图像对应的所述第一屋顶区域真值信息与针对所述图像获得的所述屋顶区域,得到所述图像对应的屋顶区域损失信息;Obtaining roof area loss information corresponding to the image based on the ground truth information of the first roof area corresponding to the image and the roof area obtained for the image;
    基于所述各采集图像分别对应的所述屋顶区域损失信息,通过反向传播调整所述屋顶区域提取网络的网络参数。Based on the roof area loss information respectively corresponding to the collected images, network parameters of the roof area extraction network are adjusted through back propagation.
  6. 根据权利要求4或5所述的方法，所述基于所述各采集图像分别对应的底座位置真值信息，以及针对所述各采集图像分别获得的所述屋顶位置与所述偏移量，调整所述屋顶位置提取网络和所述偏移量提取网络的网络参数，包括：The method according to claim 4 or 5, wherein the adjusting the network parameters of the roof position extraction network and the offset extraction network based on the base position ground truth information corresponding to each of the collected images, and the roof position and the offset respectively obtained for each of the collected images, comprises:
    针对所述各采集图像中的每帧图像,利用所述图像对应的所述偏移量,对所述图像对应的所述屋顶位置进行平移,获得所述图像对应的底座位置;For each frame of image in the collected images, using the offset corresponding to the image, the position of the roof corresponding to the image is translated to obtain the position of the base corresponding to the image;
    基于所述图像对应的底座位置真值信息以及针对所述图像获得的所述底座位置,得到所述图像对应的底座位置损失信息;Obtaining base position loss information corresponding to the image based on the base position truth information corresponding to the image and the base position obtained for the image;
    基于所述各采集图像分别对应的所述底座位置损失信息,通过反向传播调整所述屋顶位置提取网络和所述偏移量提取网络的网络参数。Based on the base position loss information respectively corresponding to the acquired images, network parameters of the roof position extraction network and the offset extraction network are adjusted through back propagation.
  7. 根据权利要求4至6任一所述的方法,所述屋顶区域提取网络、偏移量提取网络与所述屋顶位置提取网络共享特征提取网络。The method according to any one of claims 4 to 6, wherein the roof area extraction network, the offset extraction network and the roof position extraction network share a feature extraction network.
  8. 根据权利要求7所述的方法,所述训练样本集的至少部分采集图像还标注了第二屋顶区域真值信息,真实偏移量以及屋顶位置真值信息;According to the method according to claim 7, at least part of the collected images of the training sample set are also marked with the second roof area true value information, the real offset and the roof position true value information;
    所述方法还包括如下至少一项:The method also includes at least one of the following:
    基于所述至少部分采集图像标注的所述第二屋顶区域真值信息以及针对所述至少部分采集图像获得的屋顶区域,调整所述屋顶区域提取网络的网络参数;adjusting network parameters of the roof region extraction network based on the ground truth information of the second roof region marked on the at least part of the captured image and the roof region obtained for the at least part of the collected image;
    基于所述至少部分采集图像标注的所述真实偏移量以及针对所述至少部分采集图像获得的偏移量,调整所述偏移量提取网络的网络参数;adjusting network parameters of the offset extraction network based on the real offset marked by the at least partially captured image and the offset obtained for the at least partially captured image;
    基于所述至少部分采集图像标注的所述屋顶位置真值信息以及针对所述至少部分采集图像获得的屋顶位置,调整所述屋顶位置提取网络的网络参数。Adjusting network parameters of the roof position extraction network based on the roof position ground truth information marked on the at least part of the captured image and the roof position obtained for the at least part of the captured image.
  9. 根据权利要求8所述的方法,所述至少部分采集图像还标注了建筑物边框真值信息;所述方法还包括:According to the method according to claim 8, said at least part of the captured image is also labeled with the true value information of the building border; said method also includes:
    利用所述建筑物底座提取网络包括的建筑物边框提取网络,提取所述至少部分采集图像对应的建筑物边框;其中,所述建筑物边框提取网络包括所述特征提取网络;Using the building frame extraction network included in the building base extraction network to extract the building frame corresponding to the at least part of the captured image; wherein the building frame extraction network includes the feature extraction network;
    基于所述至少部分采集图像标注的建筑物边框真值信息与针对所述至少部分采集图像获得的所述建筑物边框,调整所述建筑物边框提取网络的网络参数。Adjusting network parameters of the building frame extraction network based on the ground truth information of the building frame marked on the at least part of the captured image and the building frame obtained for the at least part of the captured image.
  10. 根据权利要求4至9任一所述的方法,还包括:The method according to any one of claims 4 to 9, further comprising:
    利用所述训练样本集中标注了第二屋顶区域真值信息,真实偏移量以及屋顶位置真值信息的采集图像,对所述建筑物底座提取网络进行预训练。Pre-training is performed on the building base extraction network by using the collected images marked with the true value information of the second roof area, the real offset and the true value information of the roof position in the training sample set.
  11. 根据权利要求4至10任一所述的方法,所述训练样本集中的采集图像标注有第一真实偏移量;所述方法还包括:According to the method according to any one of claims 4 to 10, the collected images in the training sample set are marked with the first real offset; the method also includes:
    利用所述偏移量提取网络从多个旋转图像，获得与多种预设角度分别对应的第二预测偏移量；所述第二预测偏移量指示所述旋转图像中屋顶区域与底座区域之间的偏移量；所述多个旋转图像通过将所述采集图像分别旋转所述多种预设角度而得到；Using the offset extraction network to obtain, from a plurality of rotated images, second predicted offsets respectively corresponding to a plurality of preset angles; the second predicted offset indicates the offset between the roof area and the base area in the rotated image; the plurality of rotated images are obtained by rotating the collected image by the respective preset angles;
    将所述第一真实偏移量分别旋转所述多种预设角度,得到与所述多种预设角度分别对应的第二真实偏移量;Rotating the first real offset by the multiple preset angles to obtain second real offsets respectively corresponding to the multiple preset angles;
    基于与所述多种预设角度分别对应的所述第二真实偏移量和所述第二预测偏移量,调整所述偏移量提取网络的网络参数。Adjusting network parameters of the offset extraction network based on the second real offset and the second predicted offset respectively corresponding to the various preset angles.
  12. 根据权利要求11所述的方法,所述利用偏移量提取网络从多个旋转图像,获得与多种预设角度分别对应的第二预测偏移量,包括:The method according to claim 11, said using the offset extraction network to obtain the second predicted offset corresponding to various preset angles respectively from multiple rotated images, comprising:
    针对所述多种预设角度中的每一预设角度,For each preset angle in the plurality of preset angles,
    利用所述偏移量提取网络,将所述采集图像对应的第一图像特征旋转所述预设角度,得到与所述预设角度对应的第二图像特征;Using the offset extraction network to rotate the first image feature corresponding to the acquired image by the preset angle to obtain a second image feature corresponding to the preset angle;
    基于所述第二图像特征,得到与所述预设角度对应的第二预测偏移量。Based on the second image feature, a second predicted offset corresponding to the preset angle is obtained.
  13. 一种图像处理方法,包括:An image processing method, comprising:
    接收待处理遥感图像;Receive remote sensing images to be processed;
    利用建筑物底座提取网络，提取所述待处理遥感图像中的建筑物屋顶区域以及偏移量；其中，所述建筑物底座提取网络通过如权利要求1至12任一所述的神经网络训练方法训练得到，所述偏移量表征屋顶区域与底座区域之间的偏移量；Using a building base extraction network to extract the building roof area and the offset from the remote sensing image to be processed; wherein the building base extraction network is trained by the neural network training method according to any one of claims 1 to 12, and the offset characterizes the offset between the roof area and the base area;
    利用所述偏移量对所述屋顶区域进行平移变换,得到所述待处理遥感图像对应的建筑物底座区域。The translation transformation is performed on the roof area by using the offset to obtain the building base area corresponding to the remote sensing image to be processed.
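The final translation step can be sketched as shifting a binary roof mask by the predicted offset. A minimal illustration, assuming an integer pixel offset `(dy, dx)`; pixels shifted outside the image are dropped and vacated pixels stay zero. `translate_mask` is a hypothetical helper, not the network's actual post-processing.

```python
import numpy as np

def translate_mask(mask, offset):
    """Translate a binary roof mask by (dy, dx) pixels to approximate
    the building base area (claim 13 sketch, integer offsets assumed)."""
    dy, dx = offset
    out = np.zeros_like(mask)
    h, w = mask.shape
    ys, xs = np.nonzero(mask)                     # roof pixel coordinates
    ys2, xs2 = ys + dy, xs + dx                   # shifted coordinates
    keep = (ys2 >= 0) & (ys2 < h) & (xs2 >= 0) & (xs2 < w)
    out[ys2[keep], xs2[keep]] = 1                 # base pixels inside image
    return out

roof = np.zeros((5, 5), dtype=np.uint8)
roof[1:3, 1:3] = 1                       # 2x2 roof patch
base = translate_mask(roof, (1, 1))      # roof-to-base offset of (1, 1)
```

In practice the predicted offset would be continuous and the mask shift implemented with a differentiable or subpixel warp, but the geometric idea is the same.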
  14. A neural network training apparatus, comprising:
    an acquisition module, configured to acquire, for each of a plurality of areas, one or more frames of collected images corresponding to the area, wherein, in a case where the area corresponds to multiple frames of collected images, at least two frames of the collected images have different collection angles;
    a first labeling module, configured to label base area ground-truth information on one frame of the collected images corresponding to the area, the frame serving as a target collected image corresponding to the area; and
    a first determination module, configured to determine the base area ground-truth information labeled on the target collected image corresponding to the area as the base area ground-truth information of each frame of the collected images corresponding to the area, and to obtain a training sample set based on the collected images and the target collected images respectively corresponding to the plurality of areas, so as to perform neural network training based on the training sample set.
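The annotation-sharing idea in the apparatus above can be sketched as propagating one base-area label per area to every frame of that area. All names here are hypothetical; the data structures simply stand in for the modules' inputs and outputs.

```python
def build_training_samples(area_images, area_annotations):
    """Propagate the single base-area annotation of each area's target
    image to all collected frames of that area (claim 14 sketch).

    area_images: dict mapping area id -> list of frame ids, where frames
        of one area may have different collection angles.
    area_annotations: dict mapping area id -> base-area ground truth,
        annotated once on the area's target collected image.
    """
    samples = []
    for area_id, frames in area_images.items():
        ground_truth = area_annotations[area_id]   # annotated once, reused
        samples.extend((frame, ground_truth) for frame in frames)
    return samples

samples = build_training_samples(
    {"area1": ["area1_view0", "area1_view1"]},
    {"area1": "base_polygon_area1"},
)
```

Because the base does not move between viewing angles, one annotation can supervise every frame of the same area, which is what makes this labeling scheme cheap.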
  15. An image processing apparatus, comprising:
    a receiving module, configured to receive a remote sensing image to be processed;
    an extraction module, configured to extract, using a building base extraction network, a building roof area and an offset in the remote sensing image to be processed, wherein the building base extraction network is trained by the neural network training method according to any one of claims 1 to 12, and the offset characterizes the offset between the roof area and the base area; and
    a translation module, configured to perform a translation transformation on the roof area using the offset to obtain a building base area corresponding to the remote sensing image to be processed.
  16. An electronic device, comprising:
    a processor; and
    a memory for storing processor-executable instructions;
    wherein the processor runs the executable instructions to implement the neural network training method according to any one of claims 1 to 12 and/or the image processing method according to claim 13.
  17. A computer-readable storage medium storing a computer program, the computer program being used to cause a processor to execute the neural network training method according to any one of claims 1 to 12 and/or the image processing method according to claim 13.
PCT/CN2021/137544 2021-05-31 2021-12-13 Methods for neural network training and image processing, apparatus, device and storage medium WO2022252558A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110602248.5A CN113344180A (en) 2021-05-31 2021-05-31 Neural network training and image processing method, device, equipment and storage medium
CN202110602248.5 2021-05-31

Publications (1)

Publication Number Publication Date
WO2022252558A1 true WO2022252558A1 (en) 2022-12-08

Family

ID=77473204

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/137544 WO2022252558A1 (en) 2021-05-31 2021-12-13 Methods for neural network training and image processing, apparatus, device and storage medium

Country Status (3)

Country Link
CN (1) CN113344180A (en)
TW (1) TW202248910A (en)
WO (1) WO2022252558A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117911501A (en) * 2024-03-20 2024-04-19 陕西中铁华博实业发展有限公司 High-precision positioning method for metal processing drilling
CN117911501B (en) * 2024-03-20 2024-06-04 陕西中铁华博实业发展有限公司 High-precision positioning method for metal processing drilling

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344180A (en) * 2021-05-31 2021-09-03 上海商汤智能科技有限公司 Neural network training and image processing method, device, equipment and storage medium
CN115096375B (en) * 2022-08-22 2022-11-04 启东亦大通自动化设备有限公司 Carrier roller running state monitoring method and device based on carrier roller carrying trolley detection

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200098141A1 (en) * 2018-09-21 2020-03-26 Revive AI, Inc. Systems and methods for home improvement visualization
CN111931836A (en) * 2020-07-31 2020-11-13 上海商米科技集团股份有限公司 Method and device for acquiring neural network training image
CN112149585A (en) * 2020-09-27 2020-12-29 上海商汤智能科技有限公司 Image processing method, device, equipment and storage medium
CN112232425A (en) * 2020-10-21 2021-01-15 腾讯科技(深圳)有限公司 Image processing method, image processing device, storage medium and electronic equipment
CN112329559A (en) * 2020-10-22 2021-02-05 空间信息产业发展股份有限公司 Method for detecting homestead target based on deep convolutional neural network
CN113344180A (en) * 2021-05-31 2021-09-03 上海商汤智能科技有限公司 Neural network training and image processing method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110991491A (en) * 2019-11-12 2020-04-10 苏州智加科技有限公司 Image labeling method, device, equipment and storage medium


Also Published As

Publication number Publication date
TW202248910A (en) 2022-12-16
CN113344180A (en) 2021-09-03

Similar Documents

Publication Publication Date Title
Zhang et al. Jaguar: Low latency mobile augmented reality with flexible tracking
Brachmann et al. Visual camera re-localization from RGB and RGB-D images using DSAC
WO2022252558A1 (en) Methods for neural network training and image processing, apparatus, device and storage medium
Cheng et al. Panoptic-deeplab
WO2022252557A1 (en) Neural network training method and apparatus, image processing method and apparatus, device, and storage medium
WO2022062543A1 (en) Image processing method and apparatus, device and storage medium
CN109785298B (en) Multi-angle object detection method and system
US20120011119A1 (en) Object recognition system with database pruning and querying
US9984301B2 (en) Non-matching feature-based visual motion estimation for pose determination
CN102959946A (en) Augmenting image data based on related 3d point cloud data
JP2023516500A (en) Systems and methods for image-based location determination
WO2022141718A1 (en) Method and system for assisting point cloud-based object detection
CN112434618A (en) Video target detection method based on sparse foreground prior, storage medium and equipment
CN116935332A (en) Fishing boat target detection and tracking method based on dynamic video
WO2022247126A1 (en) Visual localization method and apparatus, and device, medium and program
Liao et al. Se-calib: Semantic edges based lidar-camera boresight online calibration in urban scenes
Tang et al. Fast multidirectional vehicle detection on aerial images using region based convolutional neural networks
Di et al. A unified framework for piecewise semantic reconstruction in dynamic scenes via exploiting superpixel relations
CN112200303B (en) Laser radar point cloud 3D target detection method based on context-dependent encoder
Wang et al. Fine-Grained Cross-View Geo-Localization Using a Correlation-Aware Homography Estimator
US20230089845A1 (en) Visual Localization Method and Apparatus
Chaturvedi et al. Small object detection using retinanet with hybrid anchor box hyper tuning using interface of Bayesian mathematics
CN114077892A (en) Human body skeleton sequence extraction and training method, device and storage medium
GB2592583A (en) Aligning images
Paul et al. Machine learning advances aiding recognition and classification of Indian monuments and landmarks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21943910

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21943910

Country of ref document: EP

Kind code of ref document: A1