WO2022252557A1 - Neural network training method and apparatus, image processing method and apparatus, device, and storage medium - Google Patents


Info

Publication number
WO2022252557A1
Authority
WO
WIPO (PCT)
Prior art keywords
offset
image
extraction network
network
preset angles
Prior art date
Application number
PCT/CN2021/137532
Other languages
French (fr)
Chinese (zh)
Inventor
王金旺
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司 filed Critical 上海商汤智能科技有限公司
Publication of WO2022252557A1 publication Critical patent/WO2022252557A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Definitions

  • the present disclosure relates to the field of computer technology, and in particular to a neural network training method and apparatus, an image processing method and apparatus, a device, and a storage medium.
  • using the offset extraction network to obtain, from the multiple second sample images, second predicted offsets respectively corresponding to multiple preset angles includes: for each of the multiple preset angles, using the offset extraction network to rotate a first image feature corresponding to the first sample image by the preset angle to obtain a second image feature corresponding to the preset angle; and obtaining, based on the second image feature, a second predicted offset corresponding to the preset angle.
  • the spatial transformation network includes a sampler that performs image rotation based on interpolation; the sampler includes a sampling grid determined based on the preset angle corresponding to the spatial transformation network; the sampling grid characterizes the pixel correspondence between the first image feature and the second image feature;
  • using the spatial transformation network to rotate the first image feature by a preset angle to obtain the second image feature corresponding to the preset angle includes: using the sampler and the sampling grid to determine, in the first image feature, a plurality of pixels corresponding to each pixel in the second image feature, and mapping the pixel values of the plurality of pixels based on interpolation to obtain the pixel value of each pixel in the second image feature.
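As a concrete illustration of the sampler described above, the sketch below rotates a small feature map with a precomputed sampling grid and bilinear interpolation. This is a hedged toy example in NumPy, not the patented implementation; all function names are hypothetical.

```python
import numpy as np

def make_rotation_grid(h, w, angle_deg):
    """For each output pixel, compute the (y, x) source coordinate in the
    input feature, rotating about the feature-map center (the per-angle
    sampling grid is fixed and can be precomputed)."""
    theta = np.deg2rad(angle_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Inverse mapping: each output coordinate is traced back into the input.
    src_y = cy + (ys - cy) * np.cos(theta) - (xs - cx) * np.sin(theta)
    src_x = cx + (ys - cy) * np.sin(theta) + (xs - cx) * np.cos(theta)
    return src_y, src_x

def bilinear_sample(feat, grid):
    """Map each output pixel from up to four neighboring input pixels."""
    src_y, src_x = grid
    h, w = feat.shape
    y0 = np.clip(np.floor(src_y).astype(int), 0, h - 1)
    x0 = np.clip(np.floor(src_x).astype(int), 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(src_y, 0, h - 1) - y0
    wx = np.clip(src_x, 0, w - 1) - x0
    return (feat[y0, x0] * (1 - wy) * (1 - wx) +
            feat[y0, x1] * (1 - wy) * wx +
            feat[y1, x0] * wy * (1 - wx) +
            feat[y1, x1] * wy * wx)

feat = np.arange(16, dtype=float).reshape(4, 4)   # toy "first image feature"
grid = make_rotation_grid(4, 4, 90)               # grid fixed per preset angle
rotated = bilinear_sample(feat, grid)             # "second image feature"
```

Because the grid and the interpolation are smooth functions of the inputs, this sampling is differentiable, which is the property that lets gradients flow through the rotation during training.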
  • adjusting the network parameters of the offset extraction network based on the second real offsets and the second predicted offsets respectively corresponding to the multiple preset angles includes: obtaining offset loss information respectively corresponding to the multiple preset angles according to the second real offsets and the second predicted offsets respectively corresponding to the multiple preset angles; and adjusting the network parameters of the offset extraction network based on the offset loss information respectively corresponding to the multiple preset angles.
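A minimal sketch of the per-angle loss aggregation just described. The disclosure does not fix the loss form, so a smooth-L1 offset loss is assumed here; the helper names are illustrative only.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Elementwise smooth-L1 (assumed loss form), summed over components."""
    d = np.abs(np.asarray(pred, float) - np.asarray(target, float))
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).sum()

def total_offset_loss(preds_by_angle, gts_by_angle):
    """One loss term per preset angle (predicted vs. rotated ground truth),
    aggregated into a single scalar for backpropagation."""
    return sum(smooth_l1(preds_by_angle[a], gts_by_angle[a])
               for a in gts_by_angle)

preds = {0: (2.1, -1.0), 90: (1.0, 2.0)}   # second predicted offsets
gts   = {0: (2.0, -1.0), 90: (1.0, 2.2)}   # second real offsets
loss = total_offset_loss(preds, gts)
```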
  • the method for generating the training sample set used to train the offset extraction network includes: for each of a plurality of regions, acquiring one or more frames of original sample images corresponding to the region, where, when a region corresponds to multiple frames of original sample images, at least two of those frames have different acquisition angles; taking one frame of the original sample images corresponding to the region as the first sample image corresponding to the region and annotating it with base area ground-truth information; determining the base area ground-truth information annotated in the first sample image corresponding to the region as the base area ground-truth information of each frame of the original sample images corresponding to the region; and obtaining the training sample set based on the original sample images and the first sample images respectively corresponding to the plurality of regions.
  • the present disclosure also proposes an image processing device, including: an acquisition module configured to acquire a first target image to be processed; an offset acquisition module configured to use an offset extraction network to obtain, from a plurality of second target images, second offsets respectively corresponding to multiple preset angles, where the offset extraction network includes a network trained by the neural network training method shown in any of the foregoing embodiments, each second offset indicates an offset between a roof and a base in a second target image, and the plurality of second target images are obtained by rotating the first target image by the multiple preset angles respectively; a reverse rotation module configured to reversely rotate, for each of the multiple preset angles, the second offset corresponding to the angle to obtain a reverse second offset corresponding to the angle; and a fusion module configured to fuse the reverse second offsets respectively corresponding to the multiple preset angles to obtain a first offset corresponding to the first target image.
  • an acquisition module configured to acquire a first target image to be processed
  • an offset acquisition module configured to use an offset extraction network to obtain, from a plurality of second target images, second offsets respectively corresponding to multiple preset angles
  • the device further includes: a roof area obtaining module configured to use a roof area extraction network included in a base area extraction network to obtain a roof area in the first target image, where the base area extraction network also includes the offset extraction network and is trained by the neural network training method shown in the foregoing embodiments; and a translation module configured to perform a translation transformation on the obtained roof area using the first offset corresponding to the first target image, to obtain a base area corresponding to the first target image.
  • a roof area obtaining module configured to use the roof area extraction network included in the base area extraction network to obtain the roof area in the first target image
  • the base area extraction network also includes the offset extraction network
  • the base area extraction network is trained by using the neural network training method shown in the foregoing embodiment
  • the translation module is configured to perform a translation transformation on the obtained roof area using the first offset corresponding to the first target image, to obtain a base area corresponding to the first target image.
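The translation step described in the bullets above can be sketched as follows. This is an illustrative NumPy version that shifts a binary roof mask by an integer (dx, dy) offset, not the device's actual code; pixels shifted outside the image are simply dropped.

```python
import numpy as np

def translate_mask(mask, dx, dy):
    """Shift a 2D binary mask by dx along x (columns) and dy along y (rows)."""
    base = np.zeros_like(mask)
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    ys2, xs2 = ys + dy, xs + dx
    # Keep only pixels that remain inside the image after the shift.
    keep = (ys2 >= 0) & (ys2 < h) & (xs2 >= 0) & (xs2 < w)
    base[ys2[keep], xs2[keep]] = 1
    return base

roof = np.zeros((5, 5), dtype=int)
roof[1:3, 1:3] = 1                       # a 2x2 "roof" region
base = translate_mask(roof, dx=1, dy=2)  # predicted offset (1, 2)
```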
  • when the sample image is rotated by an angle, the offset is also rotated by that angle.
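The point above — rotating a sample image rotates its roof-to-base offset by the same angle — is what turns one labeled offset into several per-angle ground truths. A hedged sketch using a plain 2D rotation matrix (a hypothetical helper; image-coordinate conventions may differ in practice):

```python
import math

def rotate_offset(dx, dy, angle_deg):
    """Rotate an (x, y) offset vector by angle_deg degrees."""
    t = math.radians(angle_deg)
    return (dx * math.cos(t) - dy * math.sin(t),
            dx * math.sin(t) + dy * math.cos(t))

# For the 90-degree multiples used later (0/90/180/270), the rotation just
# permutes and negates the two components, so no precision is lost.
second_gt = {a: rotate_offset(3.0, 4.0, a) for a in (0, 90, 180, 270)}
```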
  • the effect of expanding the sample images together with their real offsets can be achieved. In this way, a small amount of data labeled with offsets can be used to train a high-precision offset extraction network.
  • the rotation process of the sample image can be placed inside the offset extraction network, so that sample image rotation is performed within the offset extraction network without affecting the training of the other branches of the comprehensive network, that is, without affecting the convergence speed of the other branches, which improves network training efficiency.
  • the comprehensive network includes the offset extraction network.
  • STN (Spatial Transformer Network)
  • the rotation process becomes differentiable, so that the gradient can be backpropagated normally, and the offset extraction network can then be trained directly.
  • Fig. 1 is a method flowchart of a neural network training method shown in the present disclosure
  • Fig. 2a is a schematic diagram of an offset shown in the present disclosure
  • FIG. 3 is a schematic diagram of a building base extraction process shown in the present disclosure
  • FIG. 6 is a schematic diagram of an offset extraction network training process shown in the present disclosure.
  • the first sample image may refer to a remote sensing image marked with a first real offset.
  • the offset refers to the offset between the roof and the base in the image.
  • the roof includes 10 pixels, and the base can be obtained by translating the 10 pixels according to the offset.
  • the first real offset may be information indicating the real offset between the roof and the base of the building in the first sample image.
  • the first real offset may be information in the form of (x, y) vector.
  • x and y represent the offsets of the pixel points in the roof region and the corresponding pixel points in the base region in the x-axis and y-axis directions, respectively.
  • the offset may be marked in advance according to the actual offset between the roof and the base of the building in the first sample image. The present disclosure does not specifically limit the labeling manner of the offset.
  • a Mask R-CNN, which has higher accuracy in region representation, can be used.
  • the Mask R-CNN may include an RPN (Region Proposal Network), an RoI Align (Region of Interest Align) unit, etc.
  • the preset angle can be set according to business requirements.
  • the number of preset angles can be determined according to the sample size that needs to be expanded. For example, if a large number of samples need to be expanded, a large number of preset angles can be set.
  • the present disclosure does not specifically limit the value and quantity of the preset angles.
  • the various preset angles are used to rotate the sample image or image features corresponding to the sample image.
  • in order to facilitate the training of the offset extraction network, the spatial transformation network can be used to rotate the image, so that the rotation process becomes differentiable, the gradient can be backpropagated normally, and the network can be trained directly.
  • FIG. 6 is a schematic diagram of an offset extraction network training process shown in the present disclosure.
  • the offset expansion unit may include 4 STN branches. As shown in FIG. 6 , the offset expansion unit can use the STN to respectively rotate the first image feature F0 by 0 degrees, 90 degrees, 180 degrees and 270 degrees to obtain the corresponding second image features F1-F4.
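For the four branches in FIG. 6 specifically, rotation by multiples of 90 degrees permutes pixels exactly and needs no interpolation; a toy NumPy sketch (the STN sampler generalizes this to arbitrary angles, and the names F0-F4 follow the figure):

```python
import numpy as np

F0 = np.random.rand(8, 8)  # first image feature (toy single-channel size)

# One rotated copy per branch: 0, 90, 180, 270 degrees.
branches = {a: np.rot90(F0, k=a // 90) for a in (0, 90, 180, 270)}
F1, F2, F3, F4 = (branches[a] for a in (0, 90, 180, 270))
```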
  • the offset expansion unit can also use a classifier to process the second image features F1-F4 to obtain the second predicted offsets corresponding to rotating the first sample image by 0 degrees, 90 degrees, 180 degrees and 270 degrees.
  • the classifier may include multiple convolutional layers, fully connected layers, and mapping units. In some implementation manners, in order to simplify the network structure, parameters of at least some convolutional layers and fully connected layers in multiple classifiers may be shared.
  • the first sample image is also marked with real roof area information.
  • the sample size can be expanded by rotating the sample images and the real offsets, so as to achieve the effect of training a high-precision offset extraction network using a small amount of labeled data.
  • the building frame information is introduced. Since the three extraction networks for the roof area, the offset and the building frame share the feature extraction network, on the one hand the three extraction networks can be associated with one another: through the shared feature extraction network, the supervision information of each task can be shared and the convergence of the network accelerated, achieving the effect of training a high-precision base area extraction network with a small amount of labeled data; on the other hand, the roof area and offset extraction networks can perceive the complete building area features, thereby improving extraction performance.
  • the first image features corresponding to each first sample image can be rotated by the STNs corresponding to 0 degrees, 90 degrees, 180 degrees and 270 degrees respectively, and offsets extracted from them, to obtain the second predicted offsets corresponding to each first sample image after rotation by 0 degrees, 90 degrees, 180 degrees and 270 degrees respectively.
  • the sample size can be expanded by rotating the sample image and the real offset, so as to achieve the effect of using a small amount of labeled data to train a high-precision offset extraction branch.
  • the rotation transformation of image features can be performed in the offset extraction branch without affecting the training of other branches, which improves the efficiency of network training.
  • the joint training method enables the network to learn various kinds of information, and the training of the branches supervises and promotes one another, which improves network training efficiency and achieves the effect of training a high-precision base area extraction network with a small amount of labeled data.
  • feature extraction networks such as the shared backbone network can extract features that are more beneficial to base region extraction, thereby improving the accuracy of base region extraction.
  • FIG. 8 is a method flowchart of an image processing method shown in the present disclosure. As shown in Figure 8, the method may include:
  • the present disclosure also proposes a neural network training device 90 .
  • FIG. 9 is a schematic structural diagram of a neural network training device shown in the present disclosure.
  • a rotation module 92 configured to rotate the first real offset by the various preset angles to obtain second real offsets respectively corresponding to the various preset angles;
  • the adjustment module 93 is configured to adjust network parameters of the offset extraction network based on the second real offset and the second predicted offset respectively corresponding to the various preset angles.
  • the spatial transformation network includes a sampler that performs image rotation based on interpolation; the sampler includes a sampling grid determined based on the preset angle corresponding to the spatial transformation network;
  • the sampling grid characterizes the pixel correspondence between the first image feature and the second image feature.
  • the obtaining module is configured to: use the sampler and the sampling grid to determine, in the first image feature, a plurality of pixels corresponding to each pixel in the second image feature, and map the pixel values of the plurality of pixels based on interpolation to obtain the pixel value of each pixel in the second image feature.
  • the adjusting module 93 is configured to: obtain offset loss information respectively corresponding to the multiple preset angles according to the second real offsets and the second predicted offsets respectively corresponding to the multiple preset angles; and adjust network parameters of the offset extraction network based on the offset loss information respectively corresponding to the multiple preset angles.
  • the device 90 further includes: a sample expansion module configured to acquire, for each of a plurality of regions, one or more frames of original sample images corresponding to the region, where, when a region corresponds to multiple frames of original sample images, at least two of those frames have different acquisition angles; take one frame of the original sample images corresponding to the region as the first sample image corresponding to the region and annotate it with base area ground-truth information; determine the base area ground-truth information annotated in the first sample image corresponding to the region as the base area ground-truth information of each frame of the original sample images corresponding to the region; and obtain the training sample set based on the original sample images and the first sample images respectively corresponding to the plurality of regions.
  • a sample expansion module configured to acquire, for each of a plurality of regions, one or more frames of original sample images corresponding to the region; where, when a region corresponds to multiple frames of original sample images, at least two frames of the original sample images have different acquisition angles; one frame of the original sample image corresponding to the region
  • the present disclosure further proposes an image processing device.
  • the device may include: an acquisition module configured to acquire a first target image to be processed; an offset acquisition module configured to use the offset extraction network to obtain, from a plurality of second target images, second offsets respectively corresponding to multiple preset angles, where the offset extraction network includes a network trained by the neural network training method shown in any of the foregoing embodiments, each second offset indicates an offset between a roof and a base in a second target image, and the plurality of second target images are obtained by rotating the first target image by the multiple preset angles respectively; a reverse rotation module configured, for each of the multiple preset angles, to reversely rotate the second offset corresponding to that angle to obtain a reverse second offset corresponding to the angle; and a fusion module configured to fuse the reverse second offsets respectively corresponding to the multiple preset angles to obtain a first offset corresponding to the first target image.
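At inference, the per-angle predictions are rotated back and fused as described above. The sketch below assumes simple averaging as the fusion rule, since the text does not specify one; the helper names are hypothetical.

```python
import math

def rotate(dx, dy, angle_deg):
    """Rotate an (x, y) offset vector by angle_deg degrees."""
    t = math.radians(angle_deg)
    return (dx * math.cos(t) - dy * math.sin(t),
            dx * math.sin(t) + dy * math.cos(t))

def fuse_offsets(second_offsets):
    """second_offsets: {angle: (dx, dy)} predicted on the rotated images.
    Rotate each prediction back by its angle, then average (assumed fusion)."""
    back = [rotate(dx, dy, -a) for a, (dx, dy) in second_offsets.items()]
    n = len(back)
    return (sum(v[0] for v in back) / n, sum(v[1] for v in back) / n)

# Toy predictions that all agree on (3, 4) once rotated back.
preds = {0: (3.0, 4.0), 90: (-4.0, 3.0), 180: (-3.0, -4.0), 270: (4.0, -3.0)}
first_offset = fuse_offsets(preds)
```

Averaging the back-rotated predictions is one natural choice because it cancels independent per-angle errors; a median or learned fusion would fit the same interface.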


Abstract

The present application provides a neural network training method and apparatus, an image processing method and apparatus, a device, and a storage medium. The method may comprise: obtaining, from a plurality of second sample images by means of an offset extraction network, second predicted offsets respectively corresponding to a plurality of preset angles, each second predicted offset indicating an offset between a roof and a base in a second sample image, a first sample image being labeled with a first real offset, and the plurality of second sample images being obtained by rotating the first sample image by the plurality of preset angles respectively; rotating the first real offset by the plurality of preset angles to obtain second real offsets respectively corresponding to the plurality of preset angles; and adjusting network parameters of the offset extraction network on the basis of the second real offsets and second predicted offsets respectively corresponding to the plurality of preset angles.

Description

Neural network training method and apparatus, image processing method and apparatus, device, and storage medium
Cross-Reference to Related Applications
This disclosure claims priority to Chinese patent application No. 2021106022362, filed on May 31, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of computer technology, and in particular to a neural network training method and apparatus, an image processing method and apparatus, a device, and a storage medium.
Background
With the gradual increase of the urbanization rate, timely statistics on buildings are required for tasks such as urban planning, map drawing, and building change monitoring.
At present, building statistics are mainly performed by counting building bases. When counting building bases, it is necessary to first use an offset extraction network, generated based on a neural network, to extract an offset characterizing the offset between the roof and the base, and to use a roof area extraction network to extract the building roof; the roof is then transformed using the offset to obtain the base.
However, the cost of data annotation is high, so a large number of annotated samples including real offsets cannot be obtained, and it is difficult to train a high-precision offset extraction network with only a small number of annotated samples.
Summary of the Invention
In view of this, the present disclosure discloses at least a neural network training method. The method may include: using an offset extraction network to obtain, from a plurality of second sample images, second predicted offsets respectively corresponding to multiple preset angles, where each second predicted offset indicates an offset between a roof and a base in a second sample image, the first sample image is annotated with a first real offset, and the plurality of second sample images are obtained by rotating the first sample image by the multiple preset angles respectively; rotating the first real offset by the multiple preset angles respectively to obtain second real offsets respectively corresponding to the multiple preset angles; and adjusting network parameters of the offset extraction network based on the second real offsets and the second predicted offsets respectively corresponding to the multiple preset angles.
In some illustrated embodiments, using the offset extraction network to obtain, from the plurality of second sample images, the second predicted offsets respectively corresponding to the multiple preset angles includes: for each of the multiple preset angles, using the offset extraction network to rotate a first image feature corresponding to the first sample image by the preset angle to obtain a second image feature corresponding to the preset angle; and obtaining, based on the second image feature, a second predicted offset corresponding to the preset angle.
In some illustrated embodiments, using the offset extraction network to rotate the first image feature corresponding to the first sample image by the preset angle to obtain the second image feature corresponding to the preset angle includes: using a spatial transformation network, included in the offset extraction network and corresponding to the preset angle, to rotate the first image feature by the preset angle to obtain the second image feature corresponding to the preset angle.
In some illustrated embodiments, the spatial transformation network includes a sampler that performs image rotation based on interpolation, where the sampler includes a sampling grid determined based on the preset angle corresponding to the spatial transformation network, and the sampling grid characterizes the pixel correspondence between the first image feature and the second image feature. Using the spatial transformation network to rotate the first image feature by the preset angle to obtain the second image feature corresponding to the preset angle includes: using the sampler and the sampling grid to determine, in the first image feature, a plurality of pixels corresponding to each pixel in the second image feature, and mapping the pixel values of the plurality of pixels based on interpolation to obtain the pixel value of each pixel in the second image feature.
In some illustrated embodiments, adjusting the network parameters of the offset extraction network based on the second real offsets and the second predicted offsets respectively corresponding to the multiple preset angles includes: obtaining offset loss information respectively corresponding to the multiple preset angles according to the second real offsets and the second predicted offsets respectively corresponding to the multiple preset angles; and adjusting the network parameters of the offset extraction network based on the offset loss information respectively corresponding to the multiple preset angles.
In some illustrated embodiments, the first sample image is also annotated with real roof area information, and the method further includes: using a roof area extraction network to obtain roof area prediction information in the first sample image, where the roof area extraction network and the offset extraction network share a feature extraction network and belong to the same base area extraction network, and the base area extraction network is used to obtain a base area based on the obtained roof area and offset; and training the roof area extraction network based on the real roof area information and the roof area prediction information.
In some illustrated embodiments, the base area extraction network includes a building frame extraction network, the building frame extraction network includes the feature extraction network, and the first sample image is also annotated with real building frame information. The method further includes: using the building frame extraction network to obtain building frame prediction information in the first sample image; and training the building frame extraction network based on the real building frame information and the building frame prediction information.
In some illustrated embodiments, a method for generating the training sample set used to train the offset extraction network includes: for each of a plurality of regions, acquiring one or more frames of original sample images corresponding to the region, where, when a region corresponds to multiple frames of original sample images, at least two of those frames have different acquisition angles; taking one frame of the original sample images corresponding to the region as the first sample image corresponding to the region and annotating it with base area ground-truth information; determining the base area ground-truth information annotated in the first sample image corresponding to the region as the base area ground-truth information of each frame of the original sample images corresponding to the region; and obtaining the training sample set based on the original sample images and the first sample images respectively corresponding to the plurality of regions.
The present disclosure proposes an image processing method, including: acquiring a first target image to be processed; using an offset extraction network to obtain, from a plurality of second target images, second offsets respectively corresponding to multiple preset angles, where the offset extraction network includes a network trained by the neural network training method shown in any of the foregoing embodiments, each second offset indicates an offset between a roof and a base in a second target image, and the plurality of second target images are obtained by rotating the first target image by the multiple preset angles respectively; for each of the multiple preset angles, reversely rotating the second offset corresponding to the angle to obtain a reverse second offset corresponding to the angle; and fusing the reverse second offsets respectively corresponding to the multiple preset angles to obtain a first offset corresponding to the first target image.
In some illustrated embodiments, the method further includes: using a roof area extraction network included in a base area extraction network to obtain a roof area in the first target image, where the base area extraction network also includes the offset extraction network and is trained by the neural network training method shown in the foregoing embodiments; and performing a translation transformation on the obtained roof area using the first offset corresponding to the first target image, to obtain a base area corresponding to the first target image.
本公开还提出一种神经网络训练装置,包括:获得模块,用于利用偏移量提取网络从多个第二样本图像,获得与多种预设角度分别对应的第二预测偏移量;所述第二预测偏移量指示第二样本图像中屋顶与底座之间的偏移量;所述多个第二样本图像通过将第一样本图像分别旋转所述多种预设角度而得到;所述第一样本图像标注有第一真实偏移量;旋转模块,用于将所述第一真实偏移量分别旋转所述多种预设角度,得到与所述多 种预设角度分别对应的第二真实偏移量;调整模块,用于基于与所述多种预设角度分别对应的第二真实偏移量和所述第二预测偏移量,调整所述偏移量提取网络的网络参数。The present disclosure also proposes a neural network training device, including: an obtaining module, configured to use an offset extraction network to obtain second predicted offsets respectively corresponding to various preset angles from a plurality of second sample images; The second predicted offset indicates the offset between the roof and the base in the second sample image; the plurality of second sample images are obtained by rotating the first sample image by the various preset angles; The first sample image is marked with a first real offset; the rotation module is used to rotate the first real offset by the various preset angles respectively to obtain the various preset angles respectively A corresponding second real offset; an adjustment module, configured to adjust the offset extraction network based on the second real offset and the second predicted offset respectively corresponding to the various preset angles network parameters.
本公开还提出一种图像处理装置，包括：获取模块，用于获取待处理的第一目标图像；偏移量获得模块，用于利用偏移量提取网络从多个第二目标图像，获得与多种预设角度分别对应的第二偏移量；其中，所述偏移量提取网络包括利用如前述任一实施方式示出的神经网络训练方法训练得到的网络；所述第二偏移量指示第二目标图像中屋顶与底座之间的偏移量；所述多个第二目标图像通过将所述第一目标图像分别旋转所述多种预设角度而得到；逆向旋转模块，用于针对所述多种预设角度中的各角度，将所述角度对应的第二偏移量进行逆向旋转，得到所述角度对应的逆向第二偏移量；融合模块，用于对所述多种预设角度分别对应的逆向第二偏移量进行融合，得到所述第一目标图像对应的第一偏移量。The present disclosure further proposes an image processing apparatus, including: an acquisition module, configured to acquire a first target image to be processed; an offset obtaining module, configured to obtain, by using an offset extraction network, second offsets respectively corresponding to various preset angles from a plurality of second target images, where the offset extraction network includes a network trained by using the neural network training method shown in any one of the foregoing embodiments, the second offset indicates an offset between a roof and a base in a second target image, and the plurality of second target images are obtained by rotating the first target image by the various preset angles respectively; a reverse rotation module, configured to, for each of the various preset angles, reversely rotate the second offset corresponding to the angle to obtain a reverse second offset corresponding to the angle; and a fusion module, configured to fuse the reverse second offsets respectively corresponding to the various preset angles to obtain a first offset corresponding to the first target image.
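The reverse-rotation-and-fusion step performed by the apparatus above can be sketched as follows. This is a minimal numpy illustration under stated assumptions: the preset angles and per-angle offsets are made up for the example, and averaging is used as the fusion operation (the disclosure says the reverse second offsets are fused but does not fix a particular fusion method).

```python
import numpy as np

def fuse_offsets(second_offsets, angles_deg):
    # For each preset angle, rotate the predicted second offset back by the
    # inverse angle, then fuse the reverse second offsets. Averaging is an
    # assumption here; summation or weighting would also be possible fusions.
    fused = np.zeros(2)
    for off, ang in zip(second_offsets, angles_deg):
        t = np.deg2rad(-ang)  # inverse rotation
        rot = np.array([[np.cos(t), -np.sin(t)],
                        [np.sin(t),  np.cos(t)]])
        fused += rot @ np.asarray(off, dtype=float)
    return fused / len(second_offsets)

# Hypothetical per-angle predictions for a true first offset of (3, 0):
angles = [0, 90, 180, 270]
preds = [[3, 0], [0, 3], [-3, 0], [0, -3]]  # (3, 0) rotated by each angle
print(fuse_offsets(preds, angles))
```

When every per-angle prediction is consistent, the fused result reproduces the original offset; in practice the fusion averages out per-angle prediction noise.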
在示出的一些实施方式中，所述装置还包括：屋顶区域获得模块，用于利用底座区域提取网络包括的屋顶区域提取网络，获得所述第一目标图像中的屋顶区域；其中，所述底座区域提取网络还包括所述偏移量提取网络；所述底座区域提取网络利用如前述实施方式示出的神经网络训练方法训练得到；平移模块，用于利用所述第一目标图像对应的第一偏移量，对获得的所述屋顶区域进行平移变换，得到所述第一目标图像对应的底座区域。In some of the illustrated embodiments, the apparatus further includes: a roof area obtaining module, configured to obtain the roof area in the first target image by using a roof area extraction network included in a base area extraction network, where the base area extraction network further includes the offset extraction network and is trained by using the neural network training method shown in the foregoing embodiments; and a translation module, configured to perform a translation transformation on the obtained roof area by using the first offset corresponding to the first target image, to obtain the base area corresponding to the first target image.
本公开还提出一种电子设备,包括:处理器;用于存储处理器可执行指令的存储器;其中,所述处理器通过运行所述可执行指令以实现所述的神经网络训练方法和/或的图像处理方法。The present disclosure also proposes an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein, the processor executes the executable instructions to implement the neural network training method and/or image processing method.
本公开还提出一种计算机可读存储介质,所述存储介质存储有计算机程序,所述计算机程序用于使处理器执行所述的神经网络训练方法和/或的图像处理方法。The present disclosure also proposes a computer-readable storage medium, the storage medium stores a computer program, and the computer program is used to make a processor execute the neural network training method and/or the image processing method.
在前述实施方式示出的方案中，第一，由于可以利用偏移量提取网络，获得与多种预设角度分别对应的第二预测偏移量，以及将所述第一真实偏移量分别旋转所述多种预设角度，得到与所述多种预设角度分别对应的第二真实偏移量，然后可以利用与所述多种预设角度分别对应的第二真实偏移量和获得的第二预测偏移量，调整所述偏移量提取网络的网络参数。In the solutions shown in the foregoing embodiments, first, the offset extraction network can be used to obtain the second predicted offsets respectively corresponding to the various preset angles, and the first real offset can be rotated by the various preset angles respectively to obtain the second real offsets respectively corresponding to the various preset angles; the network parameters of the offset extraction network can then be adjusted by using the second real offsets respectively corresponding to the various preset angles and the obtained second predicted offsets.
因此可以利用图像旋转一定角度后，偏移量也会旋转该角度的特性，通过对图像(或其图像特征)和真实偏移量进行旋转，达到扩充具有真实偏移量的样本图像的效果，从而可以利用少量标注了偏移量的标注数据，训练得到高精度偏移量提取网络。Therefore, the property that when an image is rotated by a certain angle the offset is also rotated by that angle can be exploited: by rotating the image (or its image features) and the real offset, the set of sample images with real offsets is effectively expanded, so that a high-precision offset extraction network can be trained with only a small amount of annotation data labeled with offsets.
第二，可以将样本图像的旋转过程置于偏移量提取网络中，由此可以在偏移量提取网络内部进行样本图像旋转，不会影响综合网络的其它分支的训练，即不会影响其它分支的收敛速度，进而提升了网络训练效率。所述综合网络包括所述偏移量提取网络。Second, the rotation of the sample image can be placed inside the offset extraction network, so that the sample image is rotated within the offset extraction network without affecting the training of the other branches of the integrated network, that is, without affecting the convergence speed of the other branches, thereby improving the network training efficiency. The integrated network includes the offset extraction network.
第三，可以利用STN(Spatial Transformer Network,空间变换网络)进行图像旋转，从而使旋转过程变得可导，使梯度可以正常反向传播，进而可以直接对偏移量提取网络进行训练。Third, an STN (Spatial Transformer Network) can be used for image rotation, which makes the rotation process differentiable so that gradients can be back-propagated normally, and the offset extraction network can thus be trained directly.
应当理解的是,以上所述的一般描述和后文的细节描述仅是示例性和解释性的,并不能限制本公开。It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the present disclosure.
附图说明Description of drawings
为了更清楚地说明本公开一个或多个实施例或相关技术中的技术方案，下面将对实施例或相关技术描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本公开一个或多个实施例中记载的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to describe the technical solutions in one or more embodiments of the present disclosure or in the related art more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the related art. Obviously, the accompanying drawings in the following description are merely some embodiments described in one or more embodiments of the present disclosure, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts.
图1为本公开示出的一种神经网络训练方法的方法流程图;Fig. 1 is a method flowchart of a neural network training method shown in the present disclosure;
图2a为本公开示出的一种偏移量示意图;Fig. 2a is a schematic diagram of an offset shown in the present disclosure;
图2b为本公开示出的一种图像旋转90度后偏移量示意图;Fig. 2b is a schematic diagram of the offset after an image is rotated by 90 degrees shown in the present disclosure;
图3为本公开示出的一种建筑物底座提取流程示意图;FIG. 3 is a schematic diagram of a building base extraction process shown in the present disclosure;
图4为本公开示出的一种偏移量提取流程示意图;FIG. 4 is a schematic diagram of an offset extraction process shown in the present disclosure;
图5为本公开示出的一种利用空间变换网络进行图像旋转的流程示意图;FIG. 5 is a schematic flow diagram of image rotation using a space transformation network shown in the present disclosure;
图6为本公开示出的一种偏移量提取网络训练流程示意图;FIG. 6 is a schematic diagram of an offset extraction network training process shown in the present disclosure;
图7为本公开示出的一种建筑物底座提取流程示意图;FIG. 7 is a schematic diagram of a building base extraction process shown in the present disclosure;
图8为本公开示出的一种图像处理方法的方法流程图;FIG. 8 is a method flowchart of an image processing method shown in the present disclosure;
图9为本公开示出的一种神经网络训练装置的结构示意图;FIG. 9 is a schematic structural diagram of a neural network training device shown in the present disclosure;
图10为本公开示出的一种电子设备的硬件结构示意图。FIG. 10 is a schematic diagram of a hardware structure of an electronic device shown in the present disclosure.
具体实施方式Detailed ways
下面将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本公开相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本公开的一些方面相一致的设备和方法的例子。Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numerals in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of devices and methods consistent with aspects of the present disclosure as recited in the appended claims.
在本公开使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本公开。在本公开和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在可以包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本文中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。还应当理解,本文中所使用的词语“如果”,取决于语境,可以被解释成为“在……时”或“当……时”或“响应于确定”。The terminology used in the present disclosure is for the purpose of describing particular embodiments only, and is not intended to limit the present disclosure. As used in this disclosure and the appended claims, the singular forms "a", "the" and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items. It should also be understood that the word "if", as used herein, could be interpreted as "at" or "when" or "in response to a determination", depending on the context.
本公开旨在提出一种神经网络训练方法。该方法利用图像旋转一定角度后,偏移量也会旋转该角度的特性,通过对图像(或其图像特征)和真实偏移量进行旋转,达到扩充具有真实偏移量的样本图像的效果,从而可以利用少量标注了偏移量的标注数据,训练得到高精度偏移量提取网络。The present disclosure aims to propose a neural network training method. This method utilizes the characteristic that after the image is rotated by a certain angle, the offset will also rotate the angle. By rotating the image (or its image features) and the real offset, the effect of expanding the sample image with the real offset is achieved. In this way, a small amount of labeled data with offsets can be used to train a high-precision offset extraction network.
请参见图1,图1为本公开示出的一种神经网络训练方法的方法流程图。如图1所示,所述方法可以包括:Please refer to FIG. 1 . FIG. 1 is a method flowchart of a neural network training method shown in the present disclosure. As shown in Figure 1, the method may include:
S102，利用偏移量提取网络从多个第二样本图像，获得与多种预设角度分别对应的第二预测偏移量；所述第二预测偏移量指示所述第二样本图像中屋顶与底座之间的偏移量；所述多个第二样本图像通过将第一样本图像分别旋转所述多种预设角度而得到。S102: obtain, by using an offset extraction network, second predicted offsets respectively corresponding to various preset angles from a plurality of second sample images, where the second predicted offset indicates an offset between a roof and a base in the second sample image, and the plurality of second sample images are obtained by rotating a first sample image by the various preset angles respectively.
S104,将所述第一真实偏移量分别旋转所述多种预设角度,得到与所述多种预设角度分别对应的第二真实偏移量。S104. Rotate the first real offset by the multiple preset angles respectively to obtain second real offsets respectively corresponding to the multiple preset angles.
S106,基于与所述多种预设角度分别对应的所述第二真实偏移量和所述第二预测偏移量,调整所述偏移量提取网络的网络参数。S106. Adjust network parameters of the offset extraction network based on the second real offset and the second predicted offset respectively corresponding to the various preset angles.
所述神经网络训练方法可以应用于电子设备中。其中,所述电子设备可以通过搭载 与神经网络训练方法对应的软件装置执行所述方法。所述电子设备的类型可以是笔记本电脑,计算机,服务器,手机,PAD终端等。在本公开中不特别限定所述电子设备的类型。所述电子设备可以是客户端设备或服务端设备。所述服务端设备可以是云端。以下以执行主体为电子设备(以下简称设备)为例进行说明。The neural network training method can be applied to electronic equipment. Wherein, the electronic device can execute the method by carrying a software device corresponding to the neural network training method. The type of the electronic device may be a notebook computer, a computer, a server, a mobile phone, a PAD terminal and the like. The type of the electronic device is not particularly limited in the present disclosure. The electronic device may be a client device or a server device. The server device may be a cloud. In the following, an electronic device (hereinafter referred to as device) is taken as an example for description.
在一些实现方式中,所述设备可以响应于网络训练请求,执行S102。In some implementation manners, the device may execute S102 in response to the network training request.
所述第一样本图像,可以是指标注了第一真实偏移量的遥感图像。本公开实施例中,偏移量指的是图像中屋顶与底座之间的偏移量。例如,屋顶包括10个像素点,将该10个像素点按照所述偏移量进行平移,即可得到底座。The first sample image may refer to a remote sensing image marked with a first real offset. In the embodiments of the present disclosure, the offset refers to the offset between the roof and the base in the image. For example, the roof includes 10 pixels, and the base can be obtained by translating the 10 pixels according to the offset.
所述第一真实偏移量可以是指示第一样本图像中建筑物屋顶与底座真实偏移量的信息。例如,所述第一真实偏移量可以是(x,y)向量形式的信息。其中,x和y分别表示屋顶区域的像素点与底座区域对应位置的像素点在x轴和y轴方向上的偏移。在一些实现方式中,可以预先根据第一样本图像中的建筑物屋顶与底座之间的真实偏移量,进行偏移量标注。本公开不对偏移量的标注方式进行特别限定。The first real offset may be information indicating the real offset between the roof and the base of the building in the first sample image. For example, the first real offset may be information in the form of (x, y) vector. Wherein, x and y represent the offsets of the pixel points in the roof region and the corresponding pixel points in the base region in the x-axis and y-axis directions, respectively. In some implementation manners, the offset may be marked in advance according to the actual offset between the roof and the base of the building in the first sample image. The present disclosure does not specifically limit the labeling manner of the offset.
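What the (x, y) offset annotation encodes can be illustrated with a minimal numpy sketch: translating every roof pixel by the offset yields the corresponding base pixel. The coordinates and offset below are made up for illustration.

```python
import numpy as np

# A hypothetical "roof" of four pixel coordinates in (x, y) form, and an
# annotated roof-to-base offset (dx, dy) as described in the disclosure.
roof_pixels = np.array([[10, 10], [11, 10], [10, 11], [11, 11]])
offset = np.array([3, -2])  # first real offset: +3 along x, -2 along y

# The base region is obtained by translating every roof pixel by the offset.
base_pixels = roof_pixels + offset
print(base_pixels.tolist())
```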
由于第一样本图像坐标系是保持不变的,因此将第一样本图像旋转一定角度后,偏移量也会旋转相同的角度。Since the coordinate system of the first sample image remains unchanged, after the first sample image is rotated by a certain angle, the offset will also be rotated by the same angle.
请参见图2a与图2b,其中,图2a为本公开示出的一种偏移量示意图;图2b为本公开示出的一种图像旋转90度后偏移量示意图。Please refer to FIG. 2 a and FIG. 2 b , wherein FIG. 2 a is a schematic diagram of an offset shown in the present disclosure; FIG. 2 b is a schematic diagram of an offset after an image is rotated by 90 degrees shown in the present disclosure.
在图像旋转前,图像中屋顶与底座之间的偏移量可以如图2a所示。如图2b所示,在图像逆时针旋转90度后,由于坐标系不变,因此,偏移量也会旋转90度。Before the image is rotated, the offset between the roof and the base in the image can be shown in Figure 2a. As shown in Fig. 2b, after the image is rotated 90 degrees counterclockwise, since the coordinate system remains unchanged, the offset will also be rotated 90 degrees.
利用图像旋转一定角度后，偏移量也会旋转相同的角度的特性，可以将样本图像以及对应的真实偏移量进行各种角度的旋转，由此可以简便地扩充标注了真实偏移量的有标注样本数据，从而可以提升网络训练效果。By using the property that after an image is rotated by a certain angle the offset is also rotated by the same angle, the sample image and the corresponding real offset can be rotated by various angles, thereby easily expanding the annotated sample data labeled with real offsets, which can improve the network training effect.
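The offset rotation described above is a plain 2D rotation of the (x, y) vector. The sketch below uses the standard counter-clockwise rotation matrix (x right, y up); with image coordinates where y points down, the sign of the angle flips, so the convention is an assumption of this example.

```python
import numpy as np

def rotate_offset(offset, theta_deg):
    # Rotate a roof-to-base offset by the same angle as the image,
    # using a standard counter-clockwise 2D rotation matrix.
    t = np.deg2rad(theta_deg)
    rot = np.array([[np.cos(t), -np.sin(t)],
                    [np.sin(t),  np.cos(t)]])
    return rot @ np.asarray(offset, dtype=float)

# Rotating the image by 90 degrees rotates the offset by 90 degrees as well,
# as in FIG. 2a/2b: (3, 0) becomes (0, 3).
print(np.round(rotate_offset([3.0, 0.0], 90), 6))
```

Applying the inverse angle recovers the original offset, which is exactly the reverse rotation used later at inference time.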
所述偏移量提取网络，可以是基于目标检测网络构建的网络。所述目标检测网络可以是RCNN(Region Convolutional Neural Network,区域卷积神经网络)，FAST-RCNN(Fast Region Convolutional Neural Network,快速区域卷积神经网络)，FASTER-RCNN(Faster Region Convolutional Neural Network,更快速的区域卷积神经网络)或MASK-RCNN(Mask Region Convolutional Neural Network,掩膜区域卷积神经网络)中的任一。The offset extraction network may be a network constructed based on a target detection network. The target detection network may be any one of RCNN (Region Convolutional Neural Network), FAST-RCNN (Fast Region Convolutional Neural Network), FASTER-RCNN (Faster Region Convolutional Neural Network), or MASK-RCNN (Mask Region Convolutional Neural Network).
在一些实现方式中,为了提升偏移量提取精度,可以采用对区域表征精度更高的MASK-RCNN。所述MASK-RCNN可以包括RPN(Region Proposal Network,候选框生成网络),以及RoI Align(Region of Interest Align,感兴趣区域对齐)单元等。In some implementations, in order to improve the accuracy of offset extraction, a MASK-RCNN with higher accuracy for region representation can be used. The MASK-RCNN may include RPN (Region Proposal Network, candidate frame generation network), and RoI Align (Region of Interest Align, region of interest alignment) unit, etc.
其中,所述RPN网络用于生成与图像中各建筑物对应的候选框。在得到候选框后,可以进行候选框的回归和分类,得到各建筑物对应的边框。所述RoI Align单元用于根据所述建筑物对应的边框,从所述图像中提取出与所述建筑物对应的视觉特征。之后可以利用所述建筑物对应的视觉特征,提取屋顶与底座之间的偏移量。Wherein, the RPN network is used to generate candidate frames corresponding to each building in the image. After the candidate frame is obtained, the regression and classification of the candidate frame can be performed to obtain the frame corresponding to each building. The RoI Align unit is used to extract visual features corresponding to the building from the image according to the frame corresponding to the building. The offset between the roof and the base can then be extracted using the corresponding visual features of the building.
所述预设角度,可以根据业务需求进行设定。所述预设角度的数量可以根据需要扩充的样本量进行确定。例如,需要扩充大量样本,则可以设置大量的预设角度。本公开不对预设角度的数值和数量进行特别限定。所述多种预设角度用于旋转样本图像或所述样本图像对应的图像特征。The preset angle can be set according to business requirements. The number of preset angles can be determined according to the sample size that needs to be expanded. For example, if a large number of samples need to be expanded, a large number of preset angles can be set. The present disclosure does not specifically limit the value and quantity of the preset angles. The various preset angles are used to rotate the sample image or image features corresponding to the sample image.
在一些实现方式中,在执行S102时,可以先利用各预设角度分别生成对应的旋转矩阵。然后针对各预设角度利用该预设角度对应的旋转矩阵,对第一样本图像包括的各像素点进行移位,得到旋转后的第二样本图像。之后,可以将各旋转后的第二样本图像 输入所述偏移量提取网络,提取出与各旋转后的第二样本图像分别对应的第二预测偏移量。需要说明的是,在一些实现方式中,在对第一样本图像进行旋转时,可以先利用偏移量提取网络包含的特征提取网络对第一样本图像进行特征提取,得到第一图像特征;之后,对得到的第一图像特征进行旋转。由此可以减少旋转过程的运算量,以及可以减少对旋转后图像进行特征提取时引入的旋转误差,有助于提升网络训练效果。In some implementation manners, when performing S102, each preset angle may be used to generate corresponding rotation matrices respectively. Then, for each preset angle, the rotation matrix corresponding to the preset angle is used to shift each pixel included in the first sample image to obtain a rotated second sample image. Afterwards, each rotated second sample image may be input into the offset extraction network, and second predicted offsets respectively corresponding to each rotated second sample image may be extracted. It should be noted that, in some implementations, when the first sample image is rotated, the feature extraction network included in the offset extraction network can be used to perform feature extraction on the first sample image to obtain the first image feature ; After that, rotate the obtained first image features. This can reduce the amount of calculation in the rotation process, and can reduce the rotation error introduced when extracting features from the rotated image, which helps to improve the network training effect.
在一些实现方式中，在执行S104时，可以利用各预设角度分别对应的旋转矩阵，对第一样本图像的第一真实偏移量进行旋转，得到将第一样本图像旋转多种预设角度后分别对应的第二真实偏移量。In some implementations, when S104 is performed, the first real offset of the first sample image can be rotated by using the rotation matrices respectively corresponding to the preset angles, so as to obtain the second real offsets respectively corresponding to the first sample image rotated by the various preset angles.
在得到将第一样本图像旋转多种预设角度后分别对应的第二真实偏移量和第二预测偏移量后,可以执行S106。After the second real offset and the second predicted offset corresponding to the rotation of the first sample image by various preset angles are obtained, S106 may be executed.
在一些实现方式中，在执行S106时，可以利用预设的损失函数(例如交叉熵损失函数)，针对每种预设角度，根据将第一样本图像的第一真实偏移量旋转该预设角度后对应的第二真实偏移量、和获得的与该预设角度对应的第二预测偏移量，得到将第一样本图像旋转该预设角度后对应的偏移量损失信息。然后，基于将第一样本图像旋转多种预设角度后分别对应的偏移量损失信息，利用诸如求和，求积，求平均数等方式，确定总损失，并利用确定的总损失计算下降梯度，通过反向传播调整所述偏移量提取网络的网络参数。In some implementations, when S106 is performed, a preset loss function (for example, a cross-entropy loss function) may be used to obtain, for each preset angle, the offset loss information corresponding to the first sample image rotated by the preset angle, according to the second real offset obtained by rotating the first real offset by the preset angle and the obtained second predicted offset corresponding to the preset angle. Then, based on the offset loss information respectively corresponding to the first sample image rotated by the various preset angles, the total loss is determined by means such as summation, multiplication, or averaging, and the determined total loss is used to compute the descent gradient, so as to adjust the network parameters of the offset extraction network through back-propagation.
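The per-angle loss computation and reduction above can be sketched in a few lines. A squared-error loss is used here as a stand-in (the text names cross-entropy as one example of a preset loss function), and the three reductions the text mentions are shown; none of these specific choices is fixed by the disclosure.

```python
import numpy as np

def offset_losses(pred_offsets, true_offsets):
    # One loss value per preset angle: squared error between the second
    # predicted offset and the second real offset (a stand-in loss).
    return [float(np.sum((np.asarray(p) - np.asarray(t)) ** 2))
            for p, t in zip(pred_offsets, true_offsets)]

def total_loss(losses, reduction="sum"):
    # Combine per-angle losses; the text mentions summation, multiplication,
    # and averaging as possible ways to determine the total loss.
    if reduction == "sum":
        return float(np.sum(losses))
    if reduction == "mean":
        return float(np.mean(losses))
    return float(np.prod(losses))

preds = [[1.0, 0.0], [0.0, 2.0]]   # hypothetical second predicted offsets
gts   = [[1.0, 1.0], [0.0, 0.0]]   # hypothetical second real offsets
losses = offset_losses(preds, gts)
print(losses, total_loss(losses, "sum"))
```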
在所述方案中，由于可以利用偏移量提取网络，获得与多种预设角度分别对应的第二预测偏移量，以及将所述第一真实偏移量分别旋转所述多种预设角度，得到与所述多种预设角度分别对应的第二真实偏移量，然后可以利用与所述多种预设角度分别对应的第二真实偏移量和获得的第二预测偏移量，调整所述偏移量提取网络的网络参数。In this solution, the offset extraction network can be used to obtain the second predicted offsets respectively corresponding to the various preset angles, and the first real offset can be rotated by the various preset angles respectively to obtain the second real offsets respectively corresponding to the various preset angles; the network parameters of the offset extraction network can then be adjusted by using the second real offsets respectively corresponding to the various preset angles and the obtained second predicted offsets.
因此可以利用图像旋转一定角度后,偏移量也会旋转该角度的特性,通过对图像(或其图像特征)和真实偏移量进行旋转,达到扩充具有真实偏移量的样本图像的效果,从而可以利用少量标注了偏移量的标注数据,训练得到高精度偏移量提取网络。Therefore, after the image is rotated by a certain angle, the offset will also rotate the angle. By rotating the image (or its image features) and the real offset, the effect of expanding the sample image with the real offset can be achieved. In this way, a small amount of labeled data with offsets can be used to train a high-precision offset extraction network.
在一些实现方式中，所述偏移量提取网络可能为某一综合网络的一个分支。由此在进行样本图像与真实偏移量旋转的过程中，也会对样本图像中涵盖的其它信息进行旋转，在利用旋转后的样本图像对所述综合网络进行训练时，该综合网络的其它分支需要对旋转后的样本图像的其它信息进行拟合，由此增加了训练时间，降低了训练效率。In some implementations, the offset extraction network may be one branch of an integrated network. In that case, during the rotation of the sample image and the real offset, other information contained in the sample image is also rotated; when the rotated sample image is used to train the integrated network, the other branches of the integrated network need to fit this other information of the rotated sample image, which increases the training time and reduces the training efficiency.
请参见图3,图3为本公开示出的一种建筑物底座提取流程示意图。Please refer to FIG. 3 . FIG. 3 is a schematic diagram of a building foundation extraction process shown in the present disclosure.
如图3所示,将遥感图像输入图3示出的底座区域提取网络后,可以利用屋顶区域提取网络提取建筑物屋顶区域,以及利用偏移量提取网络提取偏移量。然后可以利用该偏移量,对屋顶区域进行变换(例如平移变换),得到底座区域。As shown in Figure 3, after the remote sensing image is input into the base area extraction network shown in Figure 3, the roof area extraction network can be used to extract the roof area of the building, and the offset extraction network can be used to extract the offset. The offset can then be used to transform (for example, translate) the roof area to obtain the base area.
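The final translation step of the base extraction pipeline in FIG. 3 can be sketched as a mask shift. Integer offsets are assumed for simplicity (a real pipeline would need to handle sub-pixel shifts), and the mask and offset values are made up for illustration.

```python
import numpy as np

def translate_mask(mask, dx, dy):
    # Shift a binary roof mask by an integer offset (dx, dy) to obtain the
    # base mask; pixels shifted outside the image are discarded.
    h, w = mask.shape
    out = np.zeros_like(mask)
    ys, xs = np.nonzero(mask)
    ys2, xs2 = ys + dy, xs + dx
    keep = (ys2 >= 0) & (ys2 < h) & (xs2 >= 0) & (xs2 < w)
    out[ys2[keep], xs2[keep]] = 1
    return out

roof = np.zeros((5, 5), dtype=int)
roof[1:3, 1:3] = 1                      # a 2x2 roof region
base = translate_mask(roof, dx=1, dy=2)  # apply the extracted offset
print(base)
```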
此时，偏移量提取网络和屋顶区域提取网络是底座区域提取网络的两个分支。在利用前述方案训练底座区域提取网络时，由于第一样本图像发生旋转，其包含的屋顶区域也会发生相应旋转，因此，在训练该网络时，屋顶区域提取网络（即前述其它分支）也需要重新拟合，导致网络收敛速度下降。At this time, the offset extraction network and the roof area extraction network are two branches of the base area extraction network. When the base area extraction network is trained using the foregoing solution, since the first sample image is rotated, the roof area it contains is also rotated accordingly; therefore, when training this network, the roof area extraction network (i.e., the aforementioned other branch) also needs to be refitted, which slows down network convergence.
为了解决前述痛点，在一些实现方式中，可以将第一样本图像的旋转过程置于偏移量提取网络中，由此可以在偏移量提取网络内部进行图像旋转，不会影响其它分支的训练，即不会影响其它分支的收敛速度，进而提升了网络训练效率。To solve the foregoing pain point, in some implementations, the rotation of the first sample image can be placed inside the offset extraction network, so that image rotation is performed within the offset extraction network without affecting the training of other branches, that is, without affecting the convergence speed of other branches, thereby improving the network training efficiency.
请参见图4,图4为本公开示出的一种偏移量提取流程示意图。Please refer to FIG. 4 , which is a schematic diagram of an offset extraction process shown in the present disclosure.
如图4所示，在执行S102时，可以执行S402，针对所述多种预设角度中的每一预设角度，利用偏移量提取网络，将所述第一样本图像对应的第一图像特征旋转所述预设角度，得到与所述预设角度对应的第二图像特征。然后可以执行S404，基于所述第二图像特征，得到与所述预设角度对应的第二预测偏移量。As shown in FIG. 4, when S102 is performed, S402 may be performed: for each of the various preset angles, rotating, by using the offset extraction network, the first image feature corresponding to the first sample image by the preset angle to obtain a second image feature corresponding to the preset angle. Then S404 may be performed: obtaining, based on the second image feature, a second predicted offset corresponding to the preset angle.
所述第一图像特征,可以是指第一样本图像经过若干卷积层、池化层等的特征提取处理后得到的图像特征。在一些实现方式中,所述偏移量提取网络可以是基于MASK-RCNN构建的网络。所述偏移量提取网络可以通过包括的骨干网络以及RoI Align单元对第一样本图像进行特征提取得到所述第一图像特征。在一些实现方式中,可以通过特征图表征前述图像特征。The first image feature may refer to the image feature obtained after the first sample image undergoes feature extraction processing such as several convolutional layers and pooling layers. In some implementation manners, the offset extraction network may be a network constructed based on MASK-RCNN. The offset extraction network can perform feature extraction on the first sample image through the included backbone network and the RoI Align unit to obtain the first image features. In some implementation manners, the aforementioned image features may be characterized by a feature map.
在一些实现方式中，在执行S402时，可以通过多种预设角度分别对应的旋转矩阵对第一图像特征中的各像素点进行位置变换，得到与多种预设角度分别对应的第二图像特征。然后在执行S404时，可以通过诸如若干卷积层，池化层，全连接层以及映射单元(例如，softmax(柔性最大值传输函数))对第二图像特征进行处理，得到针对偏移量的提取结果，即第二预测偏移量。In some implementations, when S402 is performed, the positions of the pixels in the first image feature can be transformed by using the rotation matrices respectively corresponding to the various preset angles, to obtain the second image features respectively corresponding to the various preset angles. Then, when S404 is performed, the second image feature can be processed by, for example, several convolutional layers, pooling layers, fully connected layers, and a mapping unit (for example, softmax) to obtain the offset extraction result, i.e., the second predicted offset.
在一些实现方式中,第一样本图像只会在偏移量提取网络内进行旋转,对于屋顶区域提取网络则仍然利用未旋转的第一样本图像进行训练。由此,即可在偏移量提取网络内对第一样本图像进行旋转变化,从而不会影响其它分支的训练。In some implementations, the first sample image is only rotated within the offset extraction network, and the roof region extraction network is still trained with the unrotated first sample image. In this way, the rotation of the first sample image can be changed in the offset extraction network, so as not to affect the training of other branches.
在一些实现方式中，为了便于对偏移量提取网络进行训练，可以利用空间变换网络进行图像旋转，从而使旋转过程变得可导，使梯度可以正常反向传播，进而可以直接对网络进行训练。In some implementations, to facilitate training of the offset extraction network, a spatial transformer network can be used to perform the image rotation, which makes the rotation process differentiable so that gradients can be back-propagated normally, and the network can thus be trained directly.
请参见图5,图5为本公开示出的一种利用空间变换网络进行图像旋转的流程示意图。Please refer to FIG. 5 . FIG. 5 is a schematic flowchart of image rotation using a spatial transformation network shown in the present disclosure.
图5示出的空间变换网络(Spatial Transformer Network,STN)50可以包括旋转角生成网络51,采样网格52以及采样器53。The spatial transformation network (Spatial Transformer Network, STN) 50 shown in FIG. 5 may include a rotation angle generation network 51, a sampling grid 52 and a sampler 53.
其中,所述旋转角生成网络51可以用于通过自监督方式进行训练,在完成训练后,可以用于生成旋转角θ。在本例中,由于旋转角为指定的预设角度,因此并未使用旋转角生成网络51生成旋转角,而是直接指定旋转角θ。Wherein, the rotation angle generation network 51 can be used for training in a self-supervised manner, and can be used to generate the rotation angle θ after the training is completed. In this example, since the rotation angle is a specified preset angle, the rotation angle generation network 51 is not used to generate the rotation angle, but the rotation angle θ is directly specified.
所述采样网格52,可以根据旋转角,确定第二图像特征V中的像素点和第一图像特征U中各像素点之间的对应关系T θ(G)。 The sampling grid 52 can determine the corresponding relationship T θ (G) between the pixels in the second image feature V and the pixels in the first image feature U according to the rotation angle.
所述采样器53，可以分别针对第二图像特征V中各像素点，根据采样网格52表征的像素点对应关系，确定第一图像特征U中，与所述像素点对应的多个像素点，并基于插值方式对所述多个像素点的像素值进行映射，得到所述像素点对应的像素值，以完成图像特征旋转。所述插值方式可以包括多项式插值，线性插值，双线性插值等方式。在本例中，由于采用了插值方式进行图像旋转，从而使得图像旋转过程变得可导，使梯度可以正常反向传播，进而可以直接对网络进行训练。The sampler 53 can, for each pixel in the second image feature V, determine, according to the pixel correspondence characterized by the sampling grid 52, a plurality of pixels in the first image feature U corresponding to that pixel, and map the pixel values of the plurality of pixels based on interpolation to obtain the pixel value of that pixel, so as to complete the rotation of the image feature. The interpolation may include polynomial interpolation, linear interpolation, bilinear interpolation, and the like. In this example, since interpolation is used for the image rotation, the rotation process becomes differentiable, so that gradients can be back-propagated normally, and the network can thus be trained directly.
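The sampling-grid-plus-bilinear-sampler mechanism above can be sketched in plain numpy: each output pixel is inverse-mapped into the input feature map and its value is a bilinear blend of the four neighbouring input pixels. This is only an illustrative sketch (in a real STN these are differentiable tensor ops, e.g. affine-grid and grid-sample primitives), and the rotation direction here follows array row/column indexing rather than a particular image-axis convention.

```python
import numpy as np

def rotate_feature_map(feat, theta_deg):
    # Rotate an HxW feature map about its centre by bilinear resampling,
    # mirroring what the sampling grid + sampler of an STN do.
    h, w = feat.shape
    t = np.deg2rad(theta_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    out = np.zeros_like(feat, dtype=float)
    for y in range(h):
        for x in range(w):
            # Inverse-map each output pixel into the input feature map U.
            xs = np.cos(t) * (x - cx) + np.sin(t) * (y - cy) + cx
            ys = -np.sin(t) * (x - cx) + np.cos(t) * (y - cy) + cy
            x0, y0 = int(np.floor(xs)), int(np.floor(ys))
            # Blend the four neighbouring input pixels by bilinear weights.
            for dy in (0, 1):
                for dx in (0, 1):
                    xi, yi = x0 + dx, y0 + dy
                    if 0 <= xi < w and 0 <= yi < h:
                        wgt = (1 - abs(xs - xi)) * (1 - abs(ys - yi))
                        out[y, x] += wgt * feat[yi, xi]
    return out

feat = np.arange(9, dtype=float).reshape(3, 3)
print(rotate_feature_map(feat, 90))
```

Because the output is a smooth (piecewise-linear) function of the sampling coordinates, gradients can flow through the rotation, which is what makes the STN-based rotation trainable end to end.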
在执行S402时，可以针对所述多种预设角度中的每一预设角度，利用所述偏移量提取网络包括的与所述预设角度对应的空间变换网络，将所述第一图像特征旋转所述预设角度，得到与所述预设角度对应的第二图像特征。When S402 is performed, for each of the various preset angles, the spatial transformer network corresponding to the preset angle included in the offset extraction network may be used to rotate the first image feature by the preset angle, to obtain the second image feature corresponding to the preset angle.
在一些实现方式中,可以在偏移量提取网络中部署与不同预设角度分别对应的空间变换网络(以下简称STN),并指定各STN对应的旋转角θ。其中,各STN的输入可以为对第一样本图像进行特征提取得到的第一图像特征。在STN中,可以利用所述采样网格,分别针对第二图像特征的各像素点,确定所述第一图像特征中,与所述像素点对应的多个像素点,并通过所述采样器,基于插值方式对所述多个像素点的像素值进行映射,得到所述像素点对应的像素值。In some implementation manners, space transformation networks (hereinafter referred to as STNs) respectively corresponding to different preset angles may be deployed in the offset extraction network, and the rotation angle θ corresponding to each STN may be specified. Wherein, the input of each STN may be the first image feature obtained by performing feature extraction on the first sample image. In the STN, the sampling grid can be used to determine a plurality of pixel points corresponding to the pixel points in the first image feature for each pixel point of the second image feature, and through the sampler , mapping the pixel values of the plurality of pixel points based on an interpolation manner to obtain the pixel values corresponding to the pixel points.
经过STN旋转处理后,可以得到将第一样本图像旋转多种预设角度后分别对应的第二图像特征。After the STN rotation processing, the second image features respectively corresponding to the first sample image rotated by various preset angles can be obtained.
然后可以通过S404,提取出将第一样本图像旋转多种预设角度后分别对应的第二预测偏移量。Then, through S404, the second predicted offsets corresponding to the rotations of the first sample image by various preset angles can be extracted.
请参见图6,图6为本公开示出的一种偏移量提取网络训练流程示意图。Please refer to FIG. 6 , which is a schematic diagram of an offset extraction network training process shown in the present disclosure.
图6示出的偏移量提取网络可以包括特征提取单元与偏移量扩充单元。The offset extraction network shown in FIG. 6 may include a feature extraction unit and an offset expansion unit.
其中,特征提取单元可以包括骨干网络与RoI Align单元(图6未示出),用于提取出第一样本图像中的建筑物图像特征,即第一图像特征F0。Wherein, the feature extraction unit may include a backbone network and a RoI Align unit (not shown in FIG. 6 ), for extracting building image features in the first sample image, that is, the first image feature F0.
在本例中，偏移量扩充单元可以包括4条STN分支。如图6所示，偏移量扩充单元可以利用STN分别将第一图像特征F0旋转0度，90度，180度以及270度，得到对应的第二图像特征F1-F4。偏移量扩充单元还可以利用分类器对第二图像特征F1-F4进行分类，得到将第一样本图像旋转0度，90度，180度以及270度后分别对应的第二预测偏移量。所述分类器可以包括多个卷积层、全连接层和映射单元。在一些实现方式中，为了简化网络结构，可以将多个分类器中的至少部分卷积层、全连接层进行参数共享。需要说明的是，第一样本图像的旋转角度可以包括但不限于前述例举的几种情况，在此对于旋转角度的间隔、旋转次数等不予限定，可以基于所需样本图像的数量等因素动态调整。In this example, the offset expansion unit may include 4 STN branches. As shown in FIG. 6, the offset expansion unit can use the STNs to rotate the first image feature F0 by 0, 90, 180, and 270 degrees respectively, to obtain the corresponding second image features F1-F4. The offset expansion unit can also use classifiers to classify the second image features F1-F4, to obtain the second predicted offsets respectively corresponding to the first sample image rotated by 0, 90, 180, and 270 degrees. The classifier may include multiple convolutional layers, fully connected layers, and a mapping unit. In some implementations, to simplify the network structure, at least some of the convolutional layers and fully connected layers of the multiple classifiers may share parameters. It should be noted that the rotation angles of the first sample image may include but are not limited to the cases listed above; the interval between rotation angles, the number of rotations, and the like are not limited here and may be dynamically adjusted based on factors such as the number of required sample images.
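The four-branch expansion with a shared classifier head can be sketched as follows. Since the preset angles here are multiples of 90 degrees, exact 90-degree array rotations stand in for the STN branches; the feature map, head weights, and feature size are random stand-ins, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared classifier parameters (one weight matrix used by all four branches),
# mirroring the parameter sharing mentioned for the classifiers. These are
# hypothetical untrained values for illustration.
W = rng.normal(size=(2, 16))   # maps a flattened 4x4 feature to an (x, y) offset
b = np.zeros(2)

def predict_offset(feat):
    return W @ feat.ravel() + b

F0 = rng.normal(size=(4, 4))   # stand-in for the first image feature
# The four branches: F0 rotated by 0, 90, 180, and 270 degrees. Exact rot90
# replaces the STN sampler since these angles are multiples of 90 degrees.
branches = [np.rot90(F0, -k) for k in range(4)]  # F1-F4
second_pred_offsets = [predict_offset(f) for f in branches]
print(len(second_pred_offsets), second_pred_offsets[0].shape)
```

One (x, y) prediction per branch then pairs with the correspondingly rotated ground-truth offset to give the per-angle losses L1-L4.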
As shown in FIG. 6, during training, the rotation matrices corresponding to 0, 90, 180, and 270 degrees may be used to rotate the first real offset of the first sample image, obtaining multiple second real offsets.
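The rotation of a real offset can be sketched as a standard 2-D rotation-matrix multiplication. This is an illustrative assumption: the offset is taken here to be a (dx, dy) vector, which the disclosure does not spell out in this form.

```python
import math

def rotate_offset(offset, angle_deg):
    """Rotate a 2-D (dx, dy) offset by angle_deg using the rotation
    matrix [[cos, -sin], [sin, cos]] to obtain a second real offset."""
    theta = math.radians(angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    dx, dy = offset
    return (c * dx - s * dy, s * dx + c * dy)

first_real_offset = (3.0, 4.0)  # hypothetical annotated roof-to-base offset
second_real_offsets = [rotate_offset(first_real_offset, a)
                       for a in (0, 90, 180, 270)]
```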
As shown in FIG. 6, during training, the offset loss information L1-L4 corresponding to rotating the first sample image by the various preset angles may be obtained from the second real offsets and second predicted offsets corresponding to those angles. A total loss is then determined based on the sum of the offset loss information L1-L4, and the network parameters of the offset extraction network are adjusted according to the total loss. In some implementations, the total loss may also be determined by taking the product or the average of the offset loss information L1-L4, which is not particularly limited here.
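The per-angle losses and the summed total loss can be sketched as follows. The choice of an L1 distance between predicted and real offsets is an assumption for illustration; the disclosure leaves the preset loss function open.

```python
def offset_loss(pred, real):
    """L1 loss between a predicted and a real (dx, dy) offset."""
    return abs(pred[0] - real[0]) + abs(pred[1] - real[1])

def total_loss(preds, reals):
    """Sum the per-angle offset losses L1-L4 to get the total loss."""
    return sum(offset_loss(p, r) for p, r in zip(preds, reals))

reals = [(1.0, 2.0), (-2.0, 1.0), (-1.0, -2.0), (2.0, -1.0)]
preds = [(1.0, 2.0), (-2.0, 1.0), (-1.0, -2.0), (2.0, -1.0)]
loss = total_loss(preds, reals)  # 0.0 when predictions match exactly
```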
In the training process shown in FIG. 6: first, rotating the first sample image and the first real offset expands the set of sample images that carry real offsets, so a high-precision offset extraction network can be trained from a small amount of offset-annotated data. Second, the rotation of the first sample image is performed inside the offset extraction network, so it does not affect the training of the other branches, that is, it does not affect their convergence speed, which improves network training efficiency. Third, the STN makes the rotation differentiable, so gradients can be back-propagated normally and the network can be trained directly.
Please continue to refer to FIG. 3. When training the base region extraction network shown in FIG. 3, the high cost of sample annotation makes it impossible to obtain a large amount of annotated sample data that includes both real offsets and real roof region information, and a small amount of annotated sample data cannot train a high-precision base region extraction network.
The base region extraction network is used to extract the base region based on the obtained roof region and offset. The roof region extraction network included in the base region extraction network shares a feature extraction network with the offset extraction network. The feature extraction network may include a backbone network and an RoI Align unit.
In some implementations, the fact that the base region of a given building does not change can be exploited to share base region ground-truth information among the multiple frames of sample images corresponding to the same region, thereby expanding the training sample set. This helps train a high-precision building base region extraction network from a small amount of annotated sample data.
A method for generating the training sample set used to train the base region extraction network may include: S302, for each of multiple regions, acquiring one or more frames of original sample images corresponding to the region, where, in the case that the region corresponds to multiple frames of original sample images, at least two of those frames have different acquisition angles.
The original sample images may be acquired by any image acquisition device capable of capturing images of the multiple regions. Among the multiple frames of original sample images acquired for the same region, at least two frames have different acquisition angles, which enriches the information contained in the training samples and improves the adaptability of the neural network.
The original sample images may be stored in a storage medium, classified by region. The device may acquire the original sample images from the storage medium.
In some implementations, the original sample images may include multi-temporal images acquired for the multiple regions. Multi-temporal images refer to multiple frames of remote sensing images acquired for the same area at different times.
S304 may then be performed: one frame of the original sample images corresponding to the region is used as the first sample image corresponding to the region and is annotated with base region ground-truth information.
The original sample image used as the first sample image may be any image of acceptable sharpness selected from the one or more frames of original sample images corresponding to the region.
In some implementations, at least one frame may be selected from the original sample images corresponding to each region, and its base region ground-truth information is then annotated in advance.
The base region ground-truth information may be pixel-level ground truth. For example, the value of each pixel inside the building base region in the remote sensing sample image may be set to 1, and the value of each pixel outside the base region may be set to 0.
S306 may then be performed: for each region, the base region ground-truth information annotated on the first sample image corresponding to the region is determined as the base region ground-truth information of every frame of original sample image corresponding to that region, and a training sample set is obtained based on the original sample images and the first sample images respectively corresponding to the multiple regions.
In some implementations, the base region ground-truth information annotated for the first sample image of each region in S304 may be used as the annotation information of every original sample image of that region, thereby expanding the training samples.
Since the building bases in a given region do not change, after image registration is performed on the original sample images acquired for the same region, the base regions and positions of the buildings are identical across those images. That is, annotating the base region ground truth on any one frame of original sample image of a region, which then serves as the first sample image of that region, can be regarded as annotating the base region ground truth on every frame of original sample image of that region. The samples are thereby expanded: a large number of training samples are obtained with a small number of annotation operations.
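The label-sharing step above can be sketched as follows. This is a simplified illustration: the images are assumed to be already registered, and the per-region data layout (one annotated mask plus a list of frames) is a hypothetical choice for the example.

```python
def expand_training_set(regions):
    """regions maps a region id to (annotated_mask, [frame1, frame2, ...]).
    Every frame of a region reuses the one manually annotated
    base-region mask as its ground truth."""
    samples = []
    for region_id, (mask, frames) in regions.items():
        for frame in frames:
            samples.append((frame, mask))  # same ground truth for every frame
    return samples

regions = {"r1": ("mask_r1", ["img_a", "img_b", "img_c"]),
           "r2": ("mask_r2", ["img_d"])}
training_set = expand_training_set(regions)  # 4 samples from 2 annotations
```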
In some implementations, the expanded training samples can be used for supervised training of the base region extraction network, which helps train a high-precision building base region extraction network from a small amount of annotated sample data. In some implementations, the training of the aforementioned offset extraction network may also be combined with the training of the roof region extraction network to jointly train the base region extraction network, so that a high-precision base region extraction network is trained with a small amount of annotated data.
The first sample image is also annotated with real roof region information.
When training the base region extraction network, on the one hand, the offset extraction network may be trained according to the offset extraction network training method shown in any of the foregoing implementations. On the other hand, the roof region extraction network may be used to obtain roof region prediction information from the first sample image, and the roof region extraction network is then trained based on the real roof region information and the obtained roof region prediction information. In some implementations, loss information may be determined from the real roof region information and the obtained roof region prediction information according to a preset loss function, and the network parameters are then adjusted by back-propagation according to the loss information.
Thus, first, when training the offset extraction network, the sample size can be expanded by rotating the sample images and the real offsets, achieving the effect of training a high-precision offset extraction network with a small amount of annotated data. Second, when training the base region extraction network, the roof region extraction network and the offset extraction network, which share a feature extraction network, can be jointly trained. This introduces learning information from multiple aspects, so the training of the branches both constrains and reinforces each other. On the one hand, this improves network training efficiency and allows a high-precision base region extraction network to be trained with a small amount of annotated data; on the other hand, it encourages the shared feature extraction network to extract features that are more useful for base region extraction, improving base region extraction accuracy.
In some implementations, building bounding box information may also be introduced during network training to impose additional constraints on the training, which improves training efficiency and helps the feature extraction network extract building-related features.
The first sample image is also annotated with real building bounding box information. The real building bounding box information may include the coordinates of the center pixel of the building region and the width and height of the building region.
The base region extraction network further includes a building bounding box extraction network, which includes the feature extraction network.
When training the base region extraction network, the building bounding box extraction network may be used to extract building bounding box prediction information from the first sample image, and the building bounding box extraction network is then trained based on the real building bounding box information and the obtained building bounding box prediction information.
Thus, during network training: first, when training the offset extraction network, the sample size can be expanded by rotating the sample images and the real offsets, so a high-precision offset extraction network can be trained with a small amount of annotated data. Second, when training the base region extraction network, building bounding box information is introduced; since the roof region, offset, and building bounding box extraction networks share the feature extraction network, the three extraction networks are linked. Through the shared feature extraction network, the supervision information of each task can be shared, which accelerates network convergence and allows a high-precision base region extraction network to be trained with a small amount of annotated data. In addition, the roof region and offset extraction networks can perceive the features of the complete building region, further improving extraction performance.
An embodiment is described below in combination with a training scenario.
Referring to FIG. 7, FIG. 7 is a schematic diagram of a building base extraction process according to the present disclosure. The training method in this example can be deployed on any type of electronic device.
The base region extraction network shown in FIG. 7 is built on Mask R-CNN. The network may include three branches that extract the roof region, the offset, and the building bounding box, respectively. The three branches share a backbone network, an RPN (Region Proposal Network) candidate box generation network (hereinafter RPN), and an RoI Align region feature extraction unit (hereinafter RoI Align). The backbone network may be a VGG (Visual Geometry Group) network, a ResNet (Residual Network), an HRNet (High-Resolution Network), or the like, which is not particularly limited in the present disclosure.
The offset extraction branch may include the offset expansion unit shown in FIG. 6. The base region can be obtained by translating the roof region by the obtained offset.
Before training the network, several first sample images annotated with the first real offset, the real roof region information, and the real building bounding box information can be acquired.
Then, according to the number of training iterations, multiple rounds of the following steps can be performed to complete the network training:
S71, input the first sample images into the base region extraction network.
The backbone network and the RoI Align unit included in the base region extraction network can be used to perform feature extraction on each first sample image, obtaining the first image feature corresponding to each first sample image.
Then, in the offset extraction branch, the STNs corresponding to 0, 90, 180, and 270 degrees rotate the first image feature of each first sample image, and offset extraction is performed, obtaining the second predicted offsets corresponding to rotating each first sample image by 0, 90, 180, and 270 degrees, respectively.
In the roof region extraction branch and the building bounding box extraction branch, the roof region prediction information and the building bounding box prediction information corresponding to each first sample image are obtained.
S72 can then be performed: jointly train the three branches using the real information.
When training the offset extraction branch, the rotation matrices corresponding to 0, 90, 180, and 270 degrees can be used to rotate the first real offset of each first sample image, obtaining multiple second real offsets. Then, according to a preset loss function, the loss information for this round of training is computed from the second real offsets and the obtained second predicted offsets corresponding to rotating each first sample image by 0, 90, 180, and 270 degrees. The descent gradient can then be determined, and the network parameters of the offset extraction branch are adjusted by back-propagation.
When training the roof region extraction branch and the building bounding box extraction branch, conventional training methods can be used: the two branches are trained with the real roof region information and the real building bounding box information, respectively.
In this scheme: first, the sample size can be expanded by rotating the sample images and the real offsets, so a high-precision offset extraction branch can be trained with a small amount of annotated data. Second, the rotation of the image features is performed inside the offset extraction branch and does not affect the training of the other branches, improving network training efficiency. Third, joint training lets the network learn information from multiple aspects; the training of the branches supervises and reinforces each other, improving training efficiency and allowing a high-precision base region extraction network to be trained with a small amount of annotated data. Fourth, the shared feature extraction networks, such as the backbone network, can extract features that are more useful for base region extraction, improving base region extraction accuracy.
The present disclosure further proposes an image processing method. The method extracts the second offsets corresponding to rotating a first target image to be processed by various preset angles, then inversely transforms the multiple second offsets and fuses them to obtain a more robust and accurate offset.
Please refer to FIG. 8, which is a flowchart of an image processing method according to the present disclosure. As shown in FIG. 8, the method may include:
S802, acquiring a first target image to be processed;
S804, using the offset extraction network to obtain, from multiple second target images, second offsets respectively corresponding to various preset angles;
where the offset extraction network is trained by the neural network training method shown in any of the foregoing implementations; a second offset indicates the offset between a roof and a base in a second target image; and the multiple second target images are obtained by rotating the first target image by the various preset angles, respectively.
S806, for each of the various preset angles, inversely rotating the second offset corresponding to the angle to obtain the inverse second offset corresponding to the angle;
S808, fusing the inverse second offsets respectively corresponding to the various preset angles to obtain the first offset corresponding to the first target image.
The method can be applied to any type of electronic device.
Take, as an example, an offset extraction network that includes the offset expansion unit shown in FIG. 6.
When S804 is performed, the STNs corresponding to 0, 90, 180, and 270 degrees can be used to rotate the first image feature of the first target image, and the classifiers are used to obtain the second offsets corresponding to rotating the first target image by 0, 90, 180, and 270 degrees, respectively.
Then, when S806 is performed, the rotation matrices corresponding to 0, 90, 180, and 270 degrees can be used to inversely rotate each second offset obtained in S804, obtaining the multiple inverse second offsets corresponding to the unrotated first target image. It should be noted that the inverse rotation is a rotation in the direction opposite to the rotation shown in S804; for example, if S804 rotates clockwise, the inverse rotation is counterclockwise.
Afterwards, when S808 is performed, the multiple inverse second offsets may be fused by summation, product, averaging, or the like, obtaining the first offset corresponding to the first target image. Thus, when extracting the offset of the first target image, the second offsets obtained by rotating the first target image by various angles are fused, making the obtained first offset more robust and more accurate.
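S806-S808 can be sketched as follows. This is a minimal illustration under two assumptions the disclosure leaves open: offsets are (dx, dy) vectors, and averaging is chosen as the fusion; inverse rotation is implemented by applying the rotation matrix of the negated angle.

```python
import math

def rotate_offset(offset, angle_deg):
    """Rotate a 2-D (dx, dy) offset by angle_deg."""
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    return (c * offset[0] - s * offset[1], s * offset[0] + c * offset[1])

def fuse_offsets(second_offsets, angles):
    """Inverse-rotate each angle's second offset back to the unrotated
    frame (S806), then fuse by averaging (S808) to get the first offset."""
    inv = [rotate_offset(o, -a) for o, a in zip(second_offsets, angles)]
    n = len(inv)
    return (sum(o[0] for o in inv) / n, sum(o[1] for o in inv) / n)

angles = (0, 90, 180, 270)
# Second offsets as they would appear if the true offset were (3, 4):
second = [rotate_offset((3.0, 4.0), a) for a in angles]
first_offset = fuse_offsets(second, angles)  # close to (3.0, 4.0)
```

Averaging makes the fused estimate robust: an error in any single angle's prediction is attenuated by the other three.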
In some implementations, base extraction may also be performed on the first target image. The method further includes:
S810, using a roof region extraction network included in a base region extraction network to obtain the roof region in the first target image.
S812, using the first offset corresponding to the first target image to translate the obtained roof region, obtaining the base region corresponding to the first target image.
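S812 can be sketched as shifting a binary roof mask by the offset, rounded to whole pixels. This is illustrative only: a real implementation may interpolate sub-pixel shifts, and the (dx, dy) convention (dx along columns, dy along rows) is an assumption of the example.

```python
import numpy as np

def translate_mask(roof_mask, offset):
    """Shift a binary roof mask by (dx, dy) pixels to obtain the
    base-region mask; areas shifted in from outside the image are 0."""
    dy, dx = int(round(offset[1])), int(round(offset[0]))
    base = np.zeros_like(roof_mask)
    h, w = roof_mask.shape
    # Crop the source so the shifted region stays inside the image.
    src = roof_mask[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    base[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)] = src
    return base

roof = np.zeros((5, 5), dtype=int)
roof[1, 1] = 1                          # a one-pixel "roof"
base = translate_mask(roof, (2.0, 1.0))  # shift 2 columns right, 1 row down
```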
The base region extraction network includes the offset extraction network in addition to the roof region extraction network. In some implementations, to improve base region extraction accuracy, the base region extraction network may further include a building bounding box extraction network. The base region extraction network can be trained with the neural network training method shown in the foregoing implementations.
In the embodiments of the present disclosure: first, since the base region extraction network is trained with a small amount of annotated sample data as in the above embodiments, network training cost can be reduced and training efficiency improved, which in turn reduces base extraction cost. Second, since a high-precision base region extraction network is used for base extraction, building base extraction accuracy can be improved, which in turn improves the accuracy of statistics about buildings. Third, the distinctive roof region and offset features can be exploited to obtain the building base indirectly, which helps obtain an accurate base.
Corresponding to any of the foregoing implementations, the present disclosure further proposes a neural network training apparatus 90.
Please refer to FIG. 9, which is a schematic structural diagram of a neural network training apparatus according to the present disclosure.
As shown in FIG. 9, the apparatus 90 may include:
an offset obtaining module 91, configured to use an offset extraction network to obtain, from multiple second sample images, second predicted offsets respectively corresponding to various preset angles, where a second predicted offset indicates the offset between a roof and a base in a second sample image, the multiple second sample images are obtained by rotating a first sample image by the various preset angles, respectively, and the first sample image is annotated with a first real offset;
a rotation module 92, configured to rotate the first real offset by the various preset angles, obtaining second real offsets respectively corresponding to the various preset angles; and
an adjustment module 93, configured to adjust network parameters of the offset extraction network based on the second real offsets and the second predicted offsets respectively corresponding to the various preset angles.
In some illustrated implementations, the obtaining module is configured to: for each of the various preset angles, use the offset extraction network to rotate the first image feature corresponding to the first sample image by the preset angle, obtaining a second image feature corresponding to the preset angle; and obtain, based on the second image feature, a second predicted offset corresponding to the preset angle.
In some illustrated implementations, the obtaining module is configured to: for each of the various preset angles, use the spatial transformer network that is included in the offset extraction network and corresponds to the preset angle to rotate the first image feature by the preset angle, obtaining a second image feature corresponding to the preset angle.
In some illustrated implementations, the spatial transformer network includes a sampler that performs image rotation by interpolation, where the sampler includes a sampling grid determined based on the preset angle corresponding to the spatial transformer network, and the sampling grid characterizes the pixel correspondence between the first image feature and the second image feature. In this case, the obtaining module is configured to: through the sampler, use the sampling grid to determine the multiple pixels in the first image feature that correspond to each pixel in the second image feature, and map the pixel values of those multiple pixels by interpolation, obtaining the pixel value of each pixel in the second image feature.
In some illustrated implementations, the adjustment module 93 is configured to: obtain, from the second real offsets and the second predicted offsets respectively corresponding to the various preset angles, offset loss information respectively corresponding to the various preset angles; and adjust network parameters of the offset extraction network based on the offset loss information respectively corresponding to the various preset angles.
In some illustrated implementations, the first sample image is also annotated with real roof region information, and the apparatus 90 further includes: a roof region obtaining module, configured to use a roof region extraction network to obtain roof region prediction information from the first sample image, where the roof region extraction network and the offset extraction network share a feature extraction network and belong to the same base region extraction network, and the base region extraction network is used to obtain the base region based on the obtained roof region and offset; and a first training module, configured to train the roof region extraction network based on the real roof region information and the obtained roof region prediction information.
In some illustrated implementations, the base region extraction network includes a building bounding box extraction network, the building bounding box extraction network includes the feature extraction network, and the first sample image is also annotated with real building bounding box information; the apparatus 90 further includes: a building bounding box obtaining module, configured to use the building bounding box extraction network to obtain building bounding box prediction information from the first sample image; and a second training module, configured to train the building bounding box extraction network based on the real building bounding box information and the obtained building bounding box prediction information.
In some illustrated implementations, the apparatus 90 further includes a sample expansion module, configured to: for each of multiple regions, acquire one or more frames of original sample images corresponding to the region, where, in the case that the region corresponds to multiple frames of original sample images, at least two of those frames have different acquisition angles; use one frame of the original sample images corresponding to the region as the first sample image corresponding to the region and annotate it with base region ground-truth information; determine the base region ground-truth information annotated on the first sample image corresponding to the region as the base region ground-truth information of every frame of original sample image corresponding to the region; and obtain a training sample set based on the original sample images and the first sample images respectively corresponding to the multiple regions.
Corresponding to any of the foregoing implementations, the present disclosure further provides an image processing apparatus. The apparatus may include: an acquisition module, configured to acquire a first target image to be processed; an offset obtaining module, configured to obtain, from multiple second target images by using an offset extraction network, second offsets respectively corresponding to multiple preset angles, wherein the offset extraction network includes a network trained by the neural network training method described in any of the foregoing implementations, each second offset indicates the offset between the roof and the base in the corresponding second target image, and the multiple second target images are obtained by rotating the first target image by the multiple preset angles respectively; an inverse rotation module, configured to, for each of the multiple preset angles, inversely rotate the second offset corresponding to the angle to obtain an inverse second offset corresponding to the angle; and a fusion module, configured to fuse the inverse second offsets respectively corresponding to the multiple preset angles to obtain a first offset corresponding to the first target image.
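The rotate, predict, inversely rotate, and fuse pipeline these modules describe can be sketched as follows. This is an illustrative sketch only: `offset_net`, the multiple-of-90° preset angles, and mean fusion are assumptions for the example, not the disclosed implementation.

```python
import numpy as np

def rotate_offset(offset, angle_deg):
    """Rotate a 2-D (dx, dy) offset vector by angle_deg degrees."""
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return rot @ np.asarray(offset, dtype=float)

def predict_fused_offset(image, offset_net, preset_angles=(0, 90, 180, 270)):
    """Rotate the image by each preset angle, predict the roof-to-base offset
    on each rotated copy (the "second offsets"), rotate each prediction back
    by the opposite angle, and fuse the inverse offsets by averaging."""
    inverse_offsets = []
    for angle in preset_angles:
        rotated = np.rot90(image, k=angle // 90)  # a "second target image"
        second_offset = offset_net(rotated)       # (dx, dy) prediction
        inverse_offsets.append(rotate_offset(second_offset, -angle))
    return np.mean(inverse_offsets, axis=0)       # the "first offset"
```

With `preset_angles` limited to multiples of 90°, `np.rot90` rotates losslessly; arbitrary preset angles would need an interpolating rotation of the image instead.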
In some illustrated embodiments, the apparatus further includes: a roof area obtaining module, configured to obtain the roof area in the first target image by using a roof area extraction network included in a base area extraction network, wherein the base area extraction network further includes the offset extraction network and is trained by the neural network training method described in the foregoing implementations; and a translation module, configured to perform translation transformation on the obtained roof area by using the first offset corresponding to the first target image, to obtain the base area corresponding to the first target image.
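The translation transformation can be illustrated with a short mask-shifting sketch. The integer rounding of the offset and the out-of-bounds handling are assumptions made for this example, not part of the disclosure.

```python
import numpy as np

def translate_roof_to_base(roof_mask, first_offset):
    """Shift a binary (H, W) roof mask by a (dx, dy) offset to obtain the
    base mask; pixels shifted outside the image are dropped."""
    dx = int(round(first_offset[0]))
    dy = int(round(first_offset[1]))
    h, w = roof_mask.shape
    base = np.zeros_like(roof_mask)
    ys, xs = np.nonzero(roof_mask)        # roof pixel coordinates
    ys2, xs2 = ys + dy, xs + dx           # translated coordinates
    keep = (ys2 >= 0) & (ys2 < h) & (xs2 >= 0) & (xs2 < w)
    base[ys2[keep], xs2[keep]] = 1
    return base
```

The same translation applies equally to a polygonal roof representation, where each vertex is simply offset by `(dx, dy)`.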
The embodiments of the neural network training apparatus and/or the image processing apparatus shown in the present disclosure can be applied to an electronic device. Accordingly, the present disclosure provides an electronic device, which may include a processor and a memory for storing processor-executable instructions. The processor is configured to invoke the executable instructions stored in the memory to implement the aforementioned neural network training method and/or image processing method.
Please refer to FIG. 10, which is a schematic diagram of the hardware structure of an electronic device shown in the present disclosure.
As shown in FIG. 10, the electronic device may include a processor for executing instructions, a network interface for network connection, a memory for storing operating data for the processor, and a non-volatile memory for storing the instructions corresponding to the neural network training apparatus and/or the image processing apparatus.
The apparatus embodiments may be implemented by software, by hardware, or by a combination of software and hardware. Taking software implementation as an example, the apparatus, as a logical entity, is formed by the processor of the electronic device in which it is located reading the corresponding computer program instructions from the non-volatile memory into the memory and running them. At the hardware level, in addition to the processor, memory, network interface, and non-volatile memory shown in FIG. 10, the electronic device in which the apparatus of an embodiment is located may also include other hardware according to the actual function of that electronic device, which is not described in detail here.
It can be understood that, to increase processing speed, the instructions corresponding to the apparatus may also be stored directly in the memory, which is not limited herein.
The present disclosure provides a computer-readable storage medium storing a computer program, and the computer program can be used to cause a processor to execute the aforementioned neural network training method and/or image processing method.
Those skilled in the art should understand that one or more embodiments of the present disclosure may be provided as a method, a system, or a computer program product. Accordingly, one or more embodiments of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present disclosure may take the form of a computer program product implemented on one or more computer-usable storage media (which may include, but are not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.
"And/or" in the present disclosure means at least one of the two; for example, "A and/or B" may cover three cases: A alone, B alone, and both A and B.
The embodiments in the present disclosure are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the data processing device embodiment is substantially similar to the method embodiment, its description is relatively brief, and for relevant parts reference may be made to the description of the method embodiment.
Specific embodiments of the present disclosure have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. Multitasking and parallel processing are also possible, or may be advantageous, in certain implementations.
Embodiments of the subject matter and the functional operations described in the present disclosure can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware that may include the structures disclosed in the present disclosure and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in the present disclosure can be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus. Alternatively or additionally, the program instructions can be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random- or serial-access memory device, or a combination of one or more of them.
The processes and logic flows described in the present disclosure can be performed by one or more programmable computers executing one or more computer programs to perform the corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by special-purpose logic circuitry, such as an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit), and the apparatus can also be implemented as special-purpose logic circuitry.
Computers suitable for the execution of a computer program may include, for example, general-purpose and/or special-purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random-access memory. The basic components of a computer may include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, to receive data from them, transfer data to them, or both. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data may include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry.
Although the present disclosure contains many specific implementation details, these should not be construed as limiting the scope of any disclosure or of what may be claimed, but rather as primarily describing features of specific embodiments of particular disclosures. Certain features that are described in multiple embodiments within the present disclosure can also be implemented in combination in a single embodiment. Conversely, various features that are described in a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be removed from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the described embodiments should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the appended claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above are merely preferred embodiments of one or more embodiments of the present disclosure and are not intended to limit one or more embodiments of the present disclosure. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of one or more embodiments of the present disclosure shall be included within the scope of protection of one or more embodiments of the present disclosure.

Claims (15)

  1. A neural network training method, comprising:
    obtaining, from a plurality of second sample images by using an offset extraction network, second predicted offsets respectively corresponding to a plurality of preset angles, wherein each second predicted offset indicates an offset between a roof and a base in the corresponding second sample image, the plurality of second sample images are obtained by rotating a first sample image by the plurality of preset angles respectively, and the first sample image is annotated with a first real offset;
    rotating the first real offset by the plurality of preset angles respectively, to obtain second real offsets respectively corresponding to the plurality of preset angles; and
    adjusting network parameters of the offset extraction network based on the second real offsets and the second predicted offsets respectively corresponding to the plurality of preset angles.
  2. The method according to claim 1, wherein obtaining, from the plurality of second sample images by using the offset extraction network, the second predicted offsets respectively corresponding to the plurality of preset angles comprises:
    for each preset angle of the plurality of preset angles,
    rotating, by using the offset extraction network, a first image feature corresponding to the first sample image by the preset angle, to obtain a second image feature corresponding to the preset angle; and
    obtaining, based on the second image feature, a second predicted offset corresponding to the preset angle.
  3. The method according to claim 2, wherein rotating, by using the offset extraction network, the first image feature corresponding to the first sample image by the preset angle, to obtain the second image feature corresponding to the preset angle comprises:
    rotating the first image feature by the preset angle by using a spatial transformer network that is included in the offset extraction network and corresponds to the preset angle, to obtain the second image feature corresponding to the preset angle.
  4. The method according to claim 3, wherein the spatial transformer network includes a sampler that performs image rotation based on interpolation, the sampler includes a sampling grid determined based on the preset angle corresponding to the spatial transformer network, and the sampling grid is capable of characterizing a pixel correspondence between the first image feature and the second image feature; and rotating the first image feature by the preset angle by using the spatial transformer network that is included in the offset extraction network and corresponds to the preset angle, to obtain the second image feature corresponding to the preset angle comprises:
    determining, by using the sampling grid, a plurality of pixels in the first image feature that respectively correspond to the pixels in the second image feature; and
    mapping, by the sampler, pixel values of the plurality of pixels based on interpolation, to obtain pixel values respectively corresponding to the pixels in the second image feature.
  5. The method according to any one of claims 1 to 4, wherein adjusting the network parameters of the offset extraction network based on the second real offsets and the second predicted offsets respectively corresponding to the plurality of preset angles comprises:
    obtaining, according to the second real offsets and the second predicted offsets respectively corresponding to the plurality of preset angles, offset loss information respectively corresponding to the plurality of preset angles; and
    adjusting the network parameters of the offset extraction network based on the offset loss information respectively corresponding to the plurality of preset angles.
  6. The method according to any one of claims 1 to 5, wherein the first sample image is further annotated with roof area ground-truth information, and the method further comprises:
    obtaining roof area prediction information in the first sample image by using a roof area extraction network, wherein the roof area extraction network shares a feature extraction network with the offset extraction network; and
    training the roof area extraction network based on the roof area ground-truth information and the obtained roof area prediction information.
  7. The method according to claim 6, wherein the roof area extraction network and the offset extraction network belong to a same base area extraction network, the base area extraction network includes a building frame extraction network, the building frame extraction network includes the feature extraction network, and the first sample image is further annotated with building frame ground-truth information; and the method further comprises:
    obtaining building frame prediction information in the first sample image by using the building frame extraction network; and
    training the building frame extraction network based on the building frame ground-truth information and the obtained building frame prediction information.
  8. The method according to any one of claims 1 to 7, wherein a method for generating a training sample set used for training the offset extraction network comprises:
    for each region of a plurality of regions,
    acquiring one or more frames of original sample images corresponding to the region, wherein, in a case where the region corresponds to a plurality of frames of original sample images, at least two of the frames of original sample images have different acquisition angles;
    taking one frame of the original sample images corresponding to the region as the first sample image corresponding to the region, and annotating it with base area ground-truth information;
    determining the base area ground-truth information annotated on the first sample image corresponding to the region as the base area ground-truth information of each frame of the original sample images corresponding to the region; and
    obtaining the training sample set based on the original sample images and the first sample images respectively corresponding to the plurality of regions.
  9. An image processing method, comprising:
    acquiring a first target image to be processed;
    obtaining, from a plurality of second target images by using an offset extraction network, second offsets respectively corresponding to a plurality of preset angles, wherein the offset extraction network includes a network trained by the neural network training method according to any one of claims 1 to 8, each second offset indicates an offset between a roof and a base in the corresponding second target image, and the plurality of second target images are obtained by rotating the first target image by the plurality of preset angles respectively;
    for each angle of the plurality of preset angles, inversely rotating the second offset corresponding to the angle, to obtain an inverse second offset corresponding to the angle; and
    fusing the inverse second offsets respectively corresponding to the plurality of preset angles, to obtain a first offset corresponding to the first target image.
  10. The method according to claim 9, further comprising:
    obtaining a roof area in the first target image by using a roof area extraction network included in a base area extraction network, wherein the base area extraction network further includes the offset extraction network, and the base area extraction network is trained by the neural network training method according to claim 7; and
    performing translation transformation on the obtained roof area by using the first offset corresponding to the first target image, to obtain a base area corresponding to the first target image.
  11. A neural network training apparatus, comprising:
    an obtaining module, configured to obtain, from a plurality of second sample images by using an offset extraction network, second predicted offsets respectively corresponding to a plurality of preset angles, wherein each second predicted offset indicates an offset between a roof and a base in the corresponding second sample image, the plurality of second sample images are obtained by rotating a first sample image by the plurality of preset angles respectively, and the first sample image is annotated with a first real offset;
    a rotation module, configured to rotate the first real offset by the plurality of preset angles respectively, to obtain second real offsets respectively corresponding to the plurality of preset angles; and
    an adjustment module, configured to adjust network parameters of the offset extraction network based on the second real offsets and the second predicted offsets respectively corresponding to the plurality of preset angles.
  12. An image processing apparatus, comprising:
    an acquisition module, configured to acquire a first target image to be processed;
    an offset obtaining module, configured to obtain, from a plurality of second target images by using an offset extraction network, second offsets respectively corresponding to a plurality of preset angles, wherein the offset extraction network includes a network trained by the neural network training method according to any one of claims 1 to 8, each second offset indicates an offset between a roof and a base in the corresponding second target image, and the plurality of second target images are obtained by rotating the first target image by the plurality of preset angles respectively;
    an inverse rotation module, configured to, for each angle of the plurality of preset angles, inversely rotate the second offset corresponding to the angle, to obtain an inverse second offset corresponding to the angle; and
    a fusion module, configured to fuse the inverse second offsets respectively corresponding to the plurality of preset angles, to obtain a first offset corresponding to the first target image.
  13. The apparatus according to claim 12, further comprising:
    a roof area obtaining module, configured to obtain a roof area in the first target image by using a roof area extraction network included in a base area extraction network, wherein the base area extraction network further includes the offset extraction network, and the base area extraction network is trained by the neural network training method according to claim 7; and
    a translation module, configured to perform translation transformation on the obtained roof area by using the first offset corresponding to the first target image, to obtain a base area corresponding to the first target image.
  14. An electronic device, comprising:
    a processor; and
    a memory for storing processor-executable instructions,
    wherein the processor runs the executable instructions to implement the neural network training method according to any one of claims 1 to 8, and/or the image processing method according to claim 9 or 10.
  15. A computer-readable storage medium storing a computer program, wherein the computer program is used to cause a processor to execute the neural network training method according to any one of claims 1 to 8, and/or the image processing method according to claim 9 or 10.
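Claims 3 and 4 describe rotating the first image feature through a spatial transformer whose sampler maps pixels via an angle-determined sampling grid and interpolation. As an informal illustration only, a simplified single-channel sketch of such a sampler might look as follows; the grid convention and the bilinear weights are assumptions for this example, not the claimed network.

```python
import numpy as np

def rotate_feature_map(feat, angle_deg):
    """Rotate an (H, W) feature map by angle_deg about its centre using a
    sampling grid plus bilinear interpolation, as a spatial-transformer
    sampler would."""
    h, w = feat.shape
    theta = np.deg2rad(angle_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    # sampling grid: for each output pixel, the source location in the input
    sx = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    sy = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    out = np.zeros_like(feat, dtype=float)
    # bilinear interpolation over the four integer neighbours of each sample
    for dy in (0, 1):
        for dx in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            wgt = (1 - np.abs(sx - xi)) * (1 - np.abs(sy - yi))
            valid = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
            out[valid] += wgt[valid] * feat[yi[valid], xi[valid]]
    return out
```

In the claimed arrangement one such sampler, with its grid fixed by the corresponding preset angle, exists per preset angle; the sketch parameterizes the angle instead for brevity.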
PCT/CN2021/137532 2021-05-31 2021-12-13 Neural network training method and apparatus, image processing method and apparatus, device, and storage medium WO2022252557A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110602236.2A CN113344195A (en) 2021-05-31 2021-05-31 Network training and image processing method, device, equipment and storage medium
CN202110602236.2 2021-05-31

Publications (1)

Publication Number Publication Date
WO2022252557A1 true WO2022252557A1 (en) 2022-12-08

Family

ID=77473197

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/137532 WO2022252557A1 (en) 2021-05-31 2021-12-13 Neural network training method and apparatus, image processing method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN113344195A (en)
WO (1) WO2022252557A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116109966A (en) * 2022-12-19 2023-05-12 中国科学院空天信息创新研究院 Remote sensing scene-oriented video large model construction method

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344195A (en) * 2021-05-31 2021-09-03 上海商汤智能科技有限公司 Network training and image processing method, device, equipment and storage medium
CN117291857B (en) * 2023-11-27 2024-03-22 武汉精立电子技术有限公司 Image processing method, moire eliminating equipment and moire eliminating device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111898619A (en) * 2020-07-13 2020-11-06 上海眼控科技股份有限公司 Picture feature extraction method and device, computer equipment and readable storage medium
CN112396701A (en) * 2020-12-01 2021-02-23 腾讯科技(深圳)有限公司 Satellite image processing method and device, electronic equipment and computer storage medium
US20210056452A1 (en) * 2019-08-23 2021-02-25 Johnson Controls Technology Company Building system with probabilistic forecasting using a recurrent neural network sequence to sequence model
CN113344195A (en) * 2021-05-31 2021-09-03 上海商汤智能科技有限公司 Network training and image processing method, device, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11205096B2 (en) * 2018-11-19 2021-12-21 Google Llc Training image-to-image translation neural networks
CN112149585A (en) * 2020-09-27 2020-12-29 上海商汤智能科技有限公司 Image processing method, device, equipment and storage medium


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116109966A (en) * 2022-12-19 2023-05-12 中国科学院空天信息创新研究院 Remote sensing scene-oriented video large model construction method
CN116109966B (en) * 2022-12-19 2023-06-27 中国科学院空天信息创新研究院 Remote sensing scene-oriented video large model construction method

Also Published As

Publication number Publication date
CN113344195A (en) 2021-09-03

Similar Documents

Publication Publication Date Title
WO2022252557A1 (en) Neural network training method and apparatus, image processing method and apparatus, device, and storage medium
Zhang et al. Jaguar: Low latency mobile augmented reality with flexible tracking
WO2022062543A1 (en) Image processing method and apparatus, device and storage medium
US11210570B2 (en) Methods, systems and media for joint manifold learning based heterogenous sensor data fusion
CN109785298B (en) Multi-angle object detection method and system
US10929676B2 (en) Video recognition using multiple modalities
Li et al. Camera localization for augmented reality and indoor positioning: a vision-based 3D feature database approach
WO2022237811A1 (en) Image processing method and apparatus, and device
WO2022252558A1 (en) Methods for neural network training and image processing, apparatus, device and storage medium
CN114820655B (en) Weak supervision building segmentation method taking reliable area as attention mechanism supervision
CN113807361B (en) Neural network, target detection method, neural network training method and related products
CN112818969A (en) Knowledge distillation-based face pose estimation method and system
US11823432B2 (en) Saliency prediction method and system for 360-degree image
CN116453121B (en) Training method and device for lane line recognition model
WO2022206414A1 (en) Three-dimensional target detection method and apparatus
CN112434618A (en) Video target detection method based on sparse foreground prior, storage medium and equipment
WO2023083256A1 (en) Pose display method and apparatus, and system, server and storage medium
Yun et al. Panoramic vision transformer for saliency detection in 360° videos
CN113095316B (en) Image rotation target detection method based on multilevel fusion and angular point offset
CN104769643A (en) Method for initializing and solving the local geometry or surface normals of surfels using images in a parallelizable architecture
WO2022247126A1 (en) Visual localization method and apparatus, and device, medium and program
Ke et al. Dense small face detection based on regional cascade multi-scale method
CN112132880A (en) Real-time dense depth estimation method based on sparse measurement and monocular RGB (red, green and blue) image
Osuna-Coutiño et al. Structure extraction in urbanized aerial images from a single view using a CNN-based approach
US11954600B2 (en) Image processing device, image processing method and image processing system

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 21943909
Country of ref document: EP
Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

122 EP: PCT application non-entry in European phase
Ref document number: 21943909
Country of ref document: EP
Kind code of ref document: A1