CN113344180A - Neural network training and image processing method, device, equipment and storage medium - Google Patents

Neural network training and image processing method, device, equipment and storage medium

Info

Publication number
CN113344180A
CN113344180A (application number CN202110602248.5A)
Authority
CN
China
Prior art keywords
image
offset
network
base
region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110602248.5A
Other languages
Chinese (zh)
Inventor
王金旺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sensetime Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Sensetime Intelligent Technology Co Ltd filed Critical Shanghai Sensetime Intelligent Technology Co Ltd
Priority to CN202110602248.5A
Publication of CN113344180A
Priority to PCT/CN2021/137544 (WO2022252558A1)
Priority to TW111117626A (TW202248910A)
Legal status: Withdrawn

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a neural network training and image processing method, device, equipment and storage medium. The method can include obtaining a set of images; the image set includes acquired images corresponding to a plurality of regions, respectively. Under the condition that the same area corresponds to multiple frames of collected images, at least two frames of target images corresponding to the same area have different collection angles. And carrying out base region true value information annotation on at least one frame of target acquisition image corresponding to each region in the plurality of regions respectively. And for each region, determining the base region truth value information marked on the target acquisition image corresponding to the region as the base region truth value information of each frame acquisition image corresponding to the region to obtain a training sample set so as to carry out neural network training based on the training sample set.

Description

Neural network training and image processing method, device, equipment and storage medium
Technical Field
The application relates to the technical field of computers, in particular to a neural network training and image processing method, device, equipment and storage medium.
Background
With the gradual increase of the urbanization rate, buildings need to be surveyed in a timely manner to complete tasks such as city planning, map drawing and building change monitoring.
At present, a building base extraction network generated based on a neural network is mainly used for extracting a building base in a remote sensing collected image, and then building statistics is carried out by using the obtained building base.
However, the labeling cost of such data is very high, so a large number of labeled samples cannot easily be obtained, and it is difficult to train a high-precision building base extraction network with only a small number of labeled samples.
Disclosure of Invention
In view of the above, the present application at least discloses a neural network training method. The method can comprise the following steps: acquiring an image set; the image set includes acquired images respectively corresponding to a plurality of regions; under the condition that the same area corresponds to multiple frames of collected images, at least two frames of target images corresponding to the same area have different collection angles; performing base region truth value information annotation on at least one frame of target acquisition image corresponding to each region in the plurality of regions respectively; and for each region, determining the base region truth value information marked on the target acquisition image corresponding to the region as the base region truth value information of each frame acquisition image corresponding to the region to obtain a training sample set so as to carry out neural network training based on the training sample set.
In some implementations shown, the method further comprises: acquiring the training sample set; utilizing a building base extraction network to obtain a roof area and an offset corresponding to each collected image in the training sample set; wherein the offset characterizes an offset between the roof region and the base region; based on the offset obtained for each collected image, carrying out translation transformation on the roof area corresponding to the offset to obtain base areas corresponding to each collected image; and adjusting the network parameters of the building base extraction network based on the base area true value information corresponding to each acquired image and the base areas obtained by aiming at each acquired image.
In some implementations shown, the method further comprises: marking base position truth value information of the target acquisition image corresponding to each region respectively; and aiming at each region, determining the base position truth value information marked on the target acquisition image corresponding to the region as the base position truth value information of each frame acquisition image corresponding to the region to obtain a training sample set.
In some implementations shown, the method further comprises: acquiring the training sample set; acquiring a roof area, an offset and a roof position respectively corresponding to each acquired image in the training sample set by utilizing a roof area extraction network, an offset extraction network and a roof position extraction network which are included in a building base extraction network, wherein the offset represents the offset between the roof area and the base area; adjusting network parameters of the roof area extraction network based on the base area true value information corresponding to each acquired image, and the roof area and the offset acquired by each acquired image; and adjusting network parameters of the roof position extraction network and the offset extraction network based on the base position truth value information corresponding to each acquired image and the roof position and the offset obtained by each acquired image.
In some illustrated implementations, the adjusting network parameters of the roof region extraction network based on the real-value information of the base region corresponding to each of the acquired images and the roof region and offset obtained for each of the acquired images includes: for each frame of image in each collected image, translating the truth value information of the base area by using the offset corresponding to the image to obtain the truth value information of a first roof area corresponding to the image; obtaining region loss information corresponding to the image based on the first rooftop region truth information corresponding to the image and a rooftop region obtained for the image; and adjusting the network parameters of the roof area extraction network through back propagation based on the area loss information corresponding to each acquired image.
In some illustrated implementations, the adjusting network parameters of the roof position extraction network and the offset extraction network based on the real-value information of the base position corresponding to each of the acquired images and the roof position and the offset obtained for each of the acquired images includes: for each frame of image in each collected image, translating the roof position corresponding to the image by using the offset corresponding to the image to obtain the base position corresponding to the image; obtaining position loss information corresponding to the image based on the base position truth value information corresponding to the image and the base position obtained aiming at the image; and adjusting network parameters of the roof position extraction network and the offset extraction network through back propagation based on the position loss information corresponding to each acquired image.
In some implementations shown, the rooftop area extraction network, offset extraction network, and the rooftop location extraction network share a feature extraction network.
In some illustrated implementations, at least a portion of the collected images of the training sample set are further annotated with second rooftop region truth information, offset truth information, and rooftop location truth information; the method further comprises at least one of: adjusting network parameters of the roof region extraction network based on second roof region truth information labeled by the at least partially acquired image and a roof region obtained for the at least partially acquired image; adjusting network parameters of the offset extraction network based on the offset truth information of the at least partially acquired image annotation and an offset obtained for the at least partially acquired image; adjusting network parameters of the roof position extraction network based on the real-valued roof position information labeled by the at least part of the collected images and the obtained roof position of the at least part of the collected images.
In some illustrated implementations, the at least partially acquired image is further annotated with building border truth information; the method further comprises the following steps: extracting the building frame corresponding to the at least part of the acquired image by using a building frame extraction network included in the building base extraction network; wherein the building border extraction network comprises the feature extraction network; adjusting network parameters of the building border extraction network based on the building border true value information of the at least partially captured image annotation and the building border obtained for the at least partially captured image.
In some implementations shown, the method further comprises: and pre-training the building base extraction network by utilizing the collected images marked with second roof area true value information, offset true value information and roof position true value information in the training sample set.
In some implementations shown, the captured images in the training sample set further include first offset truth information; the offset indicates an offset between the roof and the base in the captured image; the method further comprises the following steps: obtaining offsets corresponding to the multiple preset angles respectively by using the offset extraction network; the multiple preset angles are used for rotating the collected image or the image characteristics corresponding to the collected image; respectively rotating the first offset real information by the multiple preset angles to obtain second offset real information respectively corresponding to the multiple preset angles; and adjusting the offset to extract the network parameters of the network by using the second offset real information and the obtained offset which respectively correspond to the plurality of preset angles.
In some illustrated implementations, the obtaining, by using the offset extraction network, offsets corresponding to a plurality of preset angles includes: respectively rotating the first image features corresponding to the acquired images by multiple preset angles by using an offset extraction network to obtain second image features respectively corresponding to the multiple preset angles; and obtaining offsets respectively corresponding to the multiple preset angles based on the second image characteristics.
The application also provides an image processing method, which comprises the following steps: receiving a remote sensing image to be processed; extracting a building roof area and an offset in the remote sensing image to be processed by utilizing a building base extraction network; the building base extraction network is obtained by training through a neural network training method shown in any one of the implementation manners; and carrying out translation transformation on the roof area by using the offset to obtain a building base area corresponding to the remote sensing image to be processed.
The present application further provides a neural network training device, including: an acquisition module for acquiring an image set; the image set includes acquired images respectively corresponding to a plurality of regions; under the condition that the same area corresponds to multiple frames of collected images, at least two frames of target images corresponding to the same area have different collection angles; the first labeling module is used for performing base region truth value information labeling on at least one frame of target acquisition image corresponding to each region in the plurality of regions; the first determining module is configured to determine, for each region, the truth value information of the pedestal region labeled in the target captured image corresponding to the region as the truth value information of the pedestal region of each frame of the captured image corresponding to the region, to obtain a training sample set, and perform neural network training based on the training sample set.
The present application also proposes an image processing apparatus including: the receiving module is used for receiving the remote sensing image to be processed; the extraction module is used for extracting a network by utilizing a building base and extracting a building roof area and an offset in the remote sensing image to be processed; the building base extraction network is obtained by training through a neural network training method shown in any one of the implementation modes; and the translation module is used for carrying out translation transformation on the roof area by utilizing the offset to obtain a building base area corresponding to the remote sensing image to be processed.
The present application further proposes an electronic device, comprising: a processor; a memory for storing processor-executable instructions; wherein the processor implements the neural network training method and/or the image processing method by executing the executable instructions.
The present application also proposes a computer-readable storage medium, which stores a computer program for causing a processor to execute the neural network training method and/or the image processing method.
In the foregoing solution, firstly, since the building base of the same area does not change, after image registration is performed on the captured images of the same area, the base area and position of the building are the same in each captured image. By labeling the base region truth information on at least one frame of target acquisition image of a region and then applying that truth information to every frame of acquisition image of the region, sample expansion is performed, that is, a large number of training samples are obtained through a small number of labeling operations.
Secondly, the building base prediction network can be trained with a training sample set obtained by sample expansion based on the characteristic that the base area of the same building does not change, which facilitates training a high-precision building base extraction network with a small number of labeled samples.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate one or more embodiments of the present application or the technical solutions in the related art, the drawings needed in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings in the following description are only some of the embodiments described in one or more embodiments of the present application, and other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is a flow chart of a method of network training illustrated in the present application;
FIG. 2 is a schematic flow chart of a neural network training method;
FIG. 3 is a schematic view of a building base area extraction process shown in the present application;
FIG. 4 is a schematic view of a building base area extraction process shown in the present application;
FIG. 5 is a schematic flow chart of a neural network training method shown in the present application;
FIG. 6 is a method flow diagram of a network training method shown in the present application;
FIG. 7 is a schematic flow chart of a neural network training method;
FIG. 8 is a schematic diagram of a building base area extraction network training process shown in the present application;
FIG. 9 is a schematic diagram of a building base extraction network training process shown in the present application;
FIG. 10 is a schematic diagram of a neural network training device according to the present application;
fig. 11 is a schematic diagram of a hardware structure of an electronic device shown in the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It should also be understood that the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining," depending on the context.
The application aims to provide a network training method. The method utilizes the characteristic that the same building base area can not change, and base area truth value information is shared among multi-frame collected images corresponding to the same area, so that the effect of expanding training samples is achieved, and further, a small amount of marked samples are utilized to train a high-precision building base extraction network.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method of network training according to the present application. The network training method can be applied to electronic equipment. The electronic equipment can execute the method by carrying a software device corresponding to the network training method. The electronic equipment can be a notebook computer, a server, a mobile phone, a PAD terminal and the like. The type of the electronic device is not particularly limited in this application. The electronic device may be a client device or a server device. The server may be a cloud. The following description will be given taking an execution body as an electronic device (hereinafter simply referred to as a device) as an example.
As shown in fig. 1, the method may include:
s102, acquiring an image set; the image set includes acquired images respectively corresponding to a plurality of regions; under the condition that the same area corresponds to multiple frames of collected images, at least two frames of target images corresponding to the same area have different collection angles.
The images of the image set may be acquired by any image acquisition device deployed to acquire images of the plurality of regions. In the multi-frame collected images collected in the same area, at least two frames of target images corresponding to the same area have different collection angles, so that the information contained in the training sample can be enriched, and the adaptability of the neural network is improved.
The captured images in the image set may be stored in a storage medium sorted by capture area. The device may retrieve the set of images from a storage medium.
In some implementations, the set of images can include multiple time phase maps acquired for the plurality of regions. The multi-time phase diagram can be a multi-frame remote sensing collected image collected aiming at the same area at different moments.
And S104, performing base region true value information annotation on at least one frame of target acquisition image corresponding to each region in the plurality of regions respectively.
The target captured image may be any image with adequate sharpness selected from the captured images.
In some implementations, at least one frame of captured image may be selected from the captured images corresponding to each region, respectively, as the target captured image. And then, marking the truth information of the base area in a pre-marking mode.
Wherein the base region truth information may be pixel-level truth information. For example, the base region truth information may set pixel points inside the building base region of the remote sensing captured image to 1 and pixel points outside the base region to 0.
S106, aiming at each region, determining the base region truth value information marked on the target acquisition image corresponding to the region as the base region truth value information of each frame acquisition image corresponding to the region to obtain a training sample set so as to carry out neural network training based on the training sample set.
In some implementations, the truth information of the base region labeled for each region in S104 may be used as the real information corresponding to each acquired image in each region, so as to achieve the purpose of expanding the training samples.
Since the building base of the same area does not change, the base area and position of the building are the same in each captured image after image registration of the captured images of that area. By labeling the base region truth information on at least one frame of target acquisition image of a region and then applying that truth information to every frame of acquisition image of the region, sample expansion is performed, that is, a large number of training samples are obtained through a small number of labeling operations, as illustrated by the sketch below.
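For illustration only, the following Python sketch shows this sample-expansion step under assumed data structures; the names image_set, base_mask and frames are hypothetical and not part of the application.

```python
def expand_samples(image_set):
    """Share one labeled base-region mask across all registered frames of a region.

    `image_set` is assumed to map a region id to that region's registered frames
    and the pixel-level base mask labeled on its target frame (1 inside, 0 outside).
    """
    training_samples = []
    for region_id, data in image_set.items():
        base_mask = data["base_mask"]            # H x W array labeled on the target frame
        for frame in data["frames"]:             # multi-temporal / multi-angle frames
            # every frame of the region reuses the same base-region truth
            training_samples.append({"image": frame, "base_mask": base_mask})
    return training_samples
```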
In some implementations, neural network training can be performed based on the resulting training sample set.
Referring to fig. 2, fig. 2 is a schematic flow chart of a neural network training method.
As shown in fig. 2, the method includes:
s202, obtaining the training sample set.
And S204, extracting a base area corresponding to each acquired image in the training sample set by using a building base extraction network.
And S206, adjusting network parameters of the building foundation extraction network based on the foundation area true value information corresponding to each acquired image and the foundation areas acquired aiming at each acquired image.
In some implementations, the device may perform S202 in response to the network training request.
In some implementations, the set of training samples may be stored in a storage medium so that the device may retrieve the stored set of training samples from the storage medium. Thereafter, the device may perform S204-S206.
At least two ways of extracting the building base may be included in the present application. First, a building base extraction network (hereinafter referred to as a base extraction network) can be used to directly extract the building base; second, the base extraction network can be used to extract the building roof and an offset indicating the offset between the roof and the base, and the roof is then transformed by the offset to obtain the base indirectly.
The training modes of the base extraction network corresponding to different modes are different. The following examples are given for the two modes.
The method comprises the following steps of (I) directly extracting a building base.
Referring to fig. 3, fig. 3 is a schematic view illustrating a building base area extraction process according to the present application.
As shown in fig. 3, the remote sensing image is input into the base extraction network, and then the base area can be directly obtained.
The base extraction network shown in fig. 3 may be a network constructed based on an object detection network. In some implementations, the object detection network may be constructed based on RCNN (Region-based Convolutional Neural Network), FAST-RCNN (Fast Region-based Convolutional Neural Network), FASTER-RCNN (Faster Region-based Convolutional Neural Network), or MASK-RCNN (Mask Region-based Convolutional Neural Network).
In some implementations, to improve the base region extraction accuracy, a MASK-RCNN that characterizes the region with higher accuracy may be employed. The MASK-RCNN may include an RPN (Region Proposal Network) unit, an RoI Align (Region of Interest Align) unit, and the like.
The RPN is used for generating candidate frames corresponding to the objects in the collected image. After the candidate frames are obtained, regression and classification of the candidate frames can be performed to obtain the frames corresponding to the buildings. The RoI Align unit is used for extracting the visual features corresponding to the building from the acquired image according to the frame corresponding to the building. Then, the base area, roof area, offset, roof position and the like are extracted from the visual features corresponding to the building according to the functional requirements of the target detection network.
When S204 is executed, the device may input each acquired image in the training sample set into the base extraction network for base extraction, to obtain a base region corresponding to each acquired image.
Then, in step S206, the base region loss information corresponding to each captured image may be obtained, using a preset loss function, from the base region truth information labeled for each captured image and the base region obtained for each captured image. The network parameters of the base extraction network are then adjusted via back propagation after the descending gradient is obtained, as sketched below.
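As a rough illustration, the following Python sketch (using PyTorch-style calls) shows one such training step for the direct base-extraction scheme. The per-pixel binary cross-entropy loss and the names base_net, image and base_mask_gt are assumptions for illustration, not details taken from the application.

```python
import torch
import torch.nn.functional as F

def train_step(base_net, optimizer, image, base_mask_gt):
    """One training step of the direct base-extraction scheme (a sketch).

    `base_net` is assumed to output a per-pixel base probability map with the
    same spatial size as `base_mask_gt` (float tensor, 1 inside the base, 0 outside).
    """
    optimizer.zero_grad()
    base_prob = base_net(image)                              # N x 1 x H x W probabilities
    loss = F.binary_cross_entropy(base_prob, base_mask_gt)   # preset loss function (assumed choice)
    loss.backward()                                          # back propagation of the descending gradient
    optimizer.step()                                         # adjust the base extraction network parameters
    return loss.item()
```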
After multiple rounds of training are performed, network training is completed, and a building base prediction network after training is obtained.
In this scheme, a training sample set obtained by sample expansion, based on the characteristic that the base area of the same building does not change, can be used for building base prediction network training, which is beneficial for training a high-precision building base extraction network with a small number of labeled samples.
And (II) indirectly extracting the building base.
Referring to fig. 4, fig. 4 is a schematic view illustrating a building base area extraction process according to the present application.
As shown in fig. 4, the roof area of the building and the offset indicating the roof-to-base offset may be obtained first by inputting each captured image into the base extraction network. The offset can then be used to transform (e.g., translate) the rooftop area to obtain the plinth area.
The base extraction network shown in fig. 4 may include a roof area extraction network and an offset extraction network. The roof area extraction network and the offset extraction network may be networks constructed based on a target detection network. The target detection network may be any one of RCNN, FAST-RCNN, FASTER-RCNN, or MASK-RCNN. In some implementations, to improve the base region extraction accuracy, a MASK-RCNN that characterizes the region with higher accuracy may be employed.
In some implementations, the roof area extraction network and the offset extraction network may share a feature extraction network. The shared feature extraction network may include a backbone network, a regional feature extraction unit, and the like. This can simplify the network structure and facilitate network training. When the roof area extraction network and the offset extraction network are MASK-RCNN, the two networks may also share the RPN, the RoI Align unit, and the like.
Referring to fig. 5, fig. 5 is a schematic flow chart of a neural network training method according to the present application.
As shown in fig. 5, a method of neural network training may include:
s502, extracting a network by using a building base to obtain a roof area and an offset corresponding to each acquired image in the training sample set; wherein the offset characterizes an offset between the roof region and the base region.
In some implementations, the roof area and the offset of each captured image may be extracted by using a roof area extraction network and an offset extraction network included in the building base extraction network.
And S504, based on the offset obtained for each collected image, carrying out translation transformation on the roof area corresponding to the offset to obtain base areas corresponding to each collected image.
In some implementations, a translation operation may be performed on each pixel point surrounded by the roof region, respectively, to obtain the base region.
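As a rough illustration of this translation transformation, the following Python sketch shifts every pixel of a binary roof mask by an integer offset to obtain the base mask; the function name and the convention that pixels shifted outside the image are dropped are assumptions for illustration.

```python
import numpy as np

def translate_mask(roof_mask, dx, dy):
    """Translate each pixel point of a binary roof mask by the offset (dx, dy)
    to obtain the base mask (a sketch)."""
    h, w = roof_mask.shape
    base_mask = np.zeros_like(roof_mask)
    ys, xs = np.nonzero(roof_mask)                 # pixel coordinates inside the roof region
    xs_new, ys_new = xs + dx, ys + dy
    keep = (xs_new >= 0) & (xs_new < w) & (ys_new >= 0) & (ys_new < h)
    base_mask[ys_new[keep], xs_new[keep]] = 1      # pixels translated outside the image are dropped
    return base_mask
```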
And S506, adjusting network parameters of the building foundation extraction network based on the foundation area true value information corresponding to each acquired image and the foundation areas acquired aiming at each acquired image.
In some implementations, the base region loss information corresponding to each acquired image may be obtained, using a preset loss function, from the base region truth information labeled for each acquired image and the base region obtained for each acquired image. The network parameters of the base extraction network are then adjusted via back propagation after the descending gradient is obtained.
After multiple rounds of training are performed, network training is completed, and a building base prediction network after training is obtained.
In this scheme, on one hand, the roof area and the offset of the building are extracted first, and the roof area is then transformed by the offset to obtain the building base area indirectly. This exploits the fact that the roof area and the offset have distinct features in remote sensing captured images, improving the base extraction accuracy, so that a high-precision building base can be obtained even when the base is occluded. On the other hand, the building base prediction network can be trained with a training sample set obtained by sample expansion based on the characteristic that the base area of the same building does not change, so a high-precision building base extraction network can be trained with a small number of labeled samples.
In some implementations, the characteristic that the base area and position of the same building do not change can be utilized, and the base region truth information and the base position truth information can be shared among the multi-frame collected images corresponding to the same area. This achieves the effect of expanding the training samples and thus facilitates training a high-precision building base extraction network with a small number of labeled samples.
Referring to fig. 6, fig. 6 is a flowchart illustrating a method of network training according to the present application. As shown in fig. 6, the method may include:
s604, marking the real value information of the base position of the target acquisition image corresponding to each region respectively.
In some implementations, the labeling of the base position truth information can be performed in advance. The base position truth information may include the coordinates of the center pixel point of the base region and the width and height of the base region. In some implementations, the base position truth information may be represented as R = (cx, cy, w, h), where cx and cy are the horizontal and vertical coordinates of the center pixel point of the base region, and w and h are the width and height of the base region.
S606, aiming at each region, determining the base position truth value information marked on the target acquisition image corresponding to the region as the base position truth value information of each frame acquisition image corresponding to the region to obtain a training sample set.
In some implementations, the base position true value information labeled for each region in S604 may be used as the real information corresponding to each acquired image in each region, so as to achieve the purpose of expanding the training sample. The acquired images in the training sample set comprise base area true value information and base position true value information.
In some implementations, neural network training can be performed based on the resulting training sample set.
Referring to fig. 7, fig. 7 is a schematic flow chart of a neural network training method.
As illustrated in fig. 7, the method may include:
s702, obtaining the training sample set.
S704, acquiring a roof area, an offset and a roof position respectively corresponding to each acquired image in the training sample set by using a roof area extraction network, an offset extraction network and a roof position extraction network which are included in a building base extraction network, wherein the offset represents the offset between the roof area and the base area;
s706, adjusting network parameters of the roof area extraction network based on the base area truth value information corresponding to each acquired image, and the roof area and the offset acquired by each acquired image;
and S708, adjusting network parameters of the roof position extraction network and the offset extraction network based on the base position truth value information corresponding to each acquired image, and the roof position and the offset obtained by each acquired image.
Wherein, S706 and S708 do not have a strict sequential execution order. For example, S706 and S708 may be performed in parallel. The present application does not specifically limit the execution order of S706 and S708.
The network training method can be applied to electronic equipment.
In some implementations, the device may execute S702 to obtain the training sample set from a storage medium in response to a network training request.
Thereafter, the device may perform S704-S708.
The building base extraction network (hereinafter referred to as base extraction network) may be a network constructed based on an object detection network. In some implementations, to improve the base region extraction accuracy, a MASK-RCNN that characterizes the region with higher accuracy may be employed.
The base extraction network may include a rooftop area extraction network, an offset extraction network, and a rooftop location extraction network. Wherein the roof region extraction network may be used to extract a building roof region. The offset extraction network may be configured to extract an offset between the roof and the base. The rooftop location extraction network may be used to extract a rooftop location. The offset can then be used to transform (e.g., translate) the rooftop area to obtain the plinth area. And the offset can also translate the roof position to obtain the base position.
Referring to fig. 8, fig. 8 is a schematic diagram illustrating a building base area extraction network training process according to the present application.
The base extraction network shown in fig. 8 includes a rooftop area extraction network, an offset extraction network, and a rooftop location extraction network. The roof area and the offset extracted by the roof area extraction network and the offset extraction network can be subjected to area conversion to obtain the base area.
When the network is trained, the network can be modified by adding a base region loss information determining branch and a base position loss information determining branch, so that the network parameters are updated according to the determined losses. The base region loss information may characterize the error between the obtained base region and the base region truth information. The base position loss information may characterize the error between the obtained base position and the base position truth information.
In some implementations, in S706, S7062 may be executed to translate, for each frame of image in the collected images, the base region truth information by the offset corresponding to the image, so as to obtain the first roof region truth information corresponding to the image. S7064 may then be executed to obtain the region loss information corresponding to the image based on the first roof region truth information corresponding to the image and the roof region obtained for the image. S7066 may be executed to adjust the network parameters of the roof area extraction network through back propagation based on the region loss information corresponding to each of the acquired images.
In the method for determining the base region loss described in S502 to S506, the extracted roof region needs to be translated by the offset to obtain the base region, and the base region loss information is then calculated using the base region truth information.
However, the extracted roof region typically has a preset size, for example 14 x 14. In this case, if the predicted offset is too large, pixel points of the roof region may be translated outside the matrix of the preset size when the roof region is translated, which leads to information loss, prevents accurate region loss information from being obtained, and prevents the network from converging.
In the solution described in S7062-S7066, the base region truth information is pixel-level truth information, that is, 0 or 1 is labeled for each pixel point in the acquired image, where a pixel point labeled 1 is regarded as being inside the base region and a pixel point labeled 0 as being outside it. When the base region truth information is translated, no matter how large the extracted offset is, the truth information will, with high probability, remain inside the corresponding acquired image, so no truth information is lost, the region loss information can be obtained accurately, and smooth convergence of the network can be ensured.
After the region loss information is obtained, the network parameters of the roof region extraction network can be adjusted by calculating the descending gradient and performing back propagation. In this way, training of the roof area extraction network can be realized.
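A minimal sketch of this region-loss computation is given below, assuming the offset points from the roof to the base so that the base truth mask is shifted by its negation; torch.roll (which wraps around at the border) is used only as a simple stand-in for the translation, and the binary cross-entropy penalty is an assumed choice.

```python
import torch
import torch.nn.functional as F

def roof_region_loss(base_mask_gt, pred_offset, pred_roof_prob):
    """Region loss for the roof branch (a sketch).

    The base-region truth (float H x W mask) is translated by the predicted
    offset to obtain the first roof-region truth, which then supervises the
    predicted roof probability map.
    """
    dx = int(round(float(pred_offset[0])))
    dy = int(round(float(pred_offset[1])))
    roof_mask_gt = torch.roll(base_mask_gt, shifts=(-dy, -dx), dims=(0, 1))
    return F.binary_cross_entropy(pred_roof_prob, roof_mask_gt)
```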
In some implementations, in S708, S7082 may be executed, in which, for each frame of the captured images, the roof position corresponding to the image is translated by the offset corresponding to the image to obtain the base position corresponding to the image. S7084 may then be executed to obtain the position loss information corresponding to the image based on the base position truth information corresponding to the image and the base position obtained for the image. S7086 may be executed to adjust the network parameters of the roof position extraction network and the offset extraction network through back propagation based on the position loss information corresponding to each of the acquired images.
In some implementations, R0 = (cx0, cy0, w0, h0) may represent the extracted roof position, where cx0 and cy0 are the horizontal and vertical coordinates of the center pixel point of the roof region, and w0 and h0 are the width and height of the roof region. O0 = (Δx, Δy) may represent the extracted offset, where Δx and Δy are the displacements of the pixel points along the X-axis and Y-axis, respectively. The base position is then obtained as F0 = (cx0 + Δx, cy0 + Δy, w0, h0). Then, using a preset loss function (such as a cross entropy loss function), the base position loss information can be obtained from the base position truth information.
After the loss information is obtained, the descending gradient can be calculated and the network parameters updated through back propagation. Because the roof position and the offset obtained by the roof position extraction network and the offset extraction network are needed when the base position is derived, both the roof position extraction network and the offset extraction network are updated in the back propagation process.
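For illustration, a small Python sketch of this position loss is given below. The Smooth-L1-style penalty is an assumed choice (the application mentions both cross-entropy and Smooth L1 losses for position supervision), and all names are hypothetical.

```python
def base_position_loss(roof_pos, offset, base_pos_gt):
    """Translate the predicted roof position R0 by the predicted offset O0 to get
    the base position F0, then compare F0 with the base position truth (a sketch)."""
    cx0, cy0, w0, h0 = roof_pos          # R0 = (cx0, cy0, w0, h0)
    dx, dy = offset                      # O0 = (Δx, Δy)
    f0 = (cx0 + dx, cy0 + dy, w0, h0)    # F0 = (cx0+Δx, cy0+Δy, w0, h0)

    def smooth_l1(a, b):
        d = abs(a - b)
        return 0.5 * d * d if d < 1.0 else d - 0.5

    return sum(smooth_l1(p, t) for p, t in zip(f0, base_pos_gt))
```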
In the embodiment, the extended training sample set can be used for training the roof area extraction network, the roof position extraction network and the offset extraction network so as to complete the training of the base area extraction network and obtain the high-precision base area extraction network.
In some implementations, the roof area extraction network, the offset extraction network and the roof position extraction network can share a feature extraction network, such as a backbone network, a regional feature extraction unit, and the like. This can simplify the network structure and facilitate network training. In some implementations, when the roof region extraction network and the offset extraction network are MASK-RCNN, the roof area extraction network, the offset extraction network and the roof position extraction network may further share the RPN (Region Proposal Network), the RoI Align (Region of Interest Align) unit, and the like.
Therefore, when the parameters of the three extraction networks are adjusted, the shared characteristic extraction networks can be adjusted, so that the training processes can be mutually constrained and promoted, and the network training efficiency is improved; on the other hand, the shared feature extraction network can extract features more beneficial to the extraction of the base region, so that the accuracy of the extraction of the base region is improved.
In some implementation manners, the network training efficiency and the network prediction accuracy can be improved in a joint training manner.
At least part of the acquired images of the training sample set may further be labeled with at least one of the following information: second rooftop area truth information, offset truth information, and rooftop location truth information.
In some implementations, manual labeling can be used to label the roof area, the offset, and the true roof location information.
When training the network through the training sample set, at least one of the following items can be included:
s802, adjusting network parameters of the roof area extraction network based on the second roof area truth value information labeled by the at least part of collected images and the roof area obtained by aiming at the at least part of collected images.
S804, adjusting the network parameters of the offset extraction network based on the offset truth information of the at least part of collected image annotation and the offset obtained aiming at the at least part of collected image annotation.
S806, adjusting network parameters of the roof position extraction network based on the real-value information of the roof position marked by the at least part of the collected images and the roof position obtained by aiming at the at least part of the collected images.
In some implementations, when performing S802, loss information may be obtained from the roof area truth information and the obtained roof area by using a preset loss function (e.g., a cross entropy loss function). The gradient is then calculated from the obtained loss information, and back propagation is performed to adjust the network parameters of the roof region extraction network.
When performing S804, a preset loss function (e.g., an MSE (Mean Squared Error) loss function) may be adopted to obtain the loss information from the offset truth information and the obtained offset. The gradient is then calculated from the obtained loss information, and back propagation is performed to update the network parameters of the offset extraction network.
When performing S806, loss information may be obtained from the roof position truth information and the obtained roof position by using a preset loss function (e.g., a Smooth L1 loss function). The gradient is then calculated from the obtained loss information, and back propagation is performed to update the network parameters of the roof position extraction network.
In this example, by jointly training the roof region, roof position and offset extraction networks that share the feature extraction network, on one hand, learning information from multiple aspects can be introduced, so that the training processes constrain and promote one another, improving the network training efficiency; on the other hand, the shared feature extraction network can extract features more beneficial to base region extraction, thereby improving the accuracy of base region extraction.
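A minimal sketch of such a joint loss over the three supervised branches is given below, mirroring the loss choices mentioned above (cross entropy for the roof region, MSE for the offset, Smooth L1 for the roof position); the equal weighting and all names are assumptions for illustration.

```python
import torch.nn.functional as F

def joint_loss(outputs, targets):
    """Joint-training loss over the three branches sharing the feature
    extraction network (a sketch)."""
    loss_region = F.binary_cross_entropy(outputs["roof_prob"], targets["roof_mask"])
    loss_offset = F.mse_loss(outputs["offset"], targets["offset"])
    loss_position = F.smooth_l1_loss(outputs["roof_pos"], targets["roof_pos"])
    return loss_region + loss_offset + loss_position   # equal weights assumed
```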
Referring to fig. 4, when the base extraction network shown in fig. 4 is trained, the sample labeling cost is high, so a large number of labeled samples containing offset truth information cannot easily be obtained, and it is difficult to train a high-precision base extraction network with only a small number of labeled samples.
In some implementations, the captured images in the training sample set further include first offset truth information; the offset indicates an offset between the roof and the base in the captured image.
When the offset extraction network is trained by using the training sample set, S402 may be executed to obtain offsets corresponding to a plurality of preset angles by using the offset extraction network; the multiple preset angles are used for rotating the collected image or the image characteristics corresponding to the collected image.
The collected image may be a remote sensing image labeled with the first offset truth information. The offset refers to the offset between the roof and the base in the image. For example, if the roof includes 10 pixel points, the base can be obtained by translating those 10 pixel points according to the offset.
The first offset truth information may be information indicating the actual offset between the roof and the base of the building in the captured image. For example, the offset truth information may be information in the form of an (x, y) vector, where x and y represent the displacement between a pixel point of the roof region and the pixel point at the corresponding position of the base region along the X-axis and Y-axis, respectively. In some implementations, the offset labeling can be performed in advance according to the real offset between the roof and the base of the building in the captured image. The present application does not specifically limit the manner of labeling the offset.
The preset angles can be set according to the service requirements. The number of preset angles can be determined according to the sample size to be expanded. For example, if a large number of samples need to be expanded, a large number of preset angles can be set. The present application does not specifically limit the number and values of the preset angles. The multiple preset angles are used for rotating the collected image or the image features corresponding to the collected image.
In some implementations, in performing S402, a rotation matrix may be generated by using preset angles. And then shifting each pixel point included in the acquired image by using the rotation matrix to obtain the rotated acquired image. Then, the rotated captured images may be input to the offset extraction network, and offsets corresponding to the captured images may be extracted. It should be noted that, in some implementations, when the captured image is rotated, image features obtained after the captured image is subjected to feature extraction by using the feature extraction network may be rotated. Therefore, the calculation amount in the rotation process can be reduced, the rotation error introduced when the features of the rotated image are extracted can be reduced, and the network training effect can be improved.
Then, S404 may be executed to rotate the first offset truth information by the multiple preset angles, respectively, so as to obtain second offset truth information corresponding to the multiple preset angles, respectively.
In some implementations, when S404 is executed, the rotation matrices corresponding to the preset angles may be used to rotate the first offset truth information corresponding to each acquired image, so as to obtain the second offset truth information corresponding to the acquired image after rotation by each of the preset angles.
Then, S406 may be executed to adjust the network parameters of the offset extraction network by using the second offset truth information and the obtained offsets corresponding to the plurality of preset angles, respectively.
In some implementations, in S406, offset loss information corresponding to the acquired image rotated by each of the preset angles may be obtained, using a preset loss function (e.g., a cross entropy loss function), from the corresponding second offset truth information and the obtained offset. Then, based on the offset loss information corresponding to the acquired image rotated by the preset angles, a total loss is determined by means such as summation, product or averaging, the descending gradient is calculated from the determined total loss, and the network parameters of the offset extraction network are adjusted through back propagation, as sketched below.
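As a rough illustration of this rotation-based expansion of the offset supervision, the following Python sketch rotates the first offset truth by each preset angle with a 2-D rotation matrix and accumulates a loss against the offsets predicted for the rotated inputs; the squared-error penalty and summation over angles are assumed choices.

```python
import numpy as np

def rotated_offset_losses(offset_gt, predicted_offsets, angles_deg):
    """Total offset loss over multiple preset rotation angles (a sketch).

    `offset_gt` is the first offset truth (dx, dy); `predicted_offsets[i]` is the
    offset predicted after rotating the image (or its features) by `angles_deg[i]`.
    """
    total = 0.0
    for angle, pred in zip(angles_deg, predicted_offsets):
        theta = np.deg2rad(angle)
        rot = np.array([[np.cos(theta), -np.sin(theta)],
                        [np.sin(theta),  np.cos(theta)]])
        gt_rotated = rot @ np.asarray(offset_gt, dtype=float)   # second offset truth for this angle
        total += float(np.sum((np.asarray(pred, dtype=float) - gt_rotated) ** 2))
    return total
```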
In this scheme, the offset extraction network can be used to obtain the offsets corresponding to the multiple preset angles, and the first offset truth information is rotated by the multiple preset angles to obtain the second offset truth information corresponding to the preset angles; the network parameters of the offset extraction network can then be adjusted by using the second offset truth information corresponding to the preset angles and the obtained offsets.
Therefore, the property that the offset rotates by the same angle as the image can be utilized: by rotating the image (or its image features) together with the offset truth information, the effect of expanding the image samples carrying offset truth information is achieved, so that a high-precision offset extraction network can be trained with a small amount of labeled data annotated with offsets.
In the process of rotating the acquired image and the offset truth information, other information contained in the image is also rotated. When the rotated acquired image is used to train the base extraction network, the other branches of the base extraction network need to fit this other information of the rotated acquired image to complete training, which increases the training time and reduces the training efficiency.
In some implementation modes, the rotation process of the collected image can be placed in the offset extraction network, so that the collected image can be rotated inside the offset extraction network, the training of other branches cannot be influenced, namely, the convergence speed of other branches cannot be influenced, and the network training efficiency is improved.
In the step of S402, S4022 may be executed, and the first image features corresponding to the acquired image are respectively rotated by a plurality of preset angles by using an offset extraction network, so as to obtain second image features respectively corresponding to the plurality of preset angles. Then, S4024 may be executed to obtain offsets corresponding to the plurality of preset angles, respectively, based on the second image feature.
The first image feature may be an image feature obtained by extracting features from the acquired image through a plurality of convolution layers, pooling layers, and the like. In some implementations, the offset extraction network may be a network constructed based on MASK-RCNN. The offset extraction network can extract features of the collected image through the backbone network and the RoI Align unit to obtain the first image feature. In some implementations, the aforementioned image features can be characterized by a feature map.
In some implementations, when S4022 is executed, the position of each pixel in the first image feature may be transformed by using the rotation matrices corresponding to the multiple preset angles, so as to obtain the second image features. Then, when S4024 is executed, the offset extraction results may be obtained through, for example, several convolutional layers, pooling layers, fully-connected layers, and a mapping unit (e.g., softmax), as sketched below.
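For illustration only, a small Python sketch of rotating the first image feature into the second image features is given below. scipy.ndimage.rotate is used purely as a stand-in; in actual training a differentiable rotation (for example, a spatial transformer) would be needed so that gradients can propagate back through the rotation, and the C x H x W feature layout is an assumption.

```python
from scipy import ndimage

def rotate_features(first_feature, angles_deg):
    """Rotate a C x H x W first image feature by each preset angle to obtain
    the second image features (a sketch; bilinear interpolation, size kept)."""
    second_features = []
    for angle in angles_deg:
        rotated = ndimage.rotate(first_feature, angle, axes=(1, 2),
                                 reshape=False, order=1)
        second_features.append(rotated)
    return second_features
```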
When the network shown in fig. 4 is trained, the captured image is only rotated in the offset extraction network, and the roof area extraction network is still trained by using the non-rotated captured image. Therefore, the acquired image can be subjected to rotation change in the offset extraction network, and the training of other branches is not influenced.
In some implementation manners, in order to train the offset extraction network, the spatial transformation network may be used to perform image rotation, so that the rotation process becomes conductive, the gradient can be normally propagated in the reverse direction, and the network can be directly trained.
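As an illustration only, the feature rotation inside the offset extraction network could be made differentiable with a spatial-transformer-style sampling, for example PyTorch's affine_grid/grid_sample. The head structure below is an assumption, not the application's exact design, and the sign convention of the rotation would need to be checked against how the offset is annotated.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def rotate_feature(feat, angle_deg):
    """Differentiably rotate a feature map of shape (N, C, H, W) by a preset angle."""
    theta = math.radians(angle_deg)
    rot = torch.tensor([[math.cos(theta), -math.sin(theta), 0.0],
                        [math.sin(theta),  math.cos(theta), 0.0]],
                       dtype=feat.dtype, device=feat.device)
    rot = rot.unsqueeze(0).expand(feat.size(0), -1, -1)      # (N, 2, 3) affine matrices
    grid = F.affine_grid(rot, list(feat.size()), align_corners=False)
    return F.grid_sample(feat, grid, align_corners=False)    # gradients flow through the sampling

class OffsetBranch(nn.Module):
    """Illustrative offset branch: rotate the first image features by each preset
    angle to obtain second image features, then regress one offset per angle."""
    def __init__(self, in_channels=256, angles=(0, 90, 180, 270)):
        super().__init__()
        self.angles = angles
        self.conv = nn.Sequential(nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU())
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, 2))

    def forward(self, roi_feat):                              # roi_feat: (N, C, H, W) RoI features
        return {a: self.head(self.conv(rotate_feature(roi_feat, a))) for a in self.angles}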
In some implementation manners, building frame information can be introduced in the network training process to form constraints on network training, so that the network training efficiency is improved, and the extraction of the features related to the building from the feature extraction network is facilitated.
The at least part of the collected image is also marked with real value information of the building frame. The building frame information may be coordinates of a central pixel point in the building area and width and height information of the building area.
When base extraction network training is carried out, a building frame extraction network included by the building base extraction network can be utilized to extract a building frame corresponding to at least part of the collected images; wherein the building border extraction network comprises the feature extraction network. Network parameters of the building bounding box extraction network may then be adjusted based on the at least partially captured image annotated building bounding box truth information and the building bounding box obtained for the at least partially captured image.
Therefore, building frame information can be introduced during network training, and the four extraction networks share the feature extraction network due to the roof area, the position, the offset and the building frame, so that on one hand, the four extraction networks can be mutually associated, the shared feature extraction network can share the supervision information of each task, and the convergence of the network is accelerated; on the other hand, the three extraction networks of the roof area, the position and the offset can feel the complete characteristics of the building area, and the extraction performance is further improved.
In some implementations, network training efficiency may be improved through pre-training.
In some implementations, the building base extraction network can be pre-trained using the acquired images in the training sample set that are annotated with second roof region truth information, offset truth information, and roof position truth information.
The pre-training process may follow the network training process of any of the foregoing implementations. In some implementations, to achieve a better pre-training effect, the pre-training may itself be a joint training. In that case, at least part of the acquired images in the training sample set carry six items of truth information: roof region, roof position, base region, base position, offset, and building frame. The base extraction network may include six branches sharing the feature extraction network, corresponding respectively to the roof region, roof position, offset, building frame, base region loss information, and base position loss information.
During the pre-training, the at least part of the acquired images in the training sample set are input into the base extraction network to obtain the output results of the six branches. Loss information is then obtained from the six items of truth information annotated on those images and the output results, and the network parameters are updated accordingly. In this way, the six branches are trained jointly, which improves the training efficiency and the training effect of the base extraction network.
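A minimal sketch of one joint pre-training step is given below, under simplifying assumptions: the network returns a dict with roof mask logits, roof position, offset, and building frame; base supervision is applied by translating the roof position by the offset; and a base-region term is only indicated in a comment. The loss choices and key names are illustrative, not the application's prescribed formulation.

import torch
import torch.nn.functional as F

def joint_pretrain_step(model, optimizer, image, targets):
    """One joint pre-training step on a fully annotated acquired image."""
    out = model(image)  # branch outputs computed from shared backbone / RPN / RoI Align features
    # Predicted base position = predicted roof position translated by the predicted offset.
    pred_base_position = out["roof_position"] + out["offset"]
    loss = (F.binary_cross_entropy_with_logits(out["roof_mask"], targets["roof_mask"])
            + F.smooth_l1_loss(out["roof_position"], targets["roof_position"])
            + F.smooth_l1_loss(out["offset"], targets["offset"])
            + F.smooth_l1_loss(out["building_frame"], targets["building_frame"])
            + F.smooth_l1_loss(pred_base_position, targets["base_position"]))
    # A base-region term (predicted roof mask translated by the offset, compared with
    # the base-region truth) would be added analogously.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()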
In some implementations, after the pre-training is completed, the annotated acquired images and the unannotated images in the training sample set can be input into the network in random order for further training.
This provides a reasonable network training scheme: the network is first systematically pre-trained by joint training on the annotated acquired images, which are rich in truth information, and the annotated and unannotated images are then mixed to fine-tune the parameters of the base extraction network. On the one hand, this helps train a high-precision base extraction network from a small number of annotated acquired images; on the other hand, it improves network training efficiency.
The following embodiments are described with reference to specific training scenarios.
Referring to fig. 9, fig. 9 is a schematic diagram illustrating a building base extraction network training process according to the present application. The training method in this example can be deployed in any type of electronic device.
The base extraction network shown in FIG. 9 is constructed based on Mask R-CNN. The network includes six branches that respectively determine the roof region, roof position, offset, building frame, base region loss information, and base position loss information. The six branches share a backbone network, an RPN proposal generation network (hereinafter referred to as RPN), and an RoI Align region feature extraction unit (hereinafter referred to as RoI Align). The backbone network may be, for example, a VGG (Visual Geometry Group) network, a ResNet (Residual Network), or an HRNet (High-Resolution Network), and is not particularly limited in this application.
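The shared-trunk, multi-branch layout could be sketched as follows; the backbone, RPN, and RoI Align arguments are stand-ins for the Mask R-CNN components, and the branch heads are placeholders rather than the network actually used in FIG. 9.

import torch.nn as nn

class BaseExtractionNet(nn.Module):
    """Illustrative skeleton: one shared trunk feeding several task branches."""
    def __init__(self, trunk, rpn, roi_align, feat_channels=256):
        super().__init__()
        self.trunk, self.rpn, self.roi_align = trunk, rpn, roi_align
        def head(out_dim):   # placeholder branch head
            return nn.Sequential(nn.Conv2d(feat_channels, 256, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, out_dim))
        self.roof_position_head = head(2)                   # roof position
        self.offset_head = head(2)                          # offset between roof and base
        self.frame_head = head(4)                           # building frame (cx, cy, w, h)
        self.roof_mask_head = nn.Conv2d(feat_channels, 1, 1)  # roof region mask logits

    def forward(self, image):
        feats = self.trunk(image)
        rois = self.rpn(feats)
        roi_feats = self.roi_align(feats, rois)  # shared region features feed every branch
        # Base region / base position losses are formed from these outputs plus the
        # base truth, by translating the roof outputs with the predicted offset.
        return {"roof_mask": self.roof_mask_head(roi_feats),
                "roof_position": self.roof_position_head(roi_feats),
                "offset": self.offset_head(roi_feats),
                "building_frame": self.frame_head(roi_feats)}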
Before training the network, groups of multi-temporal images (registered acquired images) may be obtained for multiple regions. At least one frame is then selected from each group of multi-temporal images for manual annotation, yielding a small number of annotated images. Each annotated image may carry six items of truth information: roof region, roof position, base region, base position, offset, and building frame. It will be appreciated that the unannotated images in a multi-temporal group can share the base region and base position truth information of the annotated image, because the base region and base position of the same building do not change across the multi-temporal images.
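A minimal sketch of assembling the training sample set from the multi-temporal groups, propagating the base truth annotated on one frame to all frames of the same region, might look as follows; the data layout (dicts keyed by image path) is purely an assumption.

def build_training_set(multi_temporal_groups, annotations):
    """multi_temporal_groups: dict region_id -> list of registered image paths.
    annotations: dict image_path -> truth dict for the manually annotated frame(s),
    assumed to contain at least 'base_region' and 'base_position'."""
    samples = []
    for region_id, frames in multi_temporal_groups.items():
        labeled = [p for p in frames if p in annotations]
        if not labeled:
            continue  # a region needs at least one annotated frame
        shared = {k: annotations[labeled[0]][k] for k in ("base_region", "base_position")}
        for path in frames:
            sample = {"image": path, "region": region_id, **shared}
            # Manually annotated frames additionally keep their own roof/offset/frame truth.
            sample.update(annotations.get(path, {}))
            samples.append(sample)
    return samples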
During network training, the base extraction network can first be pre-trained with the annotated images in a joint training manner.
In the pre-training process, the following steps are executed for a number of rounds determined by the number of pre-training iterations:
Each annotated image is input into the network to obtain the roof region, roof position, offset, and building frame corresponding to it.
Loss information for the four corresponding branches is then obtained from the roof region, roof position, offset, and building frame truth information of each annotated image and the corresponding outputs, and the network parameters of these four branches are updated through back propagation.
Base region and base position loss information is further determined from the roof region, roof position, and offset obtained for each image together with the base region and base position truth information, and the network parameters of the three branches extracting the roof region, roof position, and offset are adjusted through back propagation.
Because joint training is adopted in the pre-training process, multiple kinds of supervision are introduced; the learning tasks constrain and promote one another, which improves training efficiency, so that a network with a reasonable extraction effect can be obtained preliminarily from only a few annotated images.
After the pre-training is finished, the annotated and unannotated images can be mixed and input into the base extraction network in random order for training.
If an annotated image is input, joint training is performed as in the pre-training process.
If an unannotated image is input, the network is used to obtain the roof region, roof position, and offset corresponding to it. Base region and base position loss information is then obtained using the shared base region and base position truth information, and the network parameters of the three branches extracting the roof region, roof position, and offset are updated through back propagation.
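For the unannotated frames only the shared base truth is available, so a training step might look like the following sketch, under the same simplifying assumptions and illustrative names as above; the base-region term is again only indicated in a comment.

import torch.nn.functional as F

def unlabeled_step(model, optimizer, image, shared_base_truth):
    """Training step for an unannotated image using only the shared base truth."""
    out = model(image)
    pred_base_position = out["roof_position"] + out["offset"]
    loss = F.smooth_l1_loss(pred_base_position, shared_base_truth["base_position"])
    # A base-region term (roof mask translated by the offset vs. the shared base-region
    # truth) would be added here; only the roof region, roof position and offset branches
    # (together with the shared trunk) receive gradients from these losses.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()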
In this way, a high-precision base extraction network can be obtained by fine-tuning the parameters of the pre-trained network with both the annotated and the unannotated images.
With this scheme of pre-training in a joint training manner followed by mixed training: first, network training efficiency is improved, so a network with a good extraction effect can be obtained from a small number of annotated images, reducing the dependence on annotation work; second, the shared feature extraction network (including the backbone network and the region feature extraction unit) is encouraged to extract features that are more useful for base region extraction, improving the accuracy of base region extraction; third, the three branches extracting the roof region, roof position, and offset can perceive the complete features of the building region, improving branch extraction performance.
After the trained building base extraction network is obtained through the implementation mode, the building base extraction can be carried out on the remote sensing image to be processed through the network. The specific implementation process may include:
receiving a remote sensing image to be processed;
extracting a building roof area and an offset in the remote sensing image to be processed by utilizing a building base extraction network; the building base extraction network is obtained by training through a neural network training method shown in any one of the implementation modes;
and carrying out translation transformation on the roof area by using the offset to obtain a building base area corresponding to the remote sensing image to be processed.
The remote sensing image to be processed can be a remote sensing image acquired by acquisition equipment deployed on site. In some implementations, the trained building base extraction network may be a network as shown in fig. 9.
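At inference time, translating the extracted roof region by the extracted offset might look like the following sketch, assuming the roof region is a binary mask, the offset is given in pixels with offset[0] horizontal and offset[1] vertical, and the offset is rounded to whole pixels; this representation is an assumption, not a requirement of this application.

import torch

def roof_to_base(roof_mask, offset):
    """Translate a binary roof mask (H, W) by an offset (dx, dy) in pixels to obtain the base mask."""
    dx, dy = int(round(float(offset[0]))), int(round(float(offset[1])))
    base = torch.roll(roof_mask, shifts=(dy, dx), dims=(0, 1))
    # torch.roll wraps around the border; zero out the wrapped rows/columns.
    if dy > 0: base[:dy, :] = 0
    elif dy < 0: base[dy:, :] = 0
    if dx > 0: base[:, :dx] = 0
    elif dx < 0: base[:, dx:] = 0
    return base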
On the one hand, a high-precision building base extraction network can be trained with a small number of annotated samples, which reduces network training cost, improves network training efficiency, and lowers the cost of base extraction. On the other hand, using the high-precision base extraction network for base extraction improves the extraction accuracy of building bases and, in turn, the accuracy of building statistics.
Corresponding to any implementation manner, the application also provides a neural network training device 100.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a neural network training device according to the present application.
As shown in fig. 10, the apparatus 100 may include:
an obtaining module 101, configured to obtain an image set; the image set includes acquired images respectively corresponding to a plurality of regions; under the condition that the same area corresponds to multiple frames of collected images, at least two frames of target images corresponding to the same area have different collection angles;
a first labeling module 102, configured to perform base region truth value information labeling on at least one frame of target captured image corresponding to each of the plurality of regions;
a first determining module 103, configured to determine, for each region, the base region truth value information marked on the target acquired image corresponding to the region as the base region truth value information of each frame of acquired image corresponding to the region, to obtain a training sample set, and perform neural network training based on the training sample set.
In some implementations shown, the apparatus 100 further comprises:
a first training module 106, configured to obtain the training sample set;
utilizing a building base extraction network to obtain a roof area and an offset corresponding to each collected image in the training sample set; wherein the offset characterizes an offset between the roof region and the base region;
based on the offset obtained for each collected image, carrying out translation transformation on the roof area corresponding to the offset to obtain base areas corresponding to each collected image;
and adjusting the network parameters of the building base extraction network based on the base area truth value information corresponding to each acquired image and the base areas obtained for each acquired image.
In some implementations shown, the apparatus 100 further comprises:
a second labeling module 104, configured to perform base position true value information labeling on the target captured image corresponding to each region respectively;
a second determining module 105, configured to determine, for each region, the true-value information of the base position marked by the target acquired image corresponding to the region as the true-value information of the base position of each frame of the acquired image corresponding to the region, so as to obtain a training sample set.
In some implementations shown, the apparatus 100 further comprises:
a second training module 107, configured to obtain the training sample set;
acquiring a roof area, an offset and a roof position respectively corresponding to each acquired image in the training sample set by utilizing a roof area extraction network, an offset extraction network and a roof position extraction network which are included in a building base extraction network, wherein the offset represents the offset between the roof area and the base area;
adjusting network parameters of the roof area extraction network based on the base area truth value information corresponding to each acquired image and the roof area and the offset obtained for each acquired image;
and adjusting network parameters of the roof position extraction network and the offset extraction network based on the base position truth value information corresponding to each acquired image and the roof position and the offset obtained for each acquired image.
In some implementations shown, the second training module 107 is configured to:
for each frame of image in each collected image, translating the truth value information of the base area by using the offset corresponding to the image to obtain the truth value information of a first roof area corresponding to the image;
obtaining region loss information corresponding to the image based on the first rooftop region truth information corresponding to the image and a rooftop region obtained for the image;
and adjusting the network parameters of the roof area extraction network through back propagation based on the area loss information corresponding to each acquired image.
In some implementations shown, the second training module 107 is configured to:
for each frame of image in each collected image, translating the roof position corresponding to the image by using the offset corresponding to the image to obtain the base position corresponding to the image;
obtaining position loss information corresponding to the image based on the base position truth value information corresponding to the image and the base position obtained aiming at the image;
and adjusting network parameters of the roof position extraction network and the offset extraction network through back propagation based on the position loss information corresponding to each acquired image.
In some implementations shown, the rooftop area extraction network, offset extraction network, and the rooftop location extraction network share a feature extraction network.
In some illustrated implementations, at least a portion of the collected images of the training sample set are further annotated with second rooftop region truth information, offset truth information, and rooftop location truth information;
the apparatus 100 further comprises at least one of:
a first adjusting module, configured to adjust network parameters of the roof region extraction network based on second roof region truth information labeled by the at least part of the collected images and a roof region obtained for the at least part of the collected images;
a second adjusting module, configured to adjust a network parameter of the offset extraction network based on the true offset value information of the at least part of the acquired image annotation and an offset obtained for the at least part of the acquired image;
and the third adjusting module is used for adjusting the network parameters of the roof position extraction network based on the real value information of the roof position marked by the at least part of the collected images and the roof position obtained aiming at the at least part of the collected images.
In some illustrated implementations, the at least partially acquired image is further annotated with building border truth information; the apparatus 100 further comprises:
the extraction module is used for extracting, by using a building frame extraction network included in the building base extraction network, a building frame corresponding to the at least part of the acquired images; wherein the building frame extraction network comprises the feature extraction network;
a fourth adjusting module, configured to adjust a network parameter of the building frame extraction network based on the real-valued information of the building frame labeled by the at least partially acquired image and the building frame obtained for the at least partially acquired image.
In some implementations shown, the apparatus 100 further comprises:
and the pre-training module is used for pre-training the building foundation extraction network by utilizing the collected images marked with second roof area true value information, offset true value information and roof position true value information in the training sample set.
In some implementations shown, the captured images in the training sample set further include first offset truth information; the offset indicates an offset between the roof and the base in the captured image; the device further comprises:
the offset obtaining module is used for utilizing the offset extraction network to obtain offsets corresponding to various preset angles respectively; the multiple preset angles are used for rotating the collected image or the image characteristics corresponding to the collected image;
the selection module is used for respectively rotating the first offset real information by the multiple preset angles to obtain second offset real information respectively corresponding to the multiple preset angles;
and the fourth adjusting module is used for adjusting the network parameters of the offset extraction network by using the second offset real information and the obtained offset which respectively correspond to the multiple preset angles.
In some illustrated implementations, the obtaining, by using the offset extraction network, offsets corresponding to a plurality of preset angles includes:
respectively rotating the first image features corresponding to the acquired images by multiple preset angles by using an offset extraction network to obtain second image features respectively corresponding to the multiple preset angles;
and obtaining offsets respectively corresponding to the multiple preset angles based on the second image characteristics.
Corresponding to any one of the implementation modes, the application also provides an image processing device. The apparatus may include:
the receiving module is used for receiving the remote sensing image to be processed;
the extraction module is used for extracting a network by utilizing a building base and extracting a building roof area and an offset in the remote sensing image to be processed; the building base extraction network is obtained by training through a neural network training method shown in any one of the implementation modes;
and the translation module is used for carrying out translation transformation on the roof area by utilizing the offset to obtain a building base area corresponding to the remote sensing image to be processed.
The embodiments of the neural network training device and/or the image processing device shown in the present application can be applied to electronic devices. Accordingly, the present application discloses an electronic device, which may comprise: a processor.
A memory for storing processor-executable instructions.
Wherein the processor is configured to invoke executable instructions stored in the memory to implement the aforementioned neural network training method and/or image processing method.
Referring to fig. 11, fig. 11 is a schematic diagram of a hardware structure of an electronic device shown in the present application.
As shown in fig. 11, the electronic device may include a processor for executing instructions, a network interface for making network connections, a memory for storing operational data for the processor, and a non-volatile memory for storing instructions corresponding to the neural network training apparatus and/or the image processing apparatus.
The embodiments of the apparatus may be implemented by software, or by hardware, or by a combination of hardware and software. Taking a software implementation as an example, as a logical device, the device is formed by reading, by a processor of the electronic device where the device is located, a corresponding computer program instruction in the nonvolatile memory into the memory for operation. In terms of hardware, in addition to the processor, the memory, the network interface, and the nonvolatile memory shown in fig. 11, the electronic device in which the apparatus is located in the embodiment may also include other hardware according to an actual function of the electronic device, which is not described again.
It is to be understood that, in order to increase the processing speed, the device-corresponding instruction may also be directly stored in the memory, which is not limited herein.
The present application proposes a computer-readable storage medium, which stores a computer program, which can be used to cause a processor to perform the aforementioned neural network training method and/or image processing method.
One skilled in the art will recognize that one or more embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (which may include, but are not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
"and/or" in this application means having at least one of the two, for example, "a and/or B" may include three schemes: A. b, and "A and B".
The embodiments in the present application are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the data processing apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to part of the description of the method embodiment.
Specific embodiments of the present application have been described above. Other embodiments are within the scope of the following claims. In some cases, the acts or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may also be possible or may be advantageous.
Embodiments of the subject matter and functional operations described in this application may be implemented in the following: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware that may include the structures disclosed in this application and their structural equivalents, or combinations of one or more of them. Embodiments of the subject matter described in this application can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this application can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs may include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The basic components of a computer may include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not necessarily have such a device. Moreover, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data can include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disk or removable disks), magneto-optical disks, and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
Although this application contains many specific implementation details, these should not be construed as limiting the scope of any disclosure or of what may be claimed, but rather as merely describing features of particular disclosed embodiments. Certain features that are described in this application in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the described embodiments is not to be understood as requiring such separation in all embodiments, and it is to be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the present application and is not intended to limit the present application to the particular embodiments of the present application, and any modifications, equivalents, improvements, etc. made within the spirit and principles of the present application should be included within the scope of the present application.

Claims (17)

1. A neural network training method, comprising:
acquiring an image set; the image set includes acquired images respectively corresponding to a plurality of regions; under the condition that the same area corresponds to multiple frames of collected images, at least two frames of target images corresponding to the same area have different collection angles;
performing base region truth value information annotation on at least one frame of target acquisition image corresponding to each region in the plurality of regions respectively;
and for each region, determining the base region truth value information marked on the target acquisition image corresponding to the region as the base region truth value information of each frame acquisition image corresponding to the region to obtain a training sample set so as to carry out neural network training based on the training sample set.
2. The method of claim 1, further comprising:
acquiring the training sample set;
utilizing a building base extraction network to obtain a roof area and an offset corresponding to each collected image in the training sample set; wherein the offset characterizes an offset between the roof region and the base region;
based on the offset obtained for each collected image, carrying out translation transformation on the roof area corresponding to the offset to obtain base areas corresponding to each collected image;
and adjusting the network parameters of the building base extraction network based on the base area true value information corresponding to each acquired image and the base areas obtained for each acquired image.
3. The method of claim 1, further comprising:
marking base position truth value information of the target acquisition image corresponding to each region respectively;
and aiming at each region, determining the base position truth value information marked on the target acquisition image corresponding to the region as the base position truth value information of each frame acquisition image corresponding to the region to obtain a training sample set.
4. The method of claim 3, further comprising:
acquiring the training sample set;
acquiring a roof area, an offset and a roof position respectively corresponding to each acquired image in the training sample set by utilizing a roof area extraction network, an offset extraction network and a roof position extraction network which are included in a building base extraction network, wherein the offset represents the offset between the roof area and the base area;
adjusting network parameters of the roof area extraction network based on the base area true value information corresponding to each acquired image and the roof area and the offset obtained for each acquired image;
and adjusting network parameters of the roof position extraction network and the offset extraction network based on the base position truth value information corresponding to each acquired image and the roof position and the offset obtained for each acquired image.
5. The method of claim 4, wherein adjusting network parameters of the rooftop area extraction network based on the real-valued information of the base area corresponding to each of the collected images and the rooftop area and the offset obtained for each of the collected images comprises:
for each frame of image in each collected image, translating the truth value information of the base area by using the offset corresponding to the image to obtain the truth value information of a first roof area corresponding to the image;
obtaining region loss information corresponding to the image based on the first rooftop region truth information corresponding to the image and a rooftop region obtained for the image;
and adjusting the network parameters of the roof area extraction network through back propagation based on the area loss information corresponding to each acquired image.
6. The method according to claim 4 or 5, wherein the adjusting network parameters of the roof position extraction network and the offset extraction network based on the real-value information of the base position corresponding to each of the acquired images and the roof position and the offset obtained for each of the acquired images comprises:
for each frame of image in each collected image, translating the roof position corresponding to the image by using the offset corresponding to the image to obtain the base position corresponding to the image;
obtaining position loss information corresponding to the image based on the base position truth value information corresponding to the image and the base position obtained aiming at the image;
and adjusting network parameters of the roof position extraction network and the offset extraction network through back propagation based on the position loss information corresponding to each acquired image.
7. The method of any of claims 4-6, wherein the rooftop area extraction network, the offset extraction network, and the rooftop location extraction network share a feature extraction network.
8. The method of claim 7, wherein at least some of the collected images of the training sample set further annotate second rooftop region truth information, offset truth information, and rooftop location truth information;
the method further comprises at least one of:
adjusting network parameters of the roof region extraction network based on second roof region truth information labeled by the at least partially acquired image and a roof region obtained for the at least partially acquired image;
adjusting network parameters of the offset extraction network based on the offset truth information of the at least partially acquired image annotation and an offset obtained for the at least partially acquired image;
adjusting network parameters of the roof position extraction network based on the real-valued roof position information labeled by the at least part of the collected images and the obtained roof position of the at least part of the collected images.
9. The method of claim 8, wherein the at least partially acquired image is further annotated with building border truth information; the method further comprises the following steps:
extracting the building frame corresponding to the at least part of the acquired image by using a building frame extraction network included in the building base extraction network; wherein the building border extraction network comprises the feature extraction network;
adjusting network parameters of the building border extraction network based on the building border true value information of the at least partially captured image annotation and the building border obtained for the at least partially captured image.
10. The method according to any of claims 4-9, further comprising:
and pre-training the building base extraction network by utilizing the collected images marked with second roof area true value information, offset true value information and roof position true value information in the training sample set.
11. The method according to any one of claims 4-10, wherein the captured images in the training sample set further comprise first offset true information; the offset indicates an offset between the roof and the base in the captured image; the method further comprises the following steps:
obtaining offsets corresponding to the multiple preset angles respectively by using the offset extraction network; the multiple preset angles are used for rotating the collected image or the image characteristics corresponding to the collected image;
respectively rotating the first offset real information by the multiple preset angles to obtain second offset real information respectively corresponding to the multiple preset angles;
and adjusting the network parameters of the offset extraction network by using the second offset real information respectively corresponding to the plurality of preset angles and the obtained offsets.
12. The method of claim 11, wherein obtaining the offsets corresponding to the plurality of preset angles by using the offset extraction network comprises:
respectively rotating the first image features corresponding to the acquired images by multiple preset angles by using an offset extraction network to obtain second image features respectively corresponding to the multiple preset angles;
and obtaining offsets respectively corresponding to the multiple preset angles based on the second image characteristics.
13. An image processing method comprising:
receiving a remote sensing image to be processed;
extracting a building roof area and an offset in the remote sensing image to be processed by utilizing a building base extraction network; wherein the building base extraction network is trained by the neural network training method of any one of claims 1-12;
and carrying out translation transformation on the roof area by using the offset to obtain a building base area corresponding to the remote sensing image to be processed.
14. A neural network training device, comprising:
an acquisition module for acquiring an image set; the image set includes acquired images respectively corresponding to a plurality of regions; under the condition that the same area corresponds to multiple frames of collected images, at least two frames of target images corresponding to the same area have different collection angles;
the first labeling module is used for performing base region truth value information labeling on at least one frame of target acquisition image corresponding to each region in the plurality of regions;
the first determining module is configured to determine, for each region, the base region truth value information labeled in the target captured image corresponding to the region as the base region truth value information of each frame of the captured image corresponding to the region, to obtain a training sample set, and perform neural network training based on the training sample set.
15. An image processing apparatus comprising:
the receiving module is used for receiving the remote sensing image to be processed;
the extraction module is used for extracting a network by utilizing a building base and extracting a building roof area and an offset in the remote sensing image to be processed; wherein the building base extraction network is trained by the neural network training method of any one of claims 1-12;
and the translation module is used for carrying out translation transformation on the roof area by utilizing the offset to obtain a building base area corresponding to the remote sensing image to be processed.
16. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor implements the neural network training method according to any one of claims 1 to 12 and/or the image processing method according to claim 13 by executing the executable instructions.
17. A computer-readable storage medium, which stores a computer program for causing a processor to execute the neural network training method of any one of claims 1-12 and/or the image processing method of claim 13.
CN202110602248.5A 2021-05-31 2021-05-31 Neural network training and image processing method, device, equipment and storage medium Withdrawn CN113344180A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202110602248.5A CN113344180A (en) 2021-05-31 2021-05-31 Neural network training and image processing method, device, equipment and storage medium
PCT/CN2021/137544 WO2022252558A1 (en) 2021-05-31 2021-12-13 Methods for neural network training and image processing, apparatus, device and storage medium
TW111117626A TW202248910A (en) 2021-05-31 2022-05-11 Methods, apparatuses, devices and storage media for training neural networks and for processing images

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110602248.5A CN113344180A (en) 2021-05-31 2021-05-31 Neural network training and image processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113344180A true CN113344180A (en) 2021-09-03

Family

ID=77473204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110602248.5A Withdrawn CN113344180A (en) 2021-05-31 2021-05-31 Neural network training and image processing method, device, equipment and storage medium

Country Status (3)

Country Link
CN (1) CN113344180A (en)
TW (1) TW202248910A (en)
WO (1) WO2022252558A1 (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344180A (en) * 2021-05-31 2021-09-03 上海商汤智能科技有限公司 Neural network training and image processing method, device, equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200098141A1 (en) * 2018-09-21 2020-03-26 Revive AI, Inc. Systems and methods for home improvement visualization
CN110991491A (en) * 2019-11-12 2020-04-10 苏州智加科技有限公司 Image labeling method, device, equipment and storage medium
CN111931836A (en) * 2020-07-31 2020-11-13 上海商米科技集团股份有限公司 Method and device for acquiring neural network training image
CN112149585A (en) * 2020-09-27 2020-12-29 上海商汤智能科技有限公司 Image processing method, device, equipment and storage medium
CN112232425A (en) * 2020-10-21 2021-01-15 腾讯科技(深圳)有限公司 Image processing method, image processing device, storage medium and electronic equipment
CN112329559A (en) * 2020-10-22 2021-02-05 空间信息产业发展股份有限公司 Method for detecting homestead target based on deep convolutional neural network

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022252558A1 (en) * 2021-05-31 2022-12-08 上海商汤智能科技有限公司 Methods for neural network training and image processing, apparatus, device and storage medium
CN115096375A (en) * 2022-08-22 2022-09-23 启东亦大通自动化设备有限公司 Carrier roller running state monitoring method and device based on carrier roller carrying trolley detection
CN115096375B (en) * 2022-08-22 2022-11-04 启东亦大通自动化设备有限公司 Carrier roller running state monitoring method and device based on carrier roller carrying trolley detection

Also Published As

Publication number Publication date
TW202248910A (en) 2022-12-16
WO2022252558A1 (en) 2022-12-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40049327

Country of ref document: HK

WW01 Invention patent application withdrawn after publication

Application publication date: 20210903
