CN111968240B - Three-dimensional semantic annotation method of photogrammetry grid based on active learning - Google Patents

Info

Publication number
CN111968240B
Authority
CN
China
Prior art keywords: model, image, uncertainty, dimensional, semantic
Legal status: Active
Application number: CN202010919006.4A
Other languages: Chinese (zh)
Other versions: CN111968240A (en)
Inventor
荣梦琪
申抒含
胡占义
时天欣
朱灵杰
Current Assignee
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN202010919006.4A
Publication of CN111968240A
Application granted
Publication of CN111968240B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/20 Finite element generation, e.g. wire-frame surface description, tesselation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G06T19/20 Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts


Abstract

The invention belongs to the field of computer vision, and particularly relates to a three-dimensional semantic annotation method, system and device for photogrammetric meshes based on active learning, aiming at solving the poor annotation robustness of existing three-dimensional semantic annotation methods. The method comprises the following steps: acquiring city street view images to be annotated; obtaining the semantic segmentation result of each image; back-projecting the semantic segmentation results onto the three-dimensional mesh model to obtain an initial three-dimensional semantic network model; fusing the initial three-dimensional semantic network model; judging by the iteration count and by the number and proportion of inconsistent patch class labels; acquiring the uncertainty of each image; constructing new image sets, calculating the overall uncertainty and dispersion of each newly constructed image set, weighting them, and taking the image set corresponding to the minimum weighted average as the t-th image set; if t reaches a threshold, updating the semantic segmentation network; and obtaining the annotated three-dimensional semantic network model. The invention improves the robustness of three-dimensional semantic annotation.

Description

Three-dimensional semantic annotation method of photogrammetry grid based on active learning
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a three-dimensional semantic labeling method, a three-dimensional semantic labeling system and a three-dimensional semantic labeling device for a photogrammetric grid based on active learning.
Background
In recent years, traditional geometry-based three-dimensional reconstruction has reached a relatively mature stage, and off-the-shelf commercial and open-source three-dimensional reconstruction and photogrammetry software can generate large-scale urban models from large numbers of aerial images captured by drones. Many scholars are no longer satisfied with merely obtaining the structural information of a scene and have turned to the expression and understanding of three-dimensional scenes, including three-dimensional semantic annotation of large urban models, which tells us "what" is "where" on a photogrammetric three-dimensional mesh. Undoubtedly, city models carrying richer information can be better applied to smart cities, city planning, virtual reality, automatic driving, and the like.
Fully automatic labeling of large-scale three-dimensional mesh models follows two main ideas. The first is to perform semantic segmentation directly on the three-dimensional mesh or point cloud, but designing networks for three-dimensional data is much more complex than for two-dimensional images because point clouds are irregular, unstructured and unordered. Although some current deep-learning-based labeling methods have achieved significant results in three-dimensional object recognition and semantic segmentation, they still cannot handle such large-scale three-dimensional models well. In addition, deep learning faces a serious obstacle: the lack of a large amount of finely labeled training data. Especially for urban scenes, the number of patches in a mesh model can reach millions or more, which makes manual labeling seem an impossible task. Worse still, there is no good interactive software for labeling in three-dimensional space.
Since a photogrammetric mesh is obtained by image reconstruction, calibrated images are available in addition to the three-dimensional mesh model. The second idea is therefore to segment the two-dimensional images to obtain pixel-level semantic information, back-project the two-dimensional segmentation results onto the three-dimensional mesh model using the calibrated camera parameters, and then fuse them together to form a three-dimensional semantic model. In this approach the quality of the two-dimensional semantic segmentation is crucial, as it directly and strongly affects the performance of the three-dimensional semantic model. A general two-dimensional segmentation network would have to adapt to a wide variety of scenes, which is difficult or even impractical; in particular, for aerial scenes with large variation in ground objects, each scene needs a large number of labeled samples. Although fine-tuning the segmentation network can reduce the burden of manual labeling to some extent, it does not answer the question of which data samples should be selected for labeling to guarantee high-quality performance. On this basis, the invention provides a three-dimensional semantic annotation method for photogrammetric meshes based on active learning.
Disclosure of Invention
In order to solve the above problems in the prior art, that is, to solve the problem of poor labeling robustness caused by too many data samples to be labeled and difficult training of a semantic segmentation network in the conventional three-dimensional semantic labeling method, the invention provides a three-dimensional semantic labeling method of a photogrammetric mesh based on active learning, which comprises the following steps:
step S10, acquiring a city street view image to be annotated as an input image;
step S20, for each input image, obtaining a corresponding semantic segmentation result through a semantic segmentation network subjected to fine tuning training, and back projecting the corresponding semantic segmentation result onto a three-dimensional grid model through ray intersection to obtain an initial three-dimensional semantic network model as a first model;
step S30, for the first model, constructing smooth constraint of adjacent patch class labels, and fusing through a Markov random field to obtain a second model;
step S40, if the current iteration number is 1, executing step S50; otherwise, counting the number of inconsistent patch type labels corresponding to the currently obtained second model and the last obtained second model, if the ratio of the number to the total number of patches of the second model is greater than a set threshold value, skipping to the step S50, otherwise skipping to the step S100;
step S50, for each patch in the second model, according to the corresponding relation between the patch and each pixel point of the input image, and in combination with the confidence degree of the category label, obtaining the uncertainty of each input image through a preset first method;
step S60, when t is 1, acquiring an input image with the minimum uncertainty to construct a first image set;
step S70, let t = t + 1, construct a new image set from each of the remaining input images together with the (t-1)-th image set, and calculate the average uncertainty of the newly constructed image set as its overall uncertainty;
step S80, for each newly constructed image set, calculating the intersection-over-union of the areas of the patches its images cover in the second model as the dispersion of the image set, and carrying out a weighted average of the dispersion and the overall uncertainty; taking the newly constructed image set corresponding to the minimum weighted average as the t-th image set;
step S90, judging whether t reaches a set threshold value, if not, executing the steps S70 and S80 in a circulating way, otherwise, labeling the input images in the t-th image set, updating parameters of the semantic segmentation network based on all the labeled input images, and jumping to the step S20 after updating;
and S100, taking the currently obtained second model as a finally labeled three-dimensional semantic network model.
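For orientation, the iterative flow of steps S10 to S100 can be sketched in Python as below; all helper names (backproject, mrf_fuse, select_batch, finetune, human_label) are hypothetical placeholders for the operations described in the steps, not part of the patent.

```python
# A minimal sketch of the iterative loop of steps S10-S100, assuming
# hypothetical helpers for each of the described operations.

def annotate_mesh(images, mesh, seg_net, t_max, change_ratio_thresh):
    prev_labels = None
    while True:
        probs = [seg_net(img) for img in images]            # S20: 2D segmentation
        first_model = backproject(probs, mesh)              # S20: ray-intersection back-projection
        second_model = mrf_fuse(first_model, mesh)          # S30: MRF fusion with smoothness
        labels = second_model.patch_labels
        if prev_labels is not None:                         # S40: label-change ratio test
            changed = sum(a != b for a, b in zip(labels, prev_labels))
            if changed / len(labels) <= change_ratio_thresh:
                return second_model                         # S100: final labeled model
        prev_labels = labels
        batch = select_batch(images, second_model, t_max)   # S50-S80: active batch selection
        seg_net = finetune(seg_net, human_label(batch))     # S90: update segmentation network
```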
In some preferred embodiments, the fine-tuning training method of the semantic segmentation network is:
step A10, randomly selecting a set number of city street view images from a first training sample set for labeling; the first training sample set is a data set constructed by all city street view images;
step A20, if the current iteration number is 1, fine tuning the first network by adopting a randomly selected and labeled city street view image, otherwise, fine tuning the first network based on the city street view image in the second training sample set; the first network is based on a DeepLab network trained by a Cityscapes data set; the second training sample set is a data set constructed by marked city street view images;
step A30, replacing the fine-tuned classification layer of the preset category of the first network with a softmax layer to obtain a second network;
step A40, based on each city street view image in the first training sample set, and in combination with the second network, obtaining a marked t image set by the method of steps S20-S80;
and step A50, adding the labeled tth image set into a second training sample set which is constructed in advance, and jumping to step A20.
In some preferred embodiments, the smooth constraint of adjacent patch class labels in step S30 is constructed as:
V_{f1,f2}(l_{f1}, l_{f2}) = 0 if l_{f1} = l_{f2}, and exp(−s·‖g_{f1} − g_{f2}‖) otherwise
g_f = [k_min·w_min ; k_max·w_max]
wherein f1, f2 denote two adjacent patches, l_{f1}, l_{f2} denote the class labels corresponding to the two patches, g_f denotes the 6×1 geometry vector of a patch, k_min, k_max denote the minimum and maximum principal curvatures of the patch, w_min, w_max denote the principal direction vectors of the principal curvatures of the patch, and s denotes the set scale factor.
In some preferred embodiments, step S30, "fusion by Markov random field, to obtain the second model", includes:
constructing an energy function of likelihood data items and the smooth constraint; the likelihood data items are likelihood distribution of each patch and a corresponding category label in the initial three-dimensional semantic network model;
and calculating the global optimal value of the energy function by using a graph cut algorithm alpha-expansion to obtain a cut result of each adjacent patch, and combining the first model to obtain a second model.
In some preferred embodiments, the "constructing the energy function of the likelihood data item and the smoothing constraint" is performed by:
E(l)=∑f∈FDf(lf)+λ∑(f,q)Vf,q(lf,lq)
wherein E (l) represents an energy function (l)f,lq) Representing a patch F, a class label corresponding to q, F representing the set of all patches,
Figure BDA0002666013570000051
f represents a patch in the three-dimensional semantic network model,
Figure BDA0002666013570000052
indicating that patch f belongs to each class lfλ represents a preset balance factor.
In some preferred embodiments, in step S50, "the uncertainty of each input image is obtained by a preset first method", which is:
obtaining the label uncertainty of the pixel points in each input image according to the following formula:
u_p = 1 − c_f if pixel p corresponds to a patch f ∈ F in the second model, and u_p = 1 otherwise
wherein u_p denotes the uncertainty of pixel point p of the city street view image, c_f denotes the label confidence of the patch f in the second model onto which p projects, and F denotes the set of all patches in the second model;
and summing the uncertainties of the pixel points of each input image to obtain the uncertainty of each image.
In a second aspect of the present invention, a three-dimensional semantic annotation system for photogrammetric grids based on active learning is provided, the system comprising: the system comprises an image acquisition module, a semantic segmentation module, a fusion module, an iteration judgment module, an uncertainty acquisition module, an image set construction module, an overall uncertainty acquisition module, a dispersion acquisition module, a circulation module and an output module;
the image acquisition module is configured to acquire a city street view image to be marked as an input image;
the semantic segmentation module is configured to obtain a corresponding semantic segmentation result of each input image through a pre-trained semantic segmentation network, and back-project the corresponding semantic segmentation result to the three-dimensional mesh model through ray intersection to obtain an initial three-dimensional semantic network model as a first model;
the fusion module is configured to construct smooth constraints of the class labels of adjacent patches of the first model, and fuse the smooth constraints through a Markov random field to obtain a second model;
the iteration judging module is configured to execute the uncertainty acquisition module if the current iteration number is 1; otherwise, to count the number of inconsistent patch class labels between the currently obtained second model and the last obtained second model, and if the ratio of the number to the total number of patches of the second model is greater than a set threshold, to jump to the uncertainty acquisition module, otherwise to jump to the output module;
the uncertainty acquisition module is configured to acquire uncertainty of each input image through a preset first method according to a corresponding relation between each patch in the second model and each input image pixel point and a confidence coefficient of a category label of each patch;
the image set construction module is configured to acquire an input image with the minimum uncertainty to construct a first image set when t is 1;
the overall uncertainty acquisition module is configured to let t = t + 1, construct a new image set from each remaining input image together with the (t-1)-th image set, and calculate the average uncertainty of the newly constructed image set as its overall uncertainty;
the dispersion acquisition module is configured to calculate, for each newly constructed image set, the intersection-over-union of the areas of the patches its images cover in the second model as the dispersion of the image set, and to carry out a weighted average of the dispersion and the overall uncertainty; the newly constructed image set corresponding to the minimum weighted average is taken as the t-th image set;
the circulation module is configured to judge whether t reaches a set threshold value, if not, the integral uncertainty acquisition module and the dispersion acquisition module are executed in a circulation mode, otherwise, the input images in the t-th image set are labeled;
and the output module is configured to take the currently obtained second model as a finally labeled three-dimensional semantic network model.
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, and the programs are loaded and executed by a processor to implement the above-mentioned three-dimensional semantic annotation method based on the photogrammetric mesh for active learning.
In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above-described three-dimensional semantic labeling method for active learning-based photogrammetry grids.
The invention has the beneficial effects that:
the invention improves the robustness of three-dimensional semantic annotation. Firstly, a semantic segmentation network pre-trained in a Cityscapes data set is subjected to fine tuning, a probability graph output by the semantic segmentation network is projected onto a three-dimensional grid model after fine tuning for fusion, and a three-dimensional semantic grid model with each patch having a semantic label and a thermal model showing the confidence coefficient of each patch are output.
Secondly, the invention provides a complete framework aiming at the semantic modeling of the grid of the large-scale urban scene. Under the inspiration of active learning, an iterative algorithm is used for providing a labeling suggestion for selecting the most effective unlabeled sample for labeling, so that the quality of semantic segmentation is greatly improved, and the quality of a three-dimensional semantic grid model is further improved. The labeling suggestion consists of uncertainty and most representative (dispersion), and simultaneously considers the consistency of two-dimensional segmentation results and three-dimensional geometry. The method only needs less manual labeling, but does not reduce the labeling quality of the three-dimensional model, greatly reduces the difficulty of the semantic segmentation network, and improves the robustness of the three-dimensional semantic labeling.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.
FIG. 1 is a schematic flowchart of a three-dimensional semantic labeling method for photogrammetric grids based on active learning according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a three-dimensional semantic annotation system for photogrammetry grids based on active learning according to an embodiment of the invention;
FIG. 3 is a simplified flowchart of a three-dimensional semantic labeling method for photogrammetric grids based on active learning according to an embodiment of the present invention;
FIG. 4 is a schematic representation of a two-dimensional image projected back onto a three-dimensional mesh model according to one embodiment of the present invention;
FIG. 5 is a schematic diagram of a scene texture map corresponding to an acquired Urban1, Urban2 dataset according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a verification result obtained at Urban1 by the three-dimensional semantic annotation method based on photogrammetric mesh with active learning according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a verification result obtained at Urban2 by the three-dimensional semantic annotation method based on photogrammetric mesh with active learning according to an embodiment of the present invention;
FIG. 8 is a schematic illustration of the three-dimensional semantic models and thermal models generated on the Urban1 and Urban2 datasets by a method based on randomly selected images, according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a three-dimensional semantic model and thermal model generated on Urban1, Urban2 datasets by a random forest based method of an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
A three-dimensional semantic annotation method for photogrammetry grids based on active learning according to a first embodiment of the present invention is shown in fig. 1, and the method includes the following steps:
step S10, acquiring a city street view image to be annotated as an input image;
step S20, for each input image, obtaining a corresponding semantic segmentation result through a semantic segmentation network subjected to fine tuning training, and back projecting the corresponding semantic segmentation result onto a three-dimensional grid model through ray intersection to obtain an initial three-dimensional semantic network model as a first model;
step S30, for the first model, constructing smooth constraint of adjacent patch class labels, and fusing through a Markov random field to obtain a second model;
step S40, if the current iteration number is 1, executing step S50; otherwise, counting the number of inconsistent patch type labels corresponding to the currently obtained second model and the last obtained second model, if the ratio of the number to the total number of patches of the second model is greater than a set threshold value, skipping to the step S50, otherwise skipping to the step S100;
step S50, for each patch in the second model, according to the corresponding relation between the patch and each pixel point of the input image, and in combination with the confidence degree of the category label, obtaining the uncertainty of each input image through a preset first method;
step S60, when t is 1, acquiring an input image with the minimum uncertainty to construct a first image set;
step S70, let t = t + 1, construct a new image set from each of the remaining input images together with the (t-1)-th image set, and calculate the average uncertainty of the newly constructed image set as its overall uncertainty;
step S80, for each newly constructed image set, calculating the intersection-over-union of the areas of the patches its images cover in the second model as the dispersion of the image set, and carrying out a weighted average of the dispersion and the overall uncertainty; taking the newly constructed image set corresponding to the minimum weighted average as the t-th image set;
step S90, judging whether t reaches a set threshold value, if not, executing the steps S70 and S80 in a circulating way, otherwise, labeling the input images in the t-th image set, updating parameters of the semantic segmentation network based on all the labeled input images, and jumping to the step S20 after updating;
and S100, taking the currently obtained second model as a finally labeled three-dimensional semantic network model.
In order to more clearly describe the three-dimensional semantic annotation method of the photogrammetric mesh based on active learning of the present invention, the following is a detailed description of each step in an embodiment of the method of the present invention.
In the following embodiment, the fine-tuning training process of the semantic segmentation network is detailed first, followed by the process of acquiring the labeled three-dimensional semantic network model with the active-learning-based photogrammetric mesh three-dimensional semantic annotation method.
1. Fine tuning training process for semantic segmentation network
Since the method of the present invention is based on the results of two-dimensional image segmentation, a semantic segmentation network with good performance is of paramount importance. However, good segmentation usually requires a large number of manually labeled images, and fine labeling usually needs to be done by professionals and is time-consuming, so it is important to obtain good results with as few labeled images as possible. Here we introduce active learning, which makes our method iterative: in each iteration, the most valuable images to label are selected according to the output three-dimensional semantic model, forming a continuously expanding labeled data set, which is then used to train and improve the neural network. The specific process is as follows:
step A10, randomly selecting a set number of city street view images from a first training sample set for labeling; the first training sample set is a data set constructed by all city street view images;
in this embodiment, the first training sample set is a data set constructed by all city street view images, and all images in the first training sample set are unlabeled images.
Step A20, if the current iteration number is 1, fine tuning the first network by adopting a randomly selected and labeled city street view image, otherwise, fine tuning the first network based on the city street view image in the second training sample set; the first network is based on a DeepLab network trained by a Cityscapes data set; the second training sample set is a data set constructed by marked city street view images.
In this embodiment, different city street view images are selected for labeling according to the number of iterations, and fine-tuning training is performed on a semantic segmentation network pre-trained on the Cityscapes data set. The semantic segmentation network preferably uses the DeepLabv3+ network as the first network in the present invention.
If the current iteration number is 1, randomly selecting a set number of city street view images from the first training sample set for labeling, and finely adjusting the first network based on the labeled images (namely, labeled samples). Otherwise, selecting high-quality images from the first training sample set for labeling according to the active learning method, constructing a second training sample set, and finely adjusting the first network based on the images in the second training sample set.
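As a minimal illustration of this branch in step A20 (the function name, the sample count k, and the two pools are hypothetical placeholders):

```python
import random

def pick_finetune_images(iteration, first_pool, second_set, k):
    # Step A20: the first iteration fine-tunes on k randomly selected
    # (then manually labeled) images; later iterations reuse the
    # actively constructed second training sample set.
    if iteration == 1:
        return random.sample(first_pool, k)
    return second_set
```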
Step A30, replacing the fine-tuned classification layer of the preset category of the first network with a softmax layer to obtain a second network;
in this embodiment, a classification layer (output is a preset category) of the last layer of the first network is modified into a classification layer that outputs each category probability corresponding to each pixel, that is, a softmax layer, and the modified first network is used as the second network.
Step A40, constructing a three-dimensional semantic model based on city street view images in a first training sample set and in combination with a second network, and selecting training samples from the first training sample set for marking;
For general semantic segmentation problems and other deep learning algorithms, using more labeled data to improve generalization is the straightforward approach. In our scenario, however, what we want is an accurate three-dimensional labeling result on each photogrammetric mesh, and because the flying height, terrain category, imaging conditions, and so on of each scene differ and may even change greatly, it is more reasonable to train a specific segmentation model for each specific scene. In addition, our experiments show that combining different city data sets into one training set to fine-tune a semantic segmentation network makes it difficult to achieve high accuracy on two scenes simultaneously. Therefore, for each reconstructed scene, we train a separate semantic segmentation network to best fit the current scene images, which makes our active selection strategy all the more important, since it minimizes the number of manual annotations. The specific process is as follows:
step A41, for each city street view image in the first training sample set, obtaining a corresponding semantic segmentation result through a second network, and back projecting the corresponding semantic segmentation result onto a three-dimensional grid model through ray intersection to obtain an initial three-dimensional semantic network model as a first model;
In this embodiment, 2D-3D fusion is performed on each city street view image in the first training sample set, and the fused three-dimensional semantic network model is used as the first model. In the 2D-3D fusion, the second network segments all images to obtain a semantic segmentation result for each image, i.e., the probability that each image pixel belongs to each possible class. The segmentation yields a semantic probability map, in the form of pixel-level likelihoods over all possible classes: a vector d_p whose dimension is the number of classes. Then, according to the calibrated camera parameters (intrinsics and extrinsics) corresponding to each city scene image, the correspondence between image pixels and mesh-model patches is computed through ray intersection. In other words, for each patch f of the mesh model (f ∈ F, where F denotes the set of all patches) we obtain which pixel regions of which images back-project onto it, as shown in fig. 4, where R is the rotation parameter of the camera and t is the translation parameter; each image has its own camera parameters, denoted by the subscripts 1, 2 and 3, respectively. The simplest weighted-average method is then used to integrate the per-class scores of each pixel and assign each patch a most likely category l_f. After the above processing, we obtain an initial three-dimensional semantic mesh model whose patch likelihood distribution d_f is shown in equation (1):
d_f = ( ∑_{i∈I} ∑_{p∈Ω_{f,i}} d_p ) / ( ∑_{i∈I} |Ω_{f,i}| )    (1)
wherein I denotes the first preset training sample set, Ω_{f,i} denotes the set of pixels onto which patch f projects in image i, and ∑_{i∈I} |Ω_{f,i}| denotes the total number of projected pixels of patch f over the first preset training sample set.
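A sketch of the 2D-3D fusion of equation (1) in Python, assuming the pixel-to-patch correspondences from ray intersection are given as per-image index maps; all names are illustrative.

```python
import numpy as np

def fuse_patch_likelihoods(pixel_probs, pixel_to_patch, num_patches, num_classes):
    """Average per-pixel class likelihoods d_p over all pixels projecting
    onto each patch f, giving the patch distribution d_f of equation (1).

    pixel_probs:    list over images of (H, W, num_classes) softmax maps
    pixel_to_patch: list over images of (H, W) int arrays, -1 = no patch hit
    """
    d_f = np.zeros((num_patches, num_classes))
    counts = np.zeros(num_patches)
    for probs, hits in zip(pixel_probs, pixel_to_patch):
        valid = hits >= 0
        np.add.at(d_f, hits[valid], probs[valid])   # accumulate d_p per patch
        np.add.at(counts, hits[valid], 1.0)         # total projected pixels per patch
    nonzero = counts > 0
    d_f[nonzero] /= counts[nonzero][:, None]        # each row now sums to ~1
    return d_f
```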
Step A42, constructing smooth constraints of adjacent patch class labels of the initial three-dimensional semantic network model, fusing through a Markov random field, and taking the fused initial three-dimensional semantic network model as a second model;
the initial three-dimensional semantic model is a coarse three-dimensional semantic model, and there may be many single patches or patches that are misclassified, especially at the junction of two correspondences, resulting in an unsmooth three-dimensional semantic model.
Therefore, we introduce a smoothing constraint to optimize the assignment of patch labels, which means that adjacent patches are likely to be assigned the same label, except for those patches located at boundaries where the local differential geometry changes greatly. Given two adjacent patches f1, f2 with corresponding class labels l_{f1}, l_{f2}, the constructed smoothing constraint V_{f1,f2}(l_{f1}, l_{f2}) is shown in equations (2) and (3):
V_{f1,f2}(l_{f1}, l_{f2}) = 0 if l_{f1} = l_{f2}, and exp(−s·‖g_{f1} − g_{f2}‖) otherwise    (2)
g_f = [k_min·w_min ; k_max·w_max]    (3)
wherein g_f denotes a 6×1 vector, k_min, k_max denote the minimum and maximum principal curvatures of the patch, w_min, w_max denote the principal direction vectors of the principal curvatures of the patch, and s denotes the set scale factor.
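A small sketch of the geometry descriptor of equation (3) and the pairwise cost of equation (2), following the reconstruction above (the exp(−s·‖·‖) form of the cost is an assumption):

```python
import numpy as np

def geometry_vector(k_min, k_max, w_min, w_max):
    # 6x1 descriptor of equation (3): the principal directions scaled by
    # their principal curvatures (k_min, k_max scalars; w_min, w_max 3-vectors).
    return np.concatenate([k_min * w_min, k_max * w_max])

def smoothness_cost(label1, label2, g1, g2, s):
    # Equation (2): no cost for equal labels; otherwise a cost that is
    # large where adjacent geometry is similar and small across sharp
    # geometric boundaries, so label changes align with creases.
    if label1 == label2:
        return 0.0
    return float(np.exp(-s * np.linalg.norm(g1 - g2)))
```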
The assignment of patch class labels can be regarded as an energy minimization problem: an energy function of a likelihood data term and the smoothing constraint is constructed, where the likelihood data term is the likelihood distribution of each patch over the class labels in the initial three-dimensional semantic network model. The constructed energy function is shown in equation (4):
E(l) = ∑_{f∈F} D_f(l_f) + λ·∑_{(f,q)} V_{f,q}(l_f, l_q)    (4)
wherein E(l) denotes the energy function, l_f, l_q denote the class labels corresponding to patches f and q, D_f(l_f) measures the likelihood of assigning label l_f to patch f, derived from the likelihood d_f(l_f) that patch f belongs to each category, and λ denotes a preset balance factor.
The global optimum of the energy function is computed with the α-expansion graph-cut algorithm to obtain the cut result for each pair of adjacent patches, and the second model is obtained in combination with the first model.
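The patent minimizes E(l) with α-expansion, which requires a graph-cut solver; for a self-contained illustration, the sketch below evaluates the same energy but refines labels with iterated conditional modes (ICM), a simple stand-in optimizer rather than the claimed algorithm.

```python
import numpy as np

def icm_fuse(d_f, edges, edge_costs, lam, iters=5):
    """Greedy local minimization of E(l) = sum_f D_f(l_f) + lam * sum V.

    d_f:        (P, C) fused per-patch likelihoods; D_f(l) taken as 1 - d_f[l]
    edges:      list of (f, q) adjacent-patch index pairs
    edge_costs: dict mapping a sorted (f, q) pair to the base smoothness cost
                charged when the two patches take different labels
    """
    P, C = d_f.shape
    labels = d_f.argmax(axis=1)                   # initial first-model labels
    nbrs = [[] for _ in range(P)]
    for (f, q) in edges:
        nbrs[f].append(q)
        nbrs[q].append(f)
    for _ in range(iters):
        for f in range(P):
            best, best_e = labels[f], np.inf
            for l in range(C):
                e = 1.0 - d_f[f, l]               # data term D_f(l)
                for q in nbrs[f]:                 # pairwise terms around f
                    if labels[q] != l:
                        e += lam * edge_costs[(min(f, q), max(f, q))]
                if e < best_e:
                    best, best_e = l, e
            labels[f] = best
    return labels
```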
Each patch of the second model has a semantic label with a confidence, a value from 0 to 1 characterizing the reliability of the given label. All these confidences can be transformed through jet (a variant of HSV (hue, saturation, value); a MATLAB-predefined color map matrix) to generate a thermal model, converting the probabilities into RGB channels (i.e., the 0-1 confidence value of each patch is mapped to three-channel RGB values, where the color standard used is the jet color table).
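The confidence-to-color conversion can be done, for example, with Matplotlib's predefined jet color map (assumed available):

```python
import numpy as np
from matplotlib import cm

def thermal_colors(confidences):
    # Map per-patch label confidences in [0, 1] to three-channel RGB
    # values through the jet color table, yielding the thermal model.
    return cm.jet(np.asarray(confidences))[..., :3]
```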
Step A43, if the current iteration number is 1, executing step A44; if not, counting the number of inconsistent patch type labels of the currently obtained second model and the second model obtained last time, if the ratio of the number to the total number of patches of the second model is greater than a set threshold value, skipping to the step A44, otherwise, outputting the currently obtained second model as a labeled three-dimensional semantic network model;
In this embodiment, the class labels of the patches of each newly constructed second model are compared with those of the previously constructed second model and the number of inconsistencies is counted. If the ratio of this number to the total patch number of the second model is smaller than the threshold, indicating that the semantic segmentation network has become stable, the currently obtained second model is taken as the finally labeled three-dimensional semantic network model and training ends. Otherwise, the active learning method of the invention continues to select samples for labeling and trains the semantic segmentation network. The above is only one preference of the present invention; in other embodiments, the stopping condition may be adjusted according to actual needs.
Step A44, for each patch in the second model, according to the correspondence between the patch and the pixel points of each city street view image in the first preset training sample set, and combining the label confidence of the patch, acquiring the uncertainty of each city street view image pixel point through a preset first method; summing the uncertainties corresponding to the pixel points of each image to obtain the uncertainty of each image;
since the optimized three-dimensional semantic mesh model (the second model) combines two-dimensional semantic segmentation and three-dimensional geometric information, it can be used as a more reliable monitor to measure the segmentation quality and help us determine the next batch of training images to achieve a high quality representation. In order to make the best use of the information provided by the three-dimensional semantic mesh model, a selection proposal is provided, which combines the uncertainty of the images (directly measure the segmentation quality of each image) and the similarity between the images (approximate the problem as the maximum subset coverage problem).
Image uncertainty: the most straightforward way to find the most valuable labeling areas is uncertainty sampling, i.e., the lowest-scoring subset. The uncertainty of each pixel is not judged by the accuracy of the two-dimensional semantic segmentation network; instead, the three-dimensional semantic mesh model is used as supervision by re-projecting the class label of each patch and its confidence onto the two-dimensional image. The uncertainty of each image is then the sum over all its pixels. Specifically:
for a pixel p in image i, if it is visible on the f-th patch of the second model (i.e., the pixel point has a corresponding patch on the model), its uncertainty is defined as 1 minus the confidence c_f of the class label of that patch; if no patch corresponds to it, meaning that the label of the pixel has no significance for the generation of the three-dimensional semantic mesh model, its uncertainty is set to 1, as shown in equation (5):
u_p = 1 − c_f if p corresponds to a patch f ∈ F, and u_p = 1 otherwise    (5)
wherein u_p denotes the uncertainty of pixel point p of the city street view image and c_f denotes the label confidence of patch f in the second model.
The uncertainties of the pixel points of an image are summed to obtain the uncertainty of the image. This is a simplification rather than being entirely rigorous, since even confidently labeled pixels still have some positive impact on the performance of the segmentation network.
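A sketch of equation (5) and the per-image summation, reusing the hypothetical pixel-to-patch index map from the fusion step:

```python
import numpy as np

def image_uncertainty(pixel_to_patch, patch_confidence):
    """Equation (5): pixels that project onto a patch get uncertainty
    1 - c_f from that patch's label confidence; pixels hitting no patch
    (index -1) get uncertainty 1. The image score is the sum over pixels.

    pixel_to_patch:   (H, W) int array of patch indices, -1 = no hit
    patch_confidence: (P,) array of per-patch label confidences c_f
    """
    u = np.ones(pixel_to_patch.shape)
    hit = pixel_to_patch >= 0
    u[hit] = 1.0 - patch_confidence[pixel_to_patch[hit]]
    return u.sum()
```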
Step A45, when t is 1, acquiring an input image with minimum uncertainty to construct a first image set;
In this embodiment, when t = 1, the city street view image with the smallest uncertainty obtained in step A44 is selected to construct the first image set, and its uncertainty is taken as the overall uncertainty of the first image set.
Step A46, let t = t + 1; construct a new image set from each image remaining in the first training sample set together with the (t-1)-th image set, and calculate the average uncertainty of the newly constructed image set as its overall uncertainty.
In this embodiment, each remaining image in the first preset training sample set forms a candidate image set together with the images of the previous image set, and the average uncertainty over the candidate set is computed as the overall uncertainty of the currently constructed image set.
The overall uncertainty is obtained as shown in equation (6):
U(I_S) = (1/|I_S|) · ∑_{i∈I_S} u_i    (6)
wherein I_S denotes the newly constructed image set, U(I_S) denotes the overall uncertainty of the current image set, and u_i denotes the uncertainty of the i-th city street view image in the constructed image set.
Step A47, for each newly constructed image set, calculating the intersection-over-union of the areas of the patches its images cover in the second model as the dispersion of the image set, and carrying out a weighted average of the dispersion and the overall uncertainty; the newly constructed image set corresponding to the minimum weighted average is taken as the t-th image set.
Using uncertainty alone as a measure, it is easy to select a collection of images that are clustered together and highly similar; labeling such a set has little significance and should be avoided as much as possible. The selected region is expected to contain as many useful features of the unlabeled images as possible. In our method, we use the coverage area as another measure, i.e., how many patches the image subset can see, as shown in equation (7):
C(I_S) = |F_{I_S}| / |F|    (7)
wherein F_{I_S} denotes the set of patches that the images in I_S can cover in total.
Since in our problem a small batch of images cannot cover all patches in the mesh model, the coverage never reaches 1. To measure consistency, which has also proven feasible, we redefine the coverage term so that it is evenly distributed between 0 and 1. The improved dispersion C̃(I_S) is shown in equation (8):
C̃(I_S) = F_∩ / F_∪    (8)
wherein F_∩ and F_∪ respectively denote the intersection and union of the areas of the patches covered in the second model by the images of the newly constructed image set, and C̃(I_S) denotes the dispersion of the newly constructed image set. The first image set contains only the single image selected in the current iteration, for which dispersion does not apply and is taken as 0, since a single image has no notion of intersection.
The overall uncertainty and dispersion of each candidate image set are weighted and summed; the image set corresponding to the minimum weighted sum is selected as the image set to be labeled, and its images are labeled, as shown in equation (9):
I_S* = argmin_{I_S} ( U(I_S) + β·C̃(I_S) )    (9)
wherein β denotes a preset constraint balance factor.
Step A48, judging whether t reaches the set threshold; if not, executing steps A46 and A47 in a loop; otherwise, labeling the input images in the t-th image set, adding them to the second training sample set, jumping to step A20, and training the semantic segmentation network in a loop.
In each annotation proposal, considering the efficiency of training the network, we aim to select several images from the unlabeled ones that together have low uncertainty and a large coverage area (dispersion). This optimization problem is in fact NP-hard, and the only practical solution is a greedy algorithm: iteratively select one image at a time so that the target value is minimized, until the selected images reach the set number. Note that the first selection is special: the initial subset I_S is an empty set, so we simply select the image with the lowest uncertainty, regardless of coverage.
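A sketch of this greedy selection under the reconstruction of equations (6) to (9), interpreting the dispersion as the IoU of the patch sets covered by the candidate images (function and variable names are illustrative):

```python
def greedy_select(uncertainty, covered, pool, t_max, beta):
    """Greedy batch selection following equations (6)-(9).

    uncertainty: dict image -> scalar uncertainty score
    covered:     dict image -> set of patch indices the image sees
    """
    selected = []
    while len(selected) < t_max:
        best_img, best_score = None, float("inf")
        for img in pool:
            if img in selected:
                continue
            cand = selected + [img]
            u = sum(uncertainty[i] for i in cand) / len(cand)   # eq. (6)
            if len(cand) == 1:
                c = 0.0                    # single image: dispersion taken as 0
            else:
                inter = set.intersection(*(covered[i] for i in cand))
                union = set.union(*(covered[i] for i in cand))
                c = len(inter) / len(union)                     # eq. (8), IoU
            score = u + beta * c                                # eq. (9)
            if score < best_score:
                best_img, best_score = img, score
        selected.append(best_img)
    return selected
```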
2. Three-dimensional semantic annotation method of photogrammetry grid based on active learning
Step S10, acquiring a city street view image to be annotated as an input image;
in this embodiment, a city street view image to be annotated is obtained first.
Step S20, for each input image, obtaining a corresponding semantic segmentation result through a semantic segmentation network subjected to fine tuning training, and back projecting the corresponding semantic segmentation result onto a three-dimensional grid model through ray intersection to obtain an initial three-dimensional semantic network model as a first model;
in this embodiment, the semantic result corresponding to each to-be-labeled city street view image, that is, the probability that each pixel belongs to all possible categories, is obtained based on the semantic segmentation network trained by the training method. And acquiring camera parameters corresponding to the city street view image to be labeled, and back projecting a semantic segmentation result corresponding to the camera parameters onto the three-dimensional grid model through ray intersection to obtain an initial three-dimensional semantic network model serving as a first model.
Step S30, for the initial three-dimensional semantic network model, constructing smooth constraint of adjacent patch class labels, and fusing through a Markov random field to obtain a second model;
step S40, if the current iteration number is 1, executing step S50; otherwise, counting the number of inconsistent patch type labels corresponding to the currently obtained second model and the last obtained second model, if the ratio of the number to the total number of patches of the second model is greater than a set threshold value, skipping to the step S50, otherwise skipping to the step S100;
step S50, for each patch in the second model, according to the corresponding relation between the patch and each pixel point of the input image, and in combination with the confidence degree of the category label, obtaining the uncertainty of each input image through a preset first method;
step S60, when t is 1, acquiring an input image with the minimum uncertainty to construct a first image set;
step S70, let t = t + 1, construct a new image set from each of the remaining input images together with the (t-1)-th image set, and calculate the average uncertainty of the newly constructed image set as its overall uncertainty;
step S80, for each newly constructed image set, calculating the intersection-over-union of the areas of the patches its images cover in the second model as the dispersion of the image set, and carrying out a weighted average of the dispersion and the overall uncertainty; taking the newly constructed image set corresponding to the minimum weighted average as the t-th image set;
in the present embodiment, steps S30-S80 correspond to steps A42-A47 in the training process, and are not described in detail herein.
Step S90, judging whether t reaches a set threshold value, if not, executing the steps S70 and S80 in a circulating way, otherwise, labeling the input images in the t-th image set, updating parameters of the semantic segmentation network based on all the labeled input images, and jumping to the step S20 after updating;
and S100, taking the currently obtained second model as a finally labeled three-dimensional semantic network model.
In this embodiment, when it is detected that the ratio of the first number to the total number of patches of the second model is smaller than a set threshold, the second model obtained at present is output as the labeled three-dimensional semantic network model. The first quantity is the quantity that the corresponding patch type labels of the currently obtained second model and the last obtained second model are inconsistent.
In addition, to verify the effectiveness of the present invention, the method of the present invention was evaluated on two large urban scene data, which were all collected by us, since there was no open dataset for the finely labeled urban scene. For both datasets Urban1, Urban2, three-dimensional mesh models were generated using the off-the-shelf three-dimensional reconstruction systems openMVG and openMVS. For Urban1, the mesh model contained 4908453 patches, which were reconstructed from 2817 calibration images at 999 × 1401 resolution. Compared to Urban1, there are fewer images for Urban2, but at a higher resolution of 2000 x 3004, with 2993761 patches in the mesh model.
Fig. 5(a) and (c) show the texture maps of the two scenes. Urban1 is more complex, consisting of urban arterial roads, several office buildings, a large construction site and many patches of green vegetation, whereas Urban2 is relatively simple, containing only urban main roads, residential areas and a large square. However, Urban2 was shot in the daytime of autumn and winter and is more seriously affected by lighting, which brings some trouble to data labeling and network training. Given the consistency of the city objects and the necessity of their semantic labeling, we define only four semantic classes on these two datasets: roads, vegetation, buildings and unlabeled. Further, for quantitative evaluation, Urban1 was manually labeled as ground truth, as shown in fig. 5(b).
To evaluate the feasibility of our algorithm, we first tested it on the two data sets. For Urban1, we start with 5 randomly selected images, and in each subsequent iteration we select the 5 most valuable images that can significantly improve the quality of the segmentation network and add them to the semantic modeling. After four iterations, the semantic mesh model reaches a relatively stable level and the thermal model also becomes smoother, as shown in FIG. 6, where (c) shows the area covered by the 5 images selected by our method, accounting for only a small portion of the entire scene. For Urban2, however, the scene is small, the features are simple and regular, and the area covered by a single image is relatively large, as shown in fig. 7(c). Thus in each iteration only three images worth annotating are selected; the other processing is the same as for Urban1. The experiment was run for four iterations, with the results shown in Table 1:
TABLE 1
Iterations 2D_Seg_Acc Number/Percentage 3D_Seg_Acc
iter1 0.8894 ---- 0.7036
iter2 0.8886 770108/0.1569 0.8168
iter3 0.8633 317575/0.0647 0.8485
iter4 0.8351 164582/0.0335 0.8739
In Table 1, Iterations indicates the number of iterations, 2D_Seg_Acc indicates the segmentation accuracy of the semantic segmentation network on the two-dimensional training sample images, Number/Percentage indicates the number/percentage of patches whose labels changed (became inconsistent) compared with the previous iteration, and 3D_Seg_Acc indicates the segmentation accuracy on the three-dimensional mesh.
From Table 1 it can be seen that the three images selected in iter3 substantially coincide with the previous ones, indicating that the model does not have much room left for improvement, and the results of iter4 confirm this.
Then, to verify the efficiency of our proposed annotation proposal strategy, we compared the method of the present invention with a completely randomly chosen image generation method. The results are shown in FIG. 8.
We also compared it with the random-forest-based classifier provided by CGAL (reference: "CGAL, https://doc.cgal.org/latest/Classification/index.html"), which learns geometric features of primitives for classification. That method, however, mainly targets point cloud semantic classification; although an interface is provided for mesh semantic annotation, no interactive software is provided for generating labeled mesh data in the specified format. Therefore, we manually labeled the point clouds of three typical urban objects for training. The segmentation results are shown in fig. 9 and are much coarser than ours.
For Urban1, 20 images were randomly selected. The semantic model generated from these randomly selected images contains more wrong labels, with a 3D_Seg_Acc of only 0.7837 (lower than the result of our iter2); for example, the construction site is wrongly segmented as buildings. The reason is that our method can pick out representative areas that are difficult for the semantic network to segment, while randomly selected images cannot cover these features. For Urban2, we randomly selected 16 images, more than the 11 used by our method up to iter3. The orthoimage of the resulting semantic model is similar to the result of iter3. This is because, for small and regular scenes, sufficient annotated data can cover the features of the entire scene, thus generating a good model. In contrast, our method maximizes the contribution of each annotation, improving efficiency.
A three-dimensional semantic annotation system based on photogrammetry grid with active learning according to a second embodiment of the present invention, as shown in fig. 2, includes: the system comprises an image acquisition module 100, a semantic segmentation module 200, a fusion module 300, an iteration judgment module 400, an uncertainty acquisition module 500, an image set construction module 600, an overall uncertainty acquisition module 700, a dispersion acquisition module 800, a circulation module 900 and an output module 1000;
the image acquisition module 100 is configured to acquire a city street view image to be annotated as an input image;
the semantic segmentation module 200 is configured to obtain a semantic segmentation result corresponding to each input image through a pre-trained semantic segmentation network, and back-project the semantic segmentation result corresponding to each input image onto a three-dimensional mesh model through ray intersection to obtain an initial three-dimensional semantic network model as a first model;
the fusion module 300 is configured to construct a smooth constraint of the class labels of adjacent patches of the first model, and perform fusion through a markov random field to obtain a second model;
the iteration determining module 400 is configured to execute the uncertainty obtaining module 500 if the current iteration number is 1; otherwise, counting the number of inconsistent patch type labels corresponding to the second model obtained at present and the second model obtained last time, if the ratio of the number to the total number of patches of the second model is greater than a set threshold, skipping to the step uncertainty acquisition module 500, otherwise skipping to the output module 1000;
the uncertainty obtaining module 500 is configured to obtain, for each patch in the second model, uncertainty of each input image through a preset first method according to a corresponding relationship between the patch and each pixel of the input image and in combination with a confidence of a category label of the patch;
the image set constructing module 600 is configured to obtain an input image with the minimum uncertainty to construct a first image set when t is 1;
the overall uncertainty acquiring module 700 is configured to let t = t + 1, construct a new image set from each remaining input image together with the (t-1)-th image set, and calculate the average uncertainty of the newly constructed image set as its overall uncertainty;
the dispersion obtaining module 800 is configured to calculate, for each newly constructed image set, the intersection-over-union of the areas of the patches its images cover in the second model as the dispersion of the image set, and to perform a weighted average of the dispersion and the overall uncertainty; the newly constructed image set corresponding to the minimum weighted average is taken as the t-th image set;
the circulation module 900 is configured to determine whether t reaches a set threshold, if not, execute the whole uncertainty acquisition module 700 and the dispersion acquisition module 800 in a circulation manner, otherwise, label the input images in the tth image set;
the output module 1000 is configured to use the currently obtained second model as the finally labeled three-dimensional semantic network model.
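As referenced above, the following is a minimal Python sketch of the image-set selection performed by modules 600-900. It assumes per-image uncertainty scores (summed pixel label confidences) and the set of second-model patches each image covers are already available; the names, the greedy loop structure, the reading of the dispersion as a mean pairwise IoU, and the weight alpha are illustrative assumptions, not the patent's literal implementation.

```python
import itertools

def iou(a, b):
    """Intersection-over-union of two patch-id sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def select_image_set(uncertainty, coverage, t_max, alpha=0.5):
    """Greedy sketch of modules 600-900.

    uncertainty: dict image_id -> summed pixel label confidence
                 (the document selects the minimum of this score)
    coverage:    dict image_id -> set of second-model patch ids the
                 image covers (basis of the IoU dispersion)
    alpha:       illustrative weight for the weighted average of the
                 dispersion and the overall uncertainty
    """
    remaining = set(uncertainty)
    # module 600 (t = 1): start from the minimum-uncertainty image
    first = min(remaining, key=lambda i: uncertainty[i])
    selected = [first]
    remaining.discard(first)
    # modules 700-900: grow the set until the threshold t_max is reached
    while len(selected) < t_max and remaining:
        best, best_score = None, float("inf")
        for cand in remaining:
            trial = selected + [cand]
            # module 700: mean uncertainty of the candidate set
            overall = sum(uncertainty[i] for i in trial) / len(trial)
            # module 800: dispersion, read here as mean pairwise IoU
            pairs = list(itertools.combinations(trial, 2))
            dispersion = sum(iou(coverage[a], coverage[b])
                             for a, b in pairs) / len(pairs)
            score = alpha * dispersion + (1.0 - alpha) * overall
            if score < best_score:
                best, best_score = cand, score
        selected.append(best)
        remaining.discard(best)
    return selected
```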
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.
It should be noted that the active-learning-based three-dimensional semantic annotation system for photogrammetric meshes provided in the foregoing embodiment is illustrated only by the division into the above functional modules. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the modules or steps in the embodiments of the present invention may be further decomposed or combined. For example, the modules in the foregoing embodiment may be combined into one module, or further split into a plurality of sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps and are not to be construed as unduly limiting the present invention.
A storage device according to a third embodiment of the present invention stores a plurality of programs adapted to be loaded by a processor to implement the above-described active-learning-based three-dimensional semantic labeling method for photogrammetric meshes.
A processing apparatus according to a fourth embodiment of the present invention includes a processor adapted to execute various programs and a storage device adapted to store a plurality of programs; the programs are adapted to be loaded and executed by the processor to implement the above-described active-learning-based three-dimensional semantic labeling method for photogrammetric meshes.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method examples, and are not described herein again.
Those of skill in the art will appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both, and that programs corresponding to the software modules and method steps may reside in random access memory (RAM), read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," "third," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
So far, the technical solutions of the present invention have been described with reference to the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of the related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical solutions after such changes or substitutions fall within the protection scope of the invention.

Claims (9)

1. A three-dimensional semantic labeling method for photogrammetric meshes based on active learning, characterized by comprising the following steps:
step S10, acquiring city street view images to be annotated as input images;
step S20, for each input image, obtaining the corresponding semantic segmentation result through a fine-tuned semantic segmentation network, and back-projecting the result onto the three-dimensional mesh model through ray intersection to obtain an initial three-dimensional semantic mesh model as a first model;
step S30, for the first model, constructing a smoothness constraint on the class labels of adjacent patches and fusing through a Markov random field to obtain a second model;
step S40, if the current iteration number is 1, executing step S50; otherwise, counting the number of patches whose class labels differ between the currently obtained second model and the previously obtained second model; if the ratio of this number to the total number of patches of the second model is greater than a set threshold, jumping to step S50, otherwise jumping to step S100;
step S50, for each patch in the second model, obtaining the uncertainty of each input image through a preset first method, according to the correspondence between the patch and the pixels of the input images and in combination with the confidence of the patch's class label;
step S60, when t = 1, selecting the input image with the minimum uncertainty to construct the first image set;
step S70, letting t = t + 1, constructing a candidate image set from each remaining input image together with the (t-1)-th image set, and taking the mean uncertainty of each newly constructed image set as its overall uncertainty;
step S80, for each newly constructed image set, computing the intersection-over-union of the patch areas its images cover in the second model as the dispersion of that set, and taking the weighted average of the dispersion and the overall uncertainty; the newly constructed image set with the minimum weighted average is taken as the t-th image set;
step S90, judging whether t has reached a set threshold; if not, executing steps S70 and S80 in a loop; otherwise, labeling the input images in the t-th image set, updating the parameters of the semantic segmentation network based on all labeled input images, and jumping to step S20 after the update;
step S100, taking the currently obtained second model as the final labeled three-dimensional semantic mesh model.
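To make the control flow of this claim concrete, here is a hedged Python sketch of the outer iteration (steps S20-S100); segment, backproject, mrf_fuse, select_and_annotate, finetune and the face_labels attribute are hypothetical stand-ins for the components the claim describes, and stop_ratio plays the role of the set threshold in step S40.

```python
def label_mesh(images, segment, backproject, mrf_fuse, select_and_annotate,
               finetune, stop_ratio=0.02, max_sets=5):
    """Sketch of the outer active-learning loop (steps S20-S100)."""
    prev_labels = None
    while True:
        # S20: per-image 2D segmentation, back-projected onto the mesh
        first_model = backproject([segment(im) for im in images])
        # S30: MRF fusion with the smoothness constraint
        second_model = mrf_fuse(first_model)
        labels = second_model.face_labels
        # S40: stop once few patch labels changed since the last iteration
        if prev_labels is not None:
            changed = sum(a != b for a, b in zip(labels, prev_labels))
            if changed / len(labels) <= stop_ratio:
                return second_model            # S100: final labeled model
        prev_labels = labels
        # S50-S90: pick informative image sets, annotate, update the network
        annotated = select_and_annotate(images, second_model, max_sets)
        finetune(annotated)
```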
2. The active-learning-based three-dimensional semantic labeling method for photogrammetric meshes according to claim 1, characterized in that the fine-tuning training method of the semantic segmentation network comprises:
step A10, randomly selecting a set number of city street view images from a first training sample set for labeling; the first training sample set is the data set constructed from all city street view images;
step A20, if the current iteration number is 1, fine-tuning the first network with the randomly selected and labeled city street view images; otherwise, fine-tuning the first network with the city street view images in the second training sample set; the first network is a DeepLab network pre-trained on the Cityscapes data set, and the second training sample set is the data set constructed from the labeled city street view images;
step A30, replacing the classification layer of the fine-tuned first network with a softmax layer over the preset categories to obtain a second network;
step A40, based on the city street view images in the first training sample set and in combination with the second network, obtaining the labeled t-th image set by the method of steps S20-S80;
step A50, adding the labeled t-th image set to the pre-constructed second training sample set and jumping to step A20.
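A minimal fine-tuning sketch for steps A20-A30 follows, using torchvision's DeepLabV3 (ResNet-50 backbone) as a stand-in for the DeepLab network of the claim; the data loader, class count, ignore index and hyperparameters are illustrative assumptions, and loading Cityscapes-pretrained weights is assumed to happen before this function is called.

```python
import torch
from torch import nn, optim
from torchvision.models.segmentation import deeplabv3_resnet50

def finetune_segmenter(train_loader, num_classes, epochs=10, lr=1e-4,
                       device="cuda" if torch.cuda.is_available() else "cpu"):
    """Fine-tune a DeepLab-style network on the labeled street-view set.

    The classifier head is sized to the preset categories; at inference a
    softmax over its logits plays the role of the softmax layer of step A30.
    """
    model = deeplabv3_resnet50(weights=None, num_classes=num_classes).to(device)
    # the claim's network is pre-trained on Cityscapes; restoring such
    # weights is assumed to be done separately before fine-tuning
    criterion = nn.CrossEntropyLoss(ignore_index=255)  # 255 = unlabeled
    optimizer = optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, masks in train_loader:
            images, masks = images.to(device), masks.to(device)
            logits = model(images)["out"]          # (B, C, H, W)
            loss = criterion(logits, masks)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```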
3. The active-learning-based three-dimensional semantic labeling method for photogrammetric meshes according to claim 1, characterized in that in step S30 the "smoothness constraint on the class labels of adjacent patches" is constructed by the pairwise term V(l_{f1}, l_{f2}) defined by the formulas FDA0003176631340000021 and FDA0003176631340000022 (rendered as images in the source), wherein f1 and f2 denote two adjacent patches, l_{f1} and l_{f2} (image FDA0003176631340000023) denote the class labels corresponding to the two patches, k_min and k_max denote the minimum and maximum principal curvatures of a patch, w_min and w_max denote the principal direction vectors of those principal curvatures, and s denotes the set scale factor.
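Since the exact expressions of this claim survive only as formula images, the Python fragment below is a purely hypothetical illustration of how a pairwise smoothness weight could combine the named quantities (k_min, k_max, w_min, w_max, s); it is not the patent's formula and only conveys the general idea of penalizing label changes less across crease-like, sharply curved edges.

```python
import numpy as np

def smoothness_weight(k_min, k_max, w_min, w_max, s=1.0):
    """Hypothetical curvature-based smoothness weight (not the patent's
    formula): small on crease-like surface regions, so the MRF cuts
    labels there more cheaply."""
    # anisotropy of the local surface: ~0 on planes/spheres, ~1 on creases
    denom = abs(k_min) + abs(k_max)
    anisotropy = 0.0 if denom == 0 else (abs(k_max) - abs(k_min)) / denom
    # alignment of the two principal directions across the shared edge
    alignment = abs(float(np.dot(w_min, w_max)))
    return s * (1.0 - anisotropy) * alignment
```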
4. The active-learning-based three-dimensional semantic labeling method for photogrammetric meshes according to claim 3, characterized in that "fusing through a Markov random field to obtain the second model" in step S30 comprises:
constructing an energy function from the likelihood data terms and the smoothness constraint; the likelihood data terms are the likelihood distribution of each patch over the class labels in the initial three-dimensional semantic mesh model;
computing the global optimum of the energy function with the graph-cut algorithm alpha-expansion to obtain the cut result for each pair of adjacent patches, and combining it with the first model to obtain the second model.
5. The active-learning-based three-dimensional semantic labeling method for photogrammetric meshes according to claim 4, characterized in that the energy function of the likelihood data terms and the smoothness constraint is constructed as

E(l) = Σ_{f∈F} D_f(l_f) + λ Σ_{(f,q)∈N} V(l_f, l_q)

wherein E(l) denotes the energy function, l_f and l_q denote the class labels corresponding to adjacent patches f and q, F denotes the set of all patches, N denotes the set of adjacent patch pairs, f denotes a patch in the three-dimensional semantic mesh model, D_f(l_f) denotes the likelihood data term, i.e. the likelihood that patch f belongs to class l_f, V(l_f, l_q) denotes the smoothness constraint, and λ denotes the preset balance factor.
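As a hedged sketch of how this energy could be evaluated and minimized, the fragment below substitutes iterated conditional modes (ICM), a simple local minimizer, for the alpha-expansion graph cut named in claim 4; taking D_f(l_f) as the negative log-likelihood is also an assumption, since the claims only state that the data term encodes the per-patch class likelihood.

```python
import numpy as np

def fuse_labels(likelihood, adjacency, smooth, lam=1.0, sweeps=10):
    """Approximately minimize E(l) = sum_f D_f(l_f) + lam * sum V(l_f, l_q).

    likelihood: (n_faces, n_classes) per-patch class likelihoods (first model)
    adjacency:  list of (f, q) index pairs of adjacent patches
    smooth:     callable V(lf, lq) -> pairwise cost
    """
    eps = 1e-12
    data = -np.log(likelihood + eps)        # assumed data term D_f(l_f)
    labels = likelihood.argmax(axis=1)      # initialize from the first model
    neighbors = [[] for _ in range(len(labels))]
    for f, q in adjacency:
        neighbors[f].append(q)
        neighbors[q].append(f)
    for _ in range(sweeps):
        changed = False
        for f in range(len(labels)):
            # cost of assigning each class to patch f, neighbors fixed
            costs = data[f].copy()
            for q in neighbors[f]:
                for c in range(data.shape[1]):
                    costs[c] += lam * smooth(c, labels[q])
            new = int(costs.argmin())
            if new != labels[f]:
                labels[f] = new
                changed = True
        if not changed:                     # local minimum reached
            break
    return labels
```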
6. The active-learning-based three-dimensional semantic labeling method for photogrammetric meshes according to claim 1, characterized in that "the uncertainty of each input image is obtained through a preset first method" in step S50 comprises:
obtaining the label confidence of each pixel of an input image from the patch it corresponds to:

u_p = u_{f(p)}, f(p) ∈ F

wherein u_p denotes the label confidence of a pixel p of the city street view image, u_f denotes the label confidence of patch f in the second model, f(p) denotes the patch to which pixel p corresponds, and F denotes the set of all patches in the second model;
summing the label confidences of all pixels of the input image to obtain the uncertainty of that image.
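Read literally, the uncertainty score of an image is the sum of its pixels' label confidences, each inherited from the corresponding patch. A few-line sketch, with illustrative names and assuming a rendered pixel-to-patch correspondence (e.g. a face-id map) is available:

```python
def image_uncertainty(pixel_to_patch, patch_confidence):
    """Sum the per-pixel label confidences of one input image.

    pixel_to_patch:   iterable of patch ids, one per image pixel
    patch_confidence: dict patch_id -> label confidence u_f in the
                      second model
    """
    return sum(patch_confidence[f] for f in pixel_to_patch)
```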
7. A three-dimensional semantic annotation system for photogrammetric meshes based on active learning, the system comprising: an image acquisition module, a semantic segmentation module, a fusion module, an iteration judgment module, an uncertainty acquisition module, an image set construction module, an overall uncertainty acquisition module, a dispersion acquisition module, a circulation module and an output module;
the image acquisition module is configured to acquire city street view images to be annotated as input images;
the semantic segmentation module is configured to obtain the semantic segmentation result corresponding to each input image through a fine-tuned semantic segmentation network, and back-project it onto the three-dimensional mesh model through ray intersection to obtain an initial three-dimensional semantic mesh model as a first model;
the iteration judgment module is configured to execute the uncertainty acquisition module if the current iteration number is 1; otherwise, to count the number of patches whose class labels differ between the current second model and the previous second model and, if the ratio of this number to the total number of patches of the second model is greater than a set threshold, to jump to the uncertainty acquisition module, otherwise to jump to the output module;
the fusion module is configured to construct a smoothness constraint on the class labels of adjacent patches of the first model and fuse them through a Markov random field to obtain a second model;
the uncertainty acquisition module is configured to obtain, for each patch in the second model, the uncertainty of each input image through a preset first method, according to the correspondence between the patch and the pixels of the input images and the confidence of the patch's class label;
the image set construction module is configured to select, when t = 1, the input image with the minimum uncertainty to construct the first image set;
the overall uncertainty acquisition module is configured to let t = t + 1, construct a candidate image set from each remaining input image together with the (t-1)-th image set, and take the mean uncertainty of each newly constructed image set as its overall uncertainty;
the dispersion acquisition module is configured to compute, for each newly constructed image set, the intersection-over-union of the patch areas its images cover in the second model as the dispersion of that set, and to take the weighted average of the dispersion and the overall uncertainty; the newly constructed image set with the minimum weighted average is taken as the t-th image set;
the circulation module is configured to judge whether t has reached a set threshold and, if not, to execute the overall uncertainty acquisition module and the dispersion acquisition module in a loop; otherwise, the input images in the t-th image set are labeled;
the output module is configured to take the currently obtained second model as the final labeled three-dimensional semantic mesh model.
8. A storage device having stored thereon a plurality of programs, characterized in that the programs are adapted to be loaded and executed by a processor to implement the active-learning-based three-dimensional semantic labeling method for photogrammetric meshes of any one of claims 1-6.
9. A processing device comprising a processor adapted to execute various programs and a storage device adapted to store a plurality of programs, characterized in that the programs are adapted to be loaded and executed by the processor to implement the active-learning-based three-dimensional semantic labeling method for photogrammetric meshes of any one of claims 1-6.
CN202010919006.4A 2020-09-04 2020-09-04 Three-dimensional semantic annotation method of photogrammetry grid based on active learning Active CN111968240B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010919006.4A CN111968240B (en) 2020-09-04 2020-09-04 Three-dimensional semantic annotation method of photogrammetry grid based on active learning

Publications (2)

Publication Number Publication Date
CN111968240A CN111968240A (en) 2020-11-20
CN111968240B (en) 2022-02-25

Family

ID=73392010

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010919006.4A Active CN111968240B (en) 2020-09-04 2020-09-04 Three-dimensional semantic annotation method of photogrammetry grid based on active learning

Country Status (1)

Country Link
CN (1) CN111968240B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221659B (en) * 2021-04-13 2022-12-23 天津大学 Double-light vehicle detection method and device based on uncertain sensing network
CN113487741B (en) * 2021-06-01 2024-05-28 中国科学院自动化研究所 Dense three-dimensional map updating method and device
CN115661810A (en) * 2021-08-27 2023-01-31 同方威视技术股份有限公司 Security check CT target object identification method and device
CN114882272A (en) * 2022-04-22 2022-08-09 成都飞机工业(集团)有限责任公司 Fusion analysis method for aerial manufacturing full-angle projection image surface patch attributes
CN117058384B (en) * 2023-08-22 2024-02-09 山东大学 Method and system for semantic segmentation of three-dimensional point cloud
CN117557871B (en) * 2024-01-11 2024-03-19 子亥科技(成都)有限公司 Three-dimensional model labeling method, device, equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103218818A (en) * 2013-04-19 2013-07-24 中国科学院深圳先进技术研究院 Three-dimensional model segmentation method and segmentation system
CN103268635A (en) * 2013-05-15 2013-08-28 北京交通大学 Segmentation and semantic annotation method of geometry grid scene model
CN106803256A (en) * 2017-01-13 2017-06-06 深圳市唯特视科技有限公司 A kind of 3D shape based on projection convolutional network is split and semantic marker method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Fine-Level Semantic Labeling of Large-Scale 3D Model by Active Learning; Yang Zhou et al.; 2018 International Conference on 3D Vision; 2018-12-31; pp. 523-532 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant