CN112348972A

CN112348972A - Fine semantic annotation method based on large-scale scene three-dimensional model

Info

Publication number: CN112348972A
Application number: CN202011011807.7A
Authority: CN
Inventors: 何娇; 王江安
Original assignee: Shaanxi Tudou Data Technology Co ltd
Current assignee: Shaanxi Tudou Data Technology Co ltd
Priority date: 2020-09-22
Filing date: 2020-09-22
Publication date: 2021-02-09

Abstract

The invention discloses a fine semantic annotation method based on a large-scale scene three-dimensional model. The method iteratively executes the following steps under an active learning framework: S1, training a CNN semantic segmentation network using a continuously expanding set of labeled images; S2, back-projecting the pixel labels of all images onto the three-dimensional mesh model using the calibrated camera parameters; S3, taking the fused semantic three-dimensional model as a supervisor; S4, continuing the training-fusion-selection process until the labels of the model become stable, i.e., the percentage of patches whose labels differ between the previous and current iterations falls below a threshold η. The invention can be used to finely label large-scale scene three-dimensional models reconstructed from images; the proposed method requires only limited manual work while ensuring the quality of the semantic labeling of the model.

Description

Fine semantic annotation method based on large-scale scene three-dimensional model

Technical Field

The invention belongs to the technical field of unmanned aerial vehicle oblique photography, and particularly relates to a fine semantic annotation method based on a large-scale scene three-dimensional model.

Background

In recent years, semantic annotation of three-dimensional models has been a challenging research direction. At present, there are two main approaches to automatic semantic annotation of large-scale three-dimensional models. The first reconstructs the scene by combining three-dimensional modeling and semantics: images are segmented with a pre-trained decision tree, and a semantic model is then reconstructed from the label images and depth maps. The second assigns semantic labels to an existing three-dimensional model: pixel-level semantic segmentation is first performed on the two-dimensional images, and the labels are then back-projected onto the three-dimensional model using the calibrated camera parameters and fused together.

Since the types and shapes of three-dimensional objects differ between scenes, it is difficult to devise a general method suitable for most scenes. Three-dimensional semantic models help humans and automated systems know what objects are where in a particular scene, and have applications in autonomous driving, augmented reality, robotics, and other fields. A fine, large-scale three-dimensional model of a scene has thousands of patches, and the most straightforward approach is to label them manually. However, there is no effective tool for manually labeling each patch, and existing deep learning techniques cannot process three-dimensional models of large-scale scenes. Therefore, a method for labeling large-scale three-dimensional scene models is needed.

No effective solution to these problems in the related art has yet been proposed; a fine semantic annotation method based on a large-scale scene three-dimensional model is therefore provided.

Disclosure of Invention

(I) technical problem to be solved

In view of the deficiencies of the prior art, the invention provides a fine semantic annotation method based on a large-scale scene three-dimensional model, which solves the problems described in the Background section.

(II) technical scheme

To achieve the above purpose, the invention provides the following technical scheme: a fine semantic annotation method based on a large-scale scene three-dimensional model, characterized in that the following steps are executed iteratively under an active learning framework:

S1, training a CNN semantic segmentation network using a continuously expanding set of labeled images, and then using the trained CNN to obtain pixel-level semantic labels for the unlabeled images;

S2, back-projecting the pixel labels of all images onto the three-dimensional mesh model using the calibrated camera parameters, fusing the labels with the three-dimensional mesh model by an MRF (Markov random field) optimization method, and assigning a single label to each patch by combining the two-dimensional semantic labels with three-dimensional geometric features;

S3, taking the fused semantic three-dimensional model as a supervisor, selecting a batch of valuable images for annotation by a batch image selection method, and merging these images into the training set after manual annotation in preparation for the next iteration;

S4, continuing the training-fusion-selection process until the labels of the model become stable, i.e., the percentage of patches whose labels differ between the previous and current iterations falls below the threshold η.
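The iteration in S1 to S4 can be sketched as a loop with a stability test. This is a minimal illustration, not the method's actual implementation: the callables `train`, `fuse`, and `select`, the default `eta`, and the `max_iters` safety cap are all hypothetical placeholders for the CNN training, MRF fusion, and batch selection steps.

```python
from typing import Callable, Dict, Set, Tuple


def label_change_ratio(prev: Dict[int, int], curr: Dict[int, int]) -> float:
    """Fraction of patches whose label differs between two iterations."""
    if not curr:
        return 0.0
    changed = sum(1 for patch, label in curr.items() if prev.get(patch) != label)
    return changed / len(curr)


def active_learning_loop(
    train: Callable[[Set[str]], object],
    fuse: Callable[[object], Dict[int, int]],
    select: Callable[[Dict[int, int], Set[str]], Set[str]],
    labeled: Set[str],
    eta: float = 0.01,       # assumed threshold value, not from the source
    max_iters: int = 20,     # assumed safety cap, not from the source
) -> Tuple[Dict[int, int], int]:
    """Run train -> fuse -> select until the patch labels stabilise."""
    prev: Dict[int, int] = {}
    curr: Dict[int, int] = {}
    for iteration in range(max_iters):
        model = train(labeled)                      # S1: train on labeled set
        curr = fuse(model)                          # S2: per-patch labels
        if iteration > 0 and label_change_ratio(prev, curr) < eta:
            return curr, iteration + 1              # S4: labels are stable
        labeled |= select(curr, labeled)            # S3: add annotated images
        prev = curr
    return curr, max_iters
```

The stopping rule compares only the patch labels of consecutive iterations, so it needs no ground truth for the three-dimensional model.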

Preferably, the method takes as input the three-dimensional mesh model reconstructed by SfM and MVS together with the calibrated images, and outputs a three-dimensional semantic mesh model in which each patch carries a semantic label and different colors represent different categories.

Preferably, the SfM is formed by multiple channels interleaved horizontally and vertically; each channel provides 8 Gbps of switching capability (super player 720 provides 20 Gbps per channel). The greatest advantage of matrix switching is that it allows multiple non-conflicting switching operations to be performed simultaneously and supports point-to-multipoint (multicast) switching.

Preferably, the MVS is a substrate that uses two 14 MHz Motorola 68000 CPUs and achieves a resolution of 320 × 224 (a maximum of 65,536 colors, with 4,096 colors displayed on screen); the sound processing chip is a Z80A, with an 8-channel FM synthesis sound source and a 7-channel digital stereo sound source (PSG & PCM); the system RAM is 7 MB (56 Mbits), and the maximum capacity of the cassette is 42 MB (330 Mbits).

Preferably, semantic segmentation is a task in computer vision in which different parts of the visual input are assigned to different categories according to their semantics, so that, through semantic understanding, each category carries a definite real-world meaning.

Preferably, in the MRF optimization in S2, variable weight parameters are introduced into the conventional MRF image segmentation algorithm to connect the label field model and the feature field model and strike a balance between the two, yielding a segmentation result that preserves image edges, important image details, and region consistency. An edge penalty function is then introduced adaptively at the edges to adjust the contribution of the potential-function energy to the energy function, which reduces edge blurring during segmentation and improves edge positioning accuracy.

(III) advantageous effects

Compared with the prior art, the invention provides a fine semantic annotation method based on a large-scale scene three-dimensional model, which has the following beneficial effects:

The method can finely label a large-scale scene three-dimensional model reconstructed from images by determining the number of semantic segmentation classes, preparing annotated data for training, and performing semantic segmentation on the images; it requires only limited manual work while ensuring the semantic labeling quality of the model.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a schematic view of an image according to the present invention;

FIG. 3 is a 3D image diagram according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Examples

Referring to FIGS. 1-3, the present invention provides a technical solution: a fine semantic annotation method based on a large-scale scene three-dimensional model, whose steps are executed iteratively under an active learning framework.

The method takes as input the three-dimensional mesh model reconstructed by SfM and MVS together with the calibrated images, and outputs a three-dimensional semantic mesh model in which each patch carries a semantic label and different colors represent different categories.

The specific operation is as follows:

step 1: determining the number of semantic segmentation categories and annotating data;

Number of semantic segmentation classes: 4, with labels 0-3 (representing other, building, road, and vegetation, respectively). Annotated data: a small number of images are semantically segmented and annotated using the Labelme data annotation software, generating json files;
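As an illustration of step 1, the following sketch reads the polygons from a Labelme json file and maps each one to a class id from 0 to 3. The label strings in `CLASS_IDS` and the fallback of unknown names to class 0 are assumptions; the text fixes only the four categories and their numeric labels, not the names typed into Labelme.

```python
import json
from typing import List, Tuple

# Assumed label strings; only the ids 0-3 and their meanings come from the text.
CLASS_IDS = {"other": 0, "building": 1, "road": 2, "vegetation": 3}

Polygon = List[Tuple[float, float]]


def load_labelme_shapes(json_text: str) -> List[Tuple[int, Polygon]]:
    """Return (class_id, polygon) pairs from the text of a Labelme json file."""
    document = json.loads(json_text)
    shapes = []
    for shape in document.get("shapes", []):
        # Unknown label names fall back to class 0 ("other"): an assumption.
        class_id = CLASS_IDS.get(shape["label"], 0)
        polygon = [(float(x), float(y)) for x, y in shape["points"]]
        shapes.append((class_id, polygon))
    return shapes
```

Rasterising the polygons into a per-pixel label mask for training is left out here, since it depends on the image library in use.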

step 2: training a semantic segmentation network on the annotated data to obtain a satisfactory classification model;

step 3: performing semantic segmentation on the images to obtain the probability distribution over the categories;

step 4: calculating, for each patch f of the mesh, the probability Pr(l_f = l) that its label is l;

Ω_{f,i} denotes the projected area of patch f in the i-th image, and I denotes the entire image set;
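The per-patch probability in step 4 can be illustrated as follows. This is a minimal sketch that assumes the probability is the average of the per-pixel class probabilities over every pixel in the projected areas Ω_{f,i} across all images; the exact formula is not stated here, and the function name `patch_label_probability` and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Pixel = Tuple[int, int]


def patch_label_probability(
    projections: Dict[str, List[Pixel]],
    pixel_probs: Dict[str, Dict[Pixel, Sequence[float]]],
    num_classes: int,
) -> List[float]:
    """Average the pixel class probabilities over one patch's projections.

    projections maps an image id to the pixels covered by the patch in that
    image (the projected area); pixel_probs maps an image id to the per-pixel
    class distribution predicted by the segmentation network.
    """
    totals = [0.0] * num_classes
    pixel_count = 0
    for image_id, pixels in projections.items():
        image_probs = pixel_probs[image_id]
        for pixel in pixels:
            for label in range(num_classes):
                totals[label] += image_probs[pixel][label]
            pixel_count += 1
    if pixel_count == 0:
        # Patch not visible in any image: fall back to a uniform distribution.
        return [1.0 / num_classes] * num_classes
    return [total / pixel_count for total in totals]
```

Because every pixel contributes equally, images in which the patch projects to a larger area carry proportionally more weight.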

step 5: assigning a corresponding label to each patch of the mesh by MRF semantic fusion in 3D space. The patch labeling problem is treated as an energy minimization problem on the MRF. The Gibbs energy E of the MRF posterior probability distribution is defined in terms of the following:

F is the entire set of patches, A is the set of pairs of neighboring patches, and

V_{f,q}(l_f, l_q) represents the geometric constraint between the adjacent patches (f, q).
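A generic pairwise MRF energy that fits the symbols F, A, and V can be written as below. The unary term U_f, the labeling L, and the choice of the negative log of the patch probability as the unary cost are assumed notation for illustration and are not taken from the original formula.

```latex
E(L) = \sum_{f \in F} U_f(l_f) + \sum_{(f,q) \in A} V_{f,q}(l_f, l_q),
\qquad
U_f(l) = -\log \Pr(l_f = l)
```

Under this form, the first sum rewards labels that agree with the back-projected image predictions, and the second sum penalises label changes across adjacent patches according to their geometry.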

The energy E is minimized by the alpha-expansion algorithm to generate the semantic three-dimensional model, in which each patch has a semantic label;
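Energy minimization over patch labels can be illustrated with a much simpler stand-in. Alpha-expansion needs a graph-cut solver, so the sketch below swaps in iterated conditional modes (ICM), a greedy local minimizer that is not the algorithm named above. It also assumes a negative-log unary cost and a constant Potts penalty `weight` in place of the geometric constraint; all names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def energy(labels, unary, edges, weight):
    """Unary cost of every patch plus a Potts penalty per disagreeing edge."""
    data = sum(unary[f][labels[f]] for f in unary)
    smooth = sum(weight for f, q in edges if labels[f] != labels[q])
    return data + smooth


def icm(
    probs: Dict[int, List[float]],
    edges: List[Tuple[int, int]],
    weight: float = 0.5,
    max_sweeps: int = 10,
) -> Tuple[Dict[int, int], float]:
    """Greedy label refinement; a simple stand-in for alpha-expansion."""
    unary = {f: [-math.log(max(p, 1e-9)) for p in ps] for f, ps in probs.items()}
    neighbors: Dict[int, List[int]] = {f: [] for f in probs}
    for f, q in edges:
        neighbors[f].append(q)
        neighbors[q].append(f)
    # Start from the most probable label of each patch.
    labels = {f: min(range(len(u)), key=u.__getitem__) for f, u in unary.items()}
    for _ in range(max_sweeps):
        changed = False
        for f, u in unary.items():
            def local_cost(label):
                return u[label] + sum(
                    weight for q in neighbors[f] if labels[q] != label
                )
            best = min(range(len(u)), key=local_cost)
            # Only accept strict improvements so the energy always decreases.
            if local_cost(best) < local_cost(labels[f]):
                labels[f] = best
                changed = True
        if not changed:
            break
    return labels, energy(labels, unary, edges, weight)
```

ICM only reaches a local minimum, whereas alpha-expansion gives stronger guarantees for this class of energies; the sketch is meant to show the shape of the objective rather than the solver.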

step 6: once the 3D semantic labels are obtained, the semantic model serves as a supervisor for batch image selection: it is used to measure the segmentation quality of each image and thereby helps select valuable images for annotation. Actively selecting the images to annotate greatly reduces the annotation cost of semantically annotating a large-scale scene three-dimensional model.
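One possible reading of the supervisor role in step 6 is sketched below. It assumes that segmentation quality is measured by how often the CNN prediction for a pixel disagrees with the label rendered from the fused three-dimensional model at that pixel, and that the images with the highest disagreement are the valuable ones; the text does not specify the measure, and `disagreement` and `select_batch` are hypothetical names.

```python
from typing import Dict, List, Sequence, Set


def disagreement(predicted: Sequence[int], rendered: Sequence[int]) -> float:
    """Fraction of pixels where the CNN label differs from the 3D model label."""
    if len(predicted) != len(rendered):
        raise ValueError("label sequences must have the same length")
    if not predicted:
        return 0.0
    mismatches = sum(1 for p, r in zip(predicted, rendered) if p != r)
    return mismatches / len(predicted)


def select_batch(
    predicted: Dict[str, Sequence[int]],
    rendered: Dict[str, Sequence[int]],
    already_labeled: Set[str],
    batch_size: int,
) -> List[str]:
    """Pick the unlabeled images whose predictions disagree most with the model."""
    scores = {
        image_id: disagreement(predicted[image_id], rendered[image_id])
        for image_id in predicted
        if image_id not in already_labeled
    }
    # Highest disagreement first; ties broken by image id for determinism.
    ranked = sorted(scores, key=lambda image_id: (-scores[image_id], image_id))
    return ranked[:batch_size]
```

Images that are already in the training set are skipped, so each round of manual annotation goes to new images.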

The SfM is formed by multiple channels interleaved horizontally and vertically; each channel provides 8 Gbps of switching capability (super player 720 provides 20 Gbps per channel). The greatest advantage of matrix switching is that it allows multiple non-conflicting switching operations to be performed simultaneously and supports point-to-multipoint (multicast) switching.

The MVS is a substrate that uses two 14 MHz Motorola 68000 CPUs and achieves a resolution of 320 × 224 (a maximum of 65,536 colors, with 4,096 colors displayed on screen); the sound processing chip is a Z80A, with an 8-channel FM synthesis sound source and a 7-channel digital stereo sound source (PSG & PCM); the system RAM is 7 MB (56 Mbits), and the maximum capacity of the cassette is 42 MB (330 Mbits).

Semantic segmentation is a task in computer vision in which different parts of the visual input are assigned to different categories according to their semantics, so that, through semantic understanding, each category carries a definite real-world meaning.

In the MRF optimization in S2, variable weight parameters are introduced into the conventional MRF image segmentation algorithm to connect the label field model and the feature field model and strike a balance between the two, yielding a segmentation result that preserves image edges, important image details, and region consistency. An edge penalty function is then introduced adaptively at the edges to adjust the contribution of the potential-function energy to the energy function, which reduces edge blurring during segmentation and improves edge positioning accuracy.

FIGS. 2 and 3 of the invention are schematic only; the details of the specific objects shown in them have no direct bearing on the implementation of the technical scheme of the invention and do not affect the disclosure of the scheme.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

In the description of the present invention, it should be noted that the terms "upper", "lower", "inner", "outer", "front", "rear", "both ends", "one end", "the other end", and the like indicate orientations or positional relationships based on those shown in the drawings. They are used only for convenience and simplicity of description and do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation; they should therefore not be construed as limiting the present invention.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A fine semantic annotation method based on a large-scale scene three-dimensional model, characterized in that the following steps are performed iteratively under an active learning framework:

S1, training a CNN semantic segmentation network using a continuously expanding set of labeled images, and then using the trained CNN to obtain pixel-level semantic labels for the unlabeled images;

2. The method for fine semantic annotation based on the large-scale scene three-dimensional model according to claim 1, wherein: the method takes as input the three-dimensional mesh model reconstructed by SfM and MVS together with the calibrated images, and outputs a three-dimensional semantic mesh model in which each patch carries a semantic label and different colors represent different categories.

3. The method for fine semantic annotation based on the large-scale scene three-dimensional model according to claim 1, wherein: the SfM is formed by multiple channels interleaved horizontally and vertically, each channel provides 8 Gbps of switching capability (super 720 provides 20 Gbps per channel), and the greatest advantage of matrix switching is that it allows multiple non-conflicting switching operations to be performed simultaneously and supports point-to-multipoint (multicast) switching.

4. The method for fine semantic annotation based on the large-scale scene three-dimensional model according to claim 1, wherein: the MVS is a substrate that uses two 14 MHz Motorola 68000 CPUs and achieves a resolution of 320 × 224 (a maximum of 65,536 colors, with 4,096 colors displayed on screen), the sound processing chip is a Z80A, with an 8-channel FM synthesis sound source and a 7-channel digital stereo sound source (PSG & PCM), the system RAM is 7 MB (56 Mbits), and the maximum capacity of the cassette is 42 MB (330 Mbits).

5. The method for fine semantic annotation based on the large-scale scene three-dimensional model according to claim 1, wherein: semantic segmentation is a task in computer vision in which different parts of the visual input are assigned to different categories according to their semantics, so that, through semantic understanding, each category carries a definite real-world meaning.

6. The method for fine semantic annotation based on the large-scale scene three-dimensional model according to claim 1, wherein: in the MRF optimization in S2, variable weight parameters are introduced into the conventional MRF image segmentation algorithm to connect the label field model and the feature field model and strike a balance between the two, yielding a segmentation result that preserves image edges, important image details, and region consistency; an edge penalty function is then introduced adaptively at the edges to adjust the contribution of the potential-function energy to the energy function, which reduces edge blurring during segmentation and improves edge positioning accuracy.