CN116993752A - Semantic segmentation method, medium and system for real-scene three-dimensional Mesh model - Google Patents


Info

Publication number
CN116993752A
Authority
CN
China
Prior art keywords
feature
mesh model
dimensional mesh
vertex
features
Prior art date
Legal status
Granted
Application number
CN202311258002.6A
Other languages
Chinese (zh)
Other versions
CN116993752B (en)
Inventor
陈浩
资文杰
李军
伍江江
李沛秦
杜春
彭双
熊伟
贾庆仁
杨飞
景宁
钟志农
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202311258002.6A
Publication of CN116993752A
Application granted
Publication of CN116993752B
Legal status: Active
Anticipated expiration


Classifications

    • G06T 7/10 — Image analysis; segmentation, edge detection
    • G06N 3/0464 — Neural networks; architecture: convolutional networks [CNN, ConvNet]
    • G06N 3/048 — Neural networks; activation functions
    • G06N 3/08 — Neural networks; learning methods
    • G06V 10/26 — Image or video recognition; image preprocessing: segmentation of patterns in the image field
    • G06V 10/44 — Image or video recognition; local feature extraction and connectivity analysis
    • G06V 10/764 — Image or video recognition using machine-learning classification
    • G06V 10/82 — Image or video recognition using neural networks
    • G06T 2207/10004 — Image acquisition modality; still image, photographic image
    • G06T 2207/10012 — Stereo images
    • G06T 2207/20081 — Algorithmic details; training, learning
    • G06T 2207/20084 — Artificial neural networks [ANN]


Abstract

The application relates to a semantic segmentation method, medium and system for a real-scene three-dimensional Mesh model. Geometric features are computed from the real-scene three-dimensional Mesh model and then extracted into information features; the features are weighted using an attention mechanism and diffused several times through a residual network to obtain features for all vertices of the model. Finally, the triangular patches of the model are classified according to these vertex features, and a semantic segmentation result for each triangular patch is output. The method directly realizes end-to-end processing of the real-scene three-dimensional Mesh model, improving both its segmentation prediction capability and its segmentation quality.

Description

Semantic segmentation method, medium and system for real-scene three-dimensional Mesh model
Technical Field
The application relates to the field of real-scene three-dimensional image processing, and in particular to a semantic segmentation method, medium and system for a real-scene three-dimensional Mesh model.
Background
With advances in computer science and the emergence of high-quality sensors, aerial photography has matured, and large volumes of real-scene three-dimensional data can now be acquired. In geographic information science, a real-scene three-dimensional Mesh model represents a scene or object from the real world in the form of a three-dimensional model. Real-world data can be acquired by laser scanning, oblique photogrammetry and other means and processed into a three-dimensional model, yielding a highly realistic real-scene three-dimensional Mesh model. Such models have broad application prospects in military decision-making, smart cities, the digital Earth, engineering supervision, hydraulic engineering, virtual reality and other fields.
A real-scene three-dimensional Mesh model is essentially textured mesh data composed of triangular patches and vertices. Each triangular patch carries many attributes, such as vertex indices, texture coordinates and normal vectors, and because the patches vary in size, the data is unstructured spatial three-dimensional data containing abundant spatial geometric and topological information. At present, most real-scene three-dimensional Mesh data is used only for visual display and accelerated reading; analysis of it remains rare, so this massive data cannot realize its full potential, and research on analyzing such three-dimensional data is urgently needed.
Recently, a small body of research has addressed semantic segmentation of real-scene three-dimensional Mesh models using a two-stage method. The first stage performs planarity-sensitive over-segmentation of the model to generate superfaces, where a superface is a group of adjacent triangular faces with similar texture, color, orientation, patch density and other characteristics; this process is called over-segmentation. The second stage computes the relations between superfaces, extracts features from them, and finally classifies the superfaces, thereby achieving semantic segmentation of the real-scene three-dimensional Mesh model.
However, the first stage of the existing method is an over-segmentation process whose result quality cannot be guaranteed, yet that result is critical to the second-stage superface classification. Once the over-segmentation is poor, the accuracy of semantic segmentation for the whole real-scene three-dimensional Mesh model becomes very low. Existing methods thus suffer from low mean intersection-over-union and F1 scores, and their prediction capability is limited by sample imbalance and insufficient recognition of small objects.
Disclosure of Invention
Based on the above, it is necessary to provide a semantic segmentation method, medium and system for real-scene three-dimensional Mesh models, so as to improve the semantic segmentation prediction capability for such models.
In order to achieve the above object, the embodiment of the present invention adopts the following technical scheme:
On one hand, an embodiment of the invention provides a semantic segmentation method for a real-scene three-dimensional Mesh model, comprising the following steps:
acquiring a real-scene three-dimensional Mesh model to be segmented, the model being composed of triangular patches and vertices;
computing geometric features of the real-scene three-dimensional Mesh model;
performing N mutually independent feature extractions on the geometric features to obtain N information features of the same dimension; N is a positive integer;
taking 1 of the N information features as the original feature, processing the remaining N-1 information features with an attention mechanism to obtain a weighted feature, and summing the original feature and the weighted feature to obtain an intermediate feature; the attention mechanism weights the information features through a combination of vector operations;
diffusing the intermediate features and the geometric features M times using a residual network to obtain features of all vertices of the real-scene three-dimensional Mesh model;
classifying and predicting each triangular patch of the real-scene three-dimensional Mesh model according to the features of all its vertices, and outputting a semantic segmentation result for each triangular patch.
In one embodiment, the geometric features of the real-scene three-dimensional Mesh model include: the Laplacian matrix $L$, the mass matrix $M$, the gradient feature $G_x$ in the X-axis direction, the gradient feature $G_y$ in the Y-axis direction, the eigenvalues $\lambda$ and the eigenvectors $\Phi$, wherein:
the Laplacian matrix $L$, the X-axis gradient feature $G_x$ and the Y-axis gradient feature $G_y$ all have size $V \times V$, where $V$ is the number of vertices in the real-scene three-dimensional Mesh model;
the mass matrix $M$ is a vector of dimension $V$;
the eigenvalues $\lambda$ and the eigenvectors $\Phi$ are vectors of the same dimension.
In one embodiment, the N mutually independent feature extractions of the geometric features are performed as follows:
the geometric features undergo N mutually independent feature extractions by linear connection, where the input dimension of the linear layer is the number of vertices in the real-scene three-dimensional Mesh model.
In one embodiment, processing the remaining N-1 information features with an attention mechanism to obtain the weighted feature includes:
taking 1 of the N-1 information features as a first multiplier vector;
performing vector multiplication on the other N-2 information features and normalizing the result to obtain a product vector;
applying the normalized exponential function softmax to the product vector to obtain a second multiplier vector;
multiplying the first multiplier vector by the second multiplier vector to obtain the weighted feature.
In one embodiment, diffusing the intermediate features and the geometric features M times using a residual network to obtain the features of all vertices of the real-scene three-dimensional Mesh model includes:
outputting a first vertex feature according to the geometric features and the intermediate features;
learning and training the geometric features and the intermediate features using a learned diffusion layer, and outputting a second vertex feature according to the training result;
calculating the spatial gradient at each vertex of the real-scene three-dimensional Mesh model according to the second vertex feature, and outputting a third vertex feature according to the spatial gradients;
performing multi-layer perceptron computation on the first, second and third vertex features, and outputting the number of segmentation categories;
obtaining all vertex features of the first diffusion process according to the first vertex feature and the number of segmentation categories;
taking all vertex features of the first diffusion process as input features and returning to the step of outputting a first vertex feature, repeating until all vertex features of the M-th diffusion process are obtained; M is a positive integer;
outputting all vertex features of the M-th diffusion process as the features of all vertices of the real-scene three-dimensional Mesh model.
In one embodiment, learning and training the geometric features and intermediate features with the learned diffusion layer includes:
training the geometric features and intermediate features through different feature channels of the learned diffusion layer; each feature corresponds to an independent feature channel, and the learning time of each feature channel is set independently.
In one embodiment, a general acceleration method is used to speed up the learning and training process of the learned diffusion layer.
In one embodiment, performing the multi-layer perceptron computation on the first, second and third vertex features and outputting the number of segmentation categories includes:
performing the multi-layer perceptron computation on the first, second and third vertex features using linear connection, and outputting the number of segmentation categories.
In one embodiment, each of the M diffusion processes has a different diffusion range.
In one embodiment, the accuracy of the semantic segmentation method is verified by a loss function:

$$\mathcal{L} = \mathcal{L}_{ce} + \mathcal{L}_{reg} + \mathcal{L}_{idx}$$

where $\mathcal{L}_{ce}$ is the cross-entropy loss, $\mathcal{L}_{reg}$ is the regularization loss of the important parts, and $\mathcal{L}_{idx}$ is the index loss.
In one aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a code program is stored; when executed by a processor, the code program implements the steps of any one of the foregoing semantic segmentation methods for a real-scene three-dimensional Mesh model.
On one hand, an embodiment of the invention further provides a semantic segmentation system for real-scene three-dimensional Mesh models, comprising a model acquisition component, a geometric feature calculation component, a feature extraction component, an attention mechanism component, a diffusion component and a classification output component;
the model acquisition component is used to acquire a real-scene three-dimensional Mesh model to be segmented, the model being composed of triangular patches and vertices;
the geometric feature calculation component is used to compute the geometric features of the real-scene three-dimensional Mesh model;
the feature extraction component is used to perform N mutually independent feature extractions on the geometric features through linear layers whose input dimension is the number of vertices in the real-scene three-dimensional Mesh model, obtaining N information features of the same dimension; N is a positive integer;
the attention mechanism component is used to take 1 of the N information features as the original feature, process the remaining N-1 information features with an attention mechanism to obtain a weighted feature, and sum the original feature and the weighted feature to obtain an intermediate feature; the attention mechanism weights the information features through a combination of vector operations;
the diffusion component is used to diffuse the intermediate features and the geometric features M times using a residual network to obtain the features of all vertices of the real-scene three-dimensional Mesh model;
the classification output component is used to classify and predict each triangular patch of the real-scene three-dimensional Mesh model according to the features of all its vertices, and to output a semantic segmentation result for each triangular patch.
In one embodiment, the diffusion component comprises M diffusion modules, each of which comprises a direct calculation module, a spatial diffusion module, a spatial gradient feature module, a multi-layer perceptron and a feature output module;
the direct calculation module of each diffusion module is used to output a first vertex feature according to the geometric features and the intermediate features;
the spatial diffusion module of each diffusion module is used to learn and train the geometric features and intermediate features with the learned diffusion layer and to output a second vertex feature according to the training result;
the spatial gradient feature module of each diffusion module is used to calculate the spatial gradient at each vertex of the real-scene three-dimensional Mesh model according to the second vertex feature and to output a third vertex feature according to the spatial gradients;
the multi-layer perceptron of each diffusion module is used to perform multi-layer perceptron computation on the first, second and third vertex features and to output the number of segmentation categories;
the feature output module of each diffusion module is used to obtain all vertex features of the corresponding diffusion process according to the first vertex feature and the number of segmentation categories;
when the diffusion modules have obtained all vertex features of the M-th diffusion process, those features are output as the features of all vertices of the real-scene three-dimensional Mesh model.
One of the above technical solutions has the following advantages and beneficial effects:
according to the semantic segmentation method for the real-scene three-dimensional Mesh model, the geometric features are obtained through calculation of the real-scene three-dimensional Mesh model, feature extraction is carried out on the geometric features, the features are weighted by using an attention mechanism, the features are subjected to diffusion treatment for a plurality of times by using a residual network, the features of all points of the real-scene three-dimensional Mesh model are obtained, finally triangular patches of the model are classified according to the features of all the points, and the semantic segmentation result of each triangular patch of the real-scene three-dimensional Mesh model is output. The method breaks through the existing two-stage segmentation mode of 'over segmentation' and 'over-surface classification', realizes the end-to-end processing of the live-action three-dimensional Mesh model directly, improves the segmentation prediction capability of the live-action three-dimensional Mesh model, completes the direct construction of abstract features by introducing an attention mechanism, improves the comprehensiveness of small-problem segmentation by a diffusion process, and greatly improves the semantic segmentation prediction capability of the live-action three-dimensional Mesh model.
Drawings
Fig. 1 is a flowchart of steps of a semantic segmentation method of a real-scene three-dimensional Mesh model in an embodiment;
Fig. 2 is a schematic diagram of a real-scene three-dimensional Mesh model obtained in an embodiment;
fig. 3 is a schematic view of the diffusion range of each diffusion process of a semantic segmentation method for real-scene three-dimensional Mesh models in an embodiment, where (a) is the smallest diffusion range, (b) the second-smallest, (c) the third-smallest, and (d) the largest;
fig. 4 is a semantic segmentation system diagram of a real-scene three-dimensional Mesh model in an embodiment;
fig. 5 is a schematic diagram of an operation framework of a semantic segmentation system of a real-scene three-dimensional Mesh model in an embodiment;
fig. 6 is a comparison of the segmentation effect of the semantic segmentation method for real-scene three-dimensional Mesh models (UMD) with conventional segmentation models in one embodiment, where (a) is the original model, (b) is the MRF-RF segmentation result, (c) is the MLP segmentation result, (d) is the SUM-RF segmentation result, (e) is the UMD segmentation result, and (f) is the ground truth.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
In addition, the technical solutions of the embodiments of the present application may be combined with each other, provided that a person skilled in the art can implement the combination; when a combination is contradictory or cannot be implemented, it should be considered absent and outside the scope of protection claimed by the present application.
The application provides a semantic segmentation method for real-scene three-dimensional Mesh models, termed UMD (Urban-Mesh-Diffusion), which, as shown in fig. 1, comprises the following steps:
101: acquiring a real-scene three-dimensional Mesh model to be segmented; the real-scene three-dimensional Mesh model is composed of triangular patches and vertices.
It will be appreciated that the real-scene three-dimensional Mesh model is essentially textured mesh data, as shown in fig. 2, composed of triangular patches, each with three vertices. Each triangular patch carries many attributes, such as vertex indices, texture coordinates and normal vectors, and because the patches differ in size, the data is unstructured spatial three-dimensional data containing abundant spatial geometric and topological information.
102: computing the geometric features of the real-scene three-dimensional Mesh model.
It will be appreciated that computing the geometric features means extracting the geometric characteristics of the model by analyzing its geometric attributes (such as vertices, edges and faces); geometric features may include the model's size, shape, curvature and other geometric information, used to distinguish and describe different models.
103: performing N mutually independent feature extractions on the geometric features to obtain N information features of the same dimension; N is a positive integer.
It will be appreciated that performing N independent feature extractions on the geometric features means that the N extraction networks are independent and share no parameters. In this example, feature extraction uses linear connection and is performed 4 times, and each output information feature has 128 dimensions.
104: taking 1 of the N information features as the original feature, processing the remaining N-1 information features with an attention mechanism to obtain a weighted feature, and summing the original feature and the weighted feature to obtain an intermediate feature; the attention mechanism weights the information features through a combination of vector operations.
It will be appreciated that in this example 1 of the 4 information features is treated as the original feature, the remaining 3 information features are processed with the attention mechanism, and the sum yields the intermediate feature. This ensures that the output features contain the original features as well as the attention weighting, which makes it convenient to compare the similarity between features. The combination of vector operations may be a conventional linear operation or an operation applying a function to the vectors.
105: diffusing the intermediate features and the geometric features M times using a residual network to obtain the features of all vertices of the real-scene three-dimensional Mesh model.
It can be appreciated that a residual network (ResNet) is a deep neural-network architecture that addresses the vanishing- and exploding-gradient problems in training deep models by introducing residual connections.
It can be understood that the role of diffusion is to enrich the feature information of the model's vertices; in this example, 4 diffusions are performed to obtain the feature of every vertex in the Mesh model.
106: classifying and predicting each triangular patch of the real-scene three-dimensional Mesh model according to the features of all its vertices, and outputting a semantic segmentation result for each triangular patch.
It can be understood that the features of all vertices are the features of the vertices of the triangular patches in the model, including topological information such as vertex positions, colors and normal directions. By analyzing the features of each vertex and combining the correspondence between vertices and triangular patches, the features of each triangular patch can be derived, and the patches can then be classified.
It will be understood that semantic segmentation classifies each element (pixel, point, vertex, patch, etc.) of data such as images, point clouds and meshes into different categories. In this example, semantic segmentation is performed for each triangular patch of the real-scene three-dimensional Mesh model; through the classification process each patch can be assigned to a corresponding category, e.g. building, ground or tree.
The semantic segmentation result of each triangular patch is output, and the classification information is summarized and displayed on the model. The user can then see which category each triangular patch belongs to and understand the semantic structure of the scene.
According to the above semantic segmentation method for real-scene three-dimensional Mesh models, geometric features are computed from the model and extracted into information features; the features are weighted using an attention mechanism and diffused several times through a residual network to obtain features for all vertices of the model; finally, the triangular patches of the model are classified according to these vertex features, and a semantic segmentation result for each triangular patch is output. The method breaks with the existing two-stage "over-segmentation" plus "superface classification" paradigm and directly realizes end-to-end processing of the real-scene three-dimensional Mesh model, improving its segmentation prediction capability. Introducing the attention mechanism completes the direct construction of abstract features, and the diffusion process improves the completeness of segmentation for small objects, greatly improving the semantic segmentation prediction capability for real-scene three-dimensional Mesh models.
It should be understood that although the steps in the flowchart of fig. 1 are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily executed at the same moment but may be executed at different moments, and their order of execution is not necessarily sequential; they may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
In one embodiment, the geometric features of the real-scene three-dimensional Mesh model include: the Laplacian matrix $L$, the mass matrix $M$, the gradient feature $G_x$ in the X-axis direction, the gradient feature $G_y$ in the Y-axis direction, the eigenvalues $\lambda$ and the eigenvectors $\Phi$, where:
the Laplacian matrix $L$, the X-axis gradient feature $G_x$ and the Y-axis gradient feature $G_y$ all have size $V \times V$, $V$ being the number of vertices in the real-scene three-dimensional Mesh model;
the mass matrix $M$ is a vector of dimension $V$;
the eigenvalues $\lambda$ and the eigenvectors $\Phi$ are vectors of the same dimension. It will be appreciated that in this example the eigenvalue and eigenvector dimension is 128.
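By way of illustration only, the sketch below shows one way this precomputation might be carried out, assuming the third-party robust_laplacian and scipy packages; the function and variable names (precompute_operators, verts, faces, k) are illustrative and not from the patent.

```python
# A minimal sketch of the geometric precomputation; names are illustrative.
import robust_laplacian
from scipy.sparse.linalg import eigsh

def precompute_operators(verts, faces, k=128):
    """verts: (V, 3) float array; faces: (F, 3) int array."""
    # Weak (cotan-style) Laplacian L and lumped diagonal mass matrix M.
    L, M = robust_laplacian.mesh_laplacian(verts, faces)
    # Generalized eigenproblem L phi = lambda M phi for the k
    # smallest-magnitude eigenvalues, via shift-invert around zero.
    evals, evecs = eigsh(L, k=k, M=M, sigma=1e-8)
    massvec = M.diagonal()        # per-vertex mass as a length-V vector
    return L, massvec, evals, evecs
```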
In one embodiment, the N mutually independent feature extractions of the geometric features are performed by linear connection, where the input dimension of the linear layer is the number of vertices in the real-scene three-dimensional Mesh model.
It is understood that a linear connection is a linear layer in a neural network that multiplies the input data by a weight matrix to extract linear relationships in the data. In this example, linear connections are used to process the geometric features of the real-scene three-dimensional Mesh model.
In one embodiment, processing the remaining N-1 information features with an attention mechanism to obtain the weighted feature includes:
taking 1 of the N-1 information features as a first multiplier vector;
performing vector multiplication on the other N-2 information features and normalizing the result to obtain a product vector;
applying the normalized exponential function softmax to the product vector to obtain a second multiplier vector;
multiplying the first multiplier vector by the second multiplier vector to obtain the weighted feature.
It can be understood that in this example N = 4. Denote the three information features entering the attention mechanism by a, b and c, with c taken as the first multiplier vector. The vectors a and b are multiplied and the result is normalized to give the product vector s; the normalized exponential function softmax is applied to s to give the second multiplier vector; finally, c is multiplied by softmax(s) to obtain the final feature vector A after the attention mechanism, which serves as the weighted feature.
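For illustration only, a minimal PyTorch sketch of the four independent linear extractions and this attention weighting follows; taking the normalization to be a division by sqrt(d), as in scaled dot-product attention, is an assumption, and all names are illustrative.

```python
# Sketch of the N = 4 independent extractions and the attention weighting;
# the sqrt(d) normalization is an assumption, names are illustrative.
import math
import torch
import torch.nn as nn

class AttentionWeighting(nn.Module):
    def __init__(self, in_dim, d=128):
        super().__init__()
        # Four mutually independent linear extractions (no shared weights).
        self.extract = nn.ModuleList(nn.Linear(in_dim, d) for _ in range(4))
        self.d = d

    def forward(self, x):                     # x: (V, in_dim)
        orig, a, b, c = (lin(x) for lin in self.extract)
        s = (a * b) / math.sqrt(self.d)       # product of a and b, normalized
        s = torch.softmax(s, dim=-1)          # second multiplier vector
        weighted = c * s                      # first multiplier (c) times it
        return orig + weighted                # intermediate feature
```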
In one embodiment, the step of diffusing the intermediate features and the geometric features M times using a residual network to obtain the features of all vertices of the real-scene three-dimensional Mesh model includes:
outputting a first vertex feature according to the geometric features and the intermediate features;
learning and training the geometric features and the intermediate features using a learned diffusion layer, and outputting a second vertex feature according to the training result;
calculating the spatial gradient at each vertex of the real-scene three-dimensional Mesh model according to the second vertex feature, and outputting a third vertex feature according to the spatial gradients;
performing multi-layer perceptron computation on the first, second and third vertex features, and outputting the number of segmentation categories;
obtaining all vertex features of the first diffusion process according to the first vertex feature and the number of segmentation categories;
taking all vertex features of the first diffusion process as input features and returning to the step of outputting a first vertex feature, repeating until all vertex features of the M-th diffusion process are obtained; M is a positive integer;
outputting all vertex features of the M-th diffusion process as the features of all vertices of the real-scene three-dimensional Mesh model.
It will be appreciated that the diffusion process is defined by the heat equation:

$$\frac{\partial u}{\partial t} = \Delta u \qquad (1)$$

where $\Delta$ denotes the Laplace operator. Diffusion can be represented by an operator $H_t$ applied to an initial distribution $u_0$ to produce the diffused distribution $H_t(u_0)$; its effect can be defined as $H_t(u_0) = \exp(t\Delta)\,u_0$, where $\exp$ denotes the operator exponential. Diffusion is an increasingly global smoothing process over time: for $t \to 0$, $H_t$ is the identity map, and as $t \to \infty$ it approaches the average over the domain.
Spatial feature propagation on a surface uses the heat equation, whose basic principle ensures that the result is essentially invariant to how the surface is sampled or meshed. For discretized diffusion, the method replaces $\Delta$ with the weak Laplace matrix $L$ and the mass matrix $M$, where $L$ is a positive semi-definite sparse matrix with the opposite sign convention, such that $\Delta \approx M^{-1}L$. $L$ and $M$ are usually of size $V \times V$. On real-scene three-dimensional Mesh models the cotan-Laplace matrix is used, which is very common in geometry-processing applications; such matrices have also been defined for voxel grids.
It will be appreciated that the learned diffusion layer extracts the spatial semantic feature vector S of the real-scene three-dimensional Mesh model well, and S progressively incorporates spatial topological relations. This simple network operates on scalar values with an overall fixed channel width D = 128. Each diffusion process diffuses the features, builds spatial gradient features, and applies multi-layer perceptron computation to the result, including a residual connection to stabilize training; linear layers convert the input and output dimensions to the desired sizes. Where appropriate, face-level results for the Mesh model can be computed by averaging the network outputs of adjacent vertices, followed by a softmax for segmentation, or by a global average followed by a softmax for classification. No spatial convolutions or pooling hierarchies on the surface are needed; avoiding these potentially complex operations helps preserve the simplicity and robustness of the diffusion process.
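Purely as an illustration of this face-level readout, a short sketch (names illustrative) of averaging vertex outputs over each triangular patch:

```python
# Sketch: per-face logits from the mean of a face's three vertex logits,
# followed by softmax and argmax; names are illustrative.
import torch

def face_predictions(vertex_logits, faces):
    """vertex_logits: (V, C) tensor; faces: (F, 3) long tensor."""
    face_logits = vertex_logits[faces].mean(dim=1)   # average the 3 vertices
    probs = torch.softmax(face_logits, dim=-1)       # per-patch class scores
    return probs.argmax(dim=-1)                      # category per patch
```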
It will be appreciated that in this example the learned diffusion layer $h_t$ does not change the shape or size of its input; it learns a diffusion time $t_d$ for each feature channel $d$. In the network, $h_t$ is applied independently to each feature channel, and each channel has its own learned time $t_d$. Learning the diffusion times lets the network continuously optimize the spatial support, from purely local to fully global, even selecting a different receptive field for each feature, as shown in fig. 3: fig. 3(a) shows the smallest diffusion range, fig. 3(b) the second-smallest, fig. 3(c) the third-smallest, and fig. 3(d) the largest. This avoids problems such as manually selecting a convolution support radius or pooling-level size. From a deep-learning perspective, diffusion can be regarded as a smooth mean-pooling operation; it has a fundamental geometric meaning, its support can range from purely local to fully global through the choice of diffusion time, and it is differentiable with respect to the diffusion time, allowing the spatial support to be optimized automatically as a network parameter.
The present example uses a closed-form expression for diffusion in the basis of low-frequency Laplacian eigenfunctions (spectral acceleration). Once this eigenbasis has been precomputed, diffusion for any time $t$ can be evaluated by an element-wise exponential; truncating the diffusion to a low-frequency basis introduces some approximation error. For the weak Laplace matrix $L$ and mass matrix $M$, the eigenvectors $\phi_i$ are solutions of the following equation:

$$L\,\phi_i = \lambda_i M \phi_i \qquad (2)$$

for the first $k$ smallest-magnitude eigenvalues $\lambda_i$, normalized such that the following condition is satisfied:

$$\phi_i^{\top} M \phi_j = \delta_{ij} \qquad (3)$$

Let $\Phi = [\phi_1, \ldots, \phi_k]$ denote the stacked matrix of eigenvectors, which form an orthogonal basis with respect to $M$. The diffusion layer is then evaluated by projecting onto the spectral basis, evaluating the point-wise diffusion, and projecting the result back onto the vertices:

$$h_t(u) = \Phi \begin{bmatrix} e^{-\lambda_1 t} \\ \vdots \\ e^{-\lambda_k t} \end{bmatrix} \odot \big(\Phi^{\top} M u\big) \qquad (4)$$

In equation 4, $\odot$ denotes the Hadamard product, i.e. multiplication of corresponding positions. This operation can be evaluated efficiently with dense linear-algebra operations such as element-wise exponentiation and matrix multiplication, and is easily differentiable with respect to $u$ and $t$.
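For illustration, a minimal PyTorch sketch of equation (4) with one learned diffusion time per channel follows; the class and parameter names are assumptions, not the patent's.

```python
# Sketch of the learned diffusion layer of equation (4): project onto the
# spectral basis, decay each mode by exp(-lambda * t_d) with one learned
# time per channel, and project back. Names are illustrative.
import torch
import torch.nn as nn

class LearnedDiffusion(nn.Module):
    def __init__(self, channels=128):
        super().__init__()
        # One learnable diffusion time per feature channel (kept positive).
        self.log_t = nn.Parameter(torch.zeros(channels))

    def forward(self, u, mass, evals, evecs):
        # u: (V, D) features; mass: (V,); evals: (k,); evecs: (V, k).
        t = self.log_t.exp()                          # per-channel time t_d
        coeffs = evecs.T @ (mass.unsqueeze(-1) * u)   # Phi^T M u  ->  (k, D)
        decay = torch.exp(-evals.unsqueeze(-1) * t)   # e^{-lambda_i * t_d}
        return evecs @ (decay * coeffs)               # back to vertices
```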
The diffusion layer enables information to propagate between the different triangular patches of a shape, but it only supports radially symmetric filters about a point. The last building block of the method creates a larger filter space by computing the spatial gradients of the vertex signal values. Specifically, features are constructed from the inner products between pairs of feature gradients at each vertex, which are computed with a learned scaling or rotation.
The gradient computation represents the spatial gradient of a scalar function on the surface as a two-dimensional vector in each vertex's tangent space. These gradients can be computed by standard procedures: a normal vector is selected at each vertex (given as input or estimated locally) and the neighborhood, i.e. the 1-ring neighbors on the mesh, is projected into the tangent plane. Gradients are then computed in the tangent plane by a least-squares approximation of the function values in the neighborhood. These per-vertex gradient operators can be assembled into a sparse matrix $G \in \mathbb{C}^{V \times V}$; applying it to a real-valued vertex vector $u \in \mathbb{R}^{V}$ generates a gradient tangent vector at each vertex. The matrix is feature-independent and can be precomputed for each shape. Complex numbers provide a convenient notation for tangent vectors with respect to an arbitrary reference basis; if the normal directions are consistent, the imaginary axis is chosen right-handed with respect to the surface normal.
The resulting pairwise products are learned. Information-rich scalar features are learned from the inner products between the local two-dimensional gradients at each vertex, combining the spatial gradients of all channels. The inner product is invariant to rotations of the coordinate system, so these features are invariant to the choice of tangent basis at each vertex, as expected. Taken together, given a set of scalar feature channels $u \in \mathbb{R}^{V \times D}$, their spatial gradients are first computed as

$$w = G u, \qquad (5)$$

the local 2D gradient vectors for each vertex.

Then, at each vertex $v$, the local gradients of all $D$ channels are stacked to form $z_v \in \mathbb{C}^{D}$, and real-valued features $g_v$ are obtained as follows:

$$g_v = \mathrm{Re}\big(\bar{z}_v \odot A z_v\big) \qquad (6)$$

In equation 6, $A \in \mathbb{C}^{D \times D}$ is a learnable complex matrix; the conjugate of $z_v$ is taken and the real part Re extracted, for convenience in computing the dot product between 2D vectors. Thus, at vertex $v$, the $d$-th output component is given by the dot product:

$$(g_v)_d = \tanh\!\big(\mathrm{Re}\,\langle \bar{z}_{v,d},\ (A z_v)_d \rangle\big) \qquad (7)$$

In equation 7, tanh denotes the tanh activation function; this nonlinear transformation is not fundamental but contributes to the stability of training. The spatial gradient features are computed as in equation 7.
The multi-layer perceptron computation in the diffusion process consists of 3 fully connected linear layers, and its output size is the number of categories to be segmented.
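Combining the preceding pieces, one diffusion block might look as sketched below; here the MLP keeps the channel width and a final linear layer (not shown) maps to the number of categories, an assumption made to reconcile the residual connection with the class-count output. The diffusion and gradient submodules are injected, e.g. the LearnedDiffusion and SpatialGradientFeatures sketches above; all names are illustrative.

```python
import torch
import torch.nn as nn

class DiffusionBlock(nn.Module):
    """One diffusion block: direct, diffused and gradient features feed a
    3-layer MLP, with a residual connection back to the block input."""
    def __init__(self, diffusion: nn.Module, grad_feats: nn.Module,
                 channels: int = 128):
        super().__init__()
        self.diffusion = diffusion       # e.g. the LearnedDiffusion sketch
        self.grad_feats = grad_feats     # e.g. the SpatialGradientFeatures sketch
        self.mlp = nn.Sequential(        # 3 fully connected linear layers
            nn.Linear(3 * channels, channels), nn.ReLU(),
            nn.Linear(channels, channels), nn.ReLU(),
            nn.Linear(channels, channels))

    def forward(self, x, mass, evals, evecs, G):
        x_diff = self.diffusion(x, mass, evals, evecs)   # second vertex features
        x_grad = self.grad_feats(x_diff, G)              # third vertex features
        h = torch.cat([x, x_diff, x_grad], dim=-1)
        return x + self.mlp(h)                           # residual connection
```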
In one embodiment, the accuracy of the semantic segmentation method is verified by a loss function:

$$\mathcal{L} = \mathcal{L}_{ce} + \mathcal{L}_{reg} + \mathcal{L}_{idx} \qquad (8)$$

In equation 8, $\mathcal{L}_{ce}$ denotes the cross-entropy loss, $\mathcal{L}_{reg}$ the regularization loss of the important parts, and $\mathcal{L}_{idx}$ the index loss.
$$\mathcal{L}_{ce} = -\frac{1}{N_s}\sum_{i=1}^{N_s}\sum_{c=1}^{C} y_{ic}\,\log p_{ic} \qquad (9)$$

where $N_s$ denotes the total number of samples, $C$ the number of categories, $y_{ic}$ a sign function taking 1 if the true category of sample $i$ equals $c$ and 0 otherwise, and $p_{ic}$ the predicted probability that observed sample $i$ belongs to category $c$.
The regularization loss $\mathcal{L}_{reg}$ (equation 10) is computed over the network layers $l$ in the diffusion modules, where $w_l$ denotes the weights of each layer. The index loss $\mathcal{L}_{idx}$ (equation 11) is computed from the intersection-over-union IoU, the F1 score $F_1$ and the recall $R$.
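A sketch of a loss with the structure of equation (8) follows; the L2 form assumed for the regularization term and the (1 - metric) form assumed for the index loss, as well as all names, are illustrative assumptions rather than the patent's exact formulas.

```python
# Sketch of a composite loss shaped like equation (8); the L2 and
# (1 - metric) forms below are assumptions, not the patent's formulas.
import torch
import torch.nn.functional as F

def total_loss(logits, target, diffusion_modules, iou, f1, recall,
               reg_weight=1e-4):
    """logits: (n, C); target: (n,) long; iou/f1/recall: scalar tensors."""
    l_ce = F.cross_entropy(logits, target)               # cf. equation (9)
    l_reg = reg_weight * sum(p.pow(2).sum()              # assumed L2 form (10)
                             for m in diffusion_modules
                             for p in m.parameters())
    l_idx = (1 - iou) + (1 - f1) + (1 - recall)          # assumed form of (11)
    return l_ce + l_reg + l_idx
```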
In one aspect, an embodiment of the present invention provides a computer readable storage medium, on which a code program is stored, where the code program when executed by a processor implements the steps of any one of the foregoing semantic segmentation methods for a real-scene three-dimensional Mesh model.
It can be understood that implementing the semantic segmentation method for real-scene three-dimensional Mesh models requires no changes to hardware; only software-level design and modification is needed.
In some embodiments, to describe the foregoing semantic segmentation method for real-scene three-dimensional Mesh models (UMD: Urban-Mesh-Diffusion) more intuitively and completely, an experimental verification of the method is given below, comparing its segmentation results with those of the conventional MRF-RF, MLP and SUM-RF models, as shown in Table 1. It should be noted that the experimental cases given in this specification are only illustrative and do not limit the specific embodiments of the invention; guided by the embodiments provided here, those skilled in the art may adopt the semantic segmentation method provided above to achieve semantic segmentation of real-scene three-dimensional Mesh models in different application scenarios.
The experiments were implemented in the Python language with the PyTorch framework. Because of the network's complexity and the heavy computation of the loss function, the model was trained on an NVIDIA RTX 3090 with the CUDA 11.2 API. For the experimental hyper-parameters, the initial learning rate was 0.0001, reduced to 50% of its value every 50 epochs.
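As an illustration of this schedule, a minimal PyTorch sketch follows; the choice of the Adam optimizer and the stand-in model are assumptions.

```python
# Sketch of the training configuration: lr 1e-4, halved every 50 epochs.
import torch

model = torch.nn.Linear(128, 6)   # stand-in for the full UMD network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)

for epoch in range(200):
    # ... run one training epoch here ...
    scheduler.step()              # decay the learning rate every 50 epochs
```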
The experiments used the open-source dataset SUM-Helsinki, the largest benchmark dataset of semantic urban meshes, covering approximately 4 square kilometers of Helsinki, Finland, with six object categories in total. The entire dataset contains 64 tiles, each covering an area of 250 m × 250 m; 40 tiles (62.5% of the dataset) are used as the training set, 12 tiles (18.75%) as the test set, and the other 12 tiles as the validation set.
Table 1: Results of UMD and the comparative models
On one hand, as shown in fig. 4, an embodiment of the invention provides a semantic segmentation system for real-scene three-dimensional Mesh models, comprising a model acquisition component, a geometric feature calculation component, a feature extraction component, an attention mechanism component, a diffusion component and a classification output component;
the model acquisition component is used to acquire a real-scene three-dimensional Mesh model to be segmented, the model being composed of triangular patches and vertices;
the geometric feature calculation component is used to compute the geometric features of the real-scene three-dimensional Mesh model;
the feature extraction component is used to perform N mutually independent feature extractions on the geometric features through linear layers whose input dimension is the number of vertices in the real-scene three-dimensional Mesh model, obtaining N information features of the same dimension; N is a positive integer;
the attention mechanism component is used to take 1 of the N information features as the original feature, process the remaining N-1 information features with an attention mechanism to obtain a weighted feature, and sum the original feature and the weighted feature to obtain an intermediate feature; the attention mechanism weights the information features through a combination of vector operations;
the diffusion component is used to diffuse the intermediate features and the geometric features M times using a residual network to obtain the features of all vertices of the real-scene three-dimensional Mesh model;
the classification output component is used to classify and predict each triangular patch of the real-scene three-dimensional Mesh model according to the features of all its vertices, and to output a semantic segmentation result for each triangular patch.
In this example, N = 4 and M = 4, and the multi-layer perceptron consists of 3 fully connected linear layers whose output size is the number of categories to be segmented. As shown in fig. 5, the diffusion component comprises four diffusion modules, each of which comprises a direct calculation module, a spatial diffusion module, a spatial gradient feature module, a multi-layer perceptron and a feature output module;
the direct calculation module of each diffusion module is used to output a first vertex feature according to the geometric features and the intermediate features;
the spatial diffusion module of each diffusion module is used to learn and train the geometric features and intermediate features with the learned diffusion layer and to output a second vertex feature according to the training result;
the spatial gradient feature module of each diffusion module is used to calculate the spatial gradient at each vertex of the real-scene three-dimensional Mesh model according to the second vertex feature and to output a third vertex feature according to the spatial gradients;
the multi-layer perceptron of each diffusion module is used to perform multi-layer perceptron computation on the first, second and third vertex features and to output the number of segmentation categories;
the feature output module of each diffusion module is used to obtain all vertex features of the corresponding diffusion process according to the first vertex feature and the number of segmentation categories;
when the diffusion modules have obtained all vertex features of the fourth (M-th) diffusion process, those features are output as the features of all vertices of the real-scene three-dimensional Mesh model.
Using the semantic segmentation method for real-scene three-dimensional Mesh models (UMD: Urban-Mesh-Diffusion), the comparison of its effect with conventional segmentation models is shown in fig. 6, where fig. 6(a) is the original model, fig. 6(b) is the MRF-RF segmentation result, fig. 6(c) is the MLP segmentation result, fig. 6(d) is the SUM-RF segmentation result, fig. 6(e) is the UMD segmentation result, and fig. 6(f) is the ground truth.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features is not contradictory, it should be considered within the scope of this specification.
The above examples illustrate only a few embodiments of the application; they are described in detail but are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art may make several variations and improvements without departing from the concept of the application, all of which fall within its scope of protection. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (13)

1. A semantic segmentation method for a real-scene three-dimensional Mesh model, characterized by comprising the following steps:
acquiring a real-scene three-dimensional Mesh model to be segmented, the real-scene three-dimensional Mesh model being composed of triangular patches and vertices;
computing geometric features of the real-scene three-dimensional Mesh model;
performing N mutually independent feature extractions on the geometric features to obtain N information features of the same dimension; N is a positive integer;
taking 1 of the N information features as the original feature, processing the remaining N-1 information features with an attention mechanism to obtain a weighted feature, and summing the original feature and the weighted feature to obtain an intermediate feature; the attention mechanism weights the information features through a combination of vector operations;
diffusing the intermediate features and the geometric features M times using a residual network to obtain features of all vertices of the real-scene three-dimensional Mesh model;
classifying and predicting each triangular patch of the real-scene three-dimensional Mesh model according to the features of all its vertices, and outputting a semantic segmentation result for each triangular patch.
2. The semantic segmentation method for a real-scene three-dimensional Mesh model according to claim 1, wherein the geometric features of the real-scene three-dimensional Mesh model include: the Laplacian matrix $L$, the mass matrix $M$, the gradient feature $G_x$ in the X-axis direction, the gradient feature $G_y$ in the Y-axis direction, the eigenvalues $\lambda$ and the eigenvectors $\Phi$, wherein:
the Laplacian matrix $L$, the X-axis gradient feature $G_x$ and the Y-axis gradient feature $G_y$ all have size $V \times V$, where $V$ is the number of vertices in the real-scene three-dimensional Mesh model;
the mass matrix $M$ is a vector of dimension $V$;
the eigenvalues $\lambda$ and the eigenvectors $\Phi$ are vectors of the same dimension.
3. The semantic segmentation method for a real-scene three-dimensional Mesh model according to claim 1, wherein the N mutually independent feature extractions of the geometric features comprise:
performing the N mutually independent feature extractions on the geometric features by linear connection, where the input dimension of the linear layer is the number of vertices in the real-scene three-dimensional Mesh model.
4. The semantic segmentation method for a real-scene three-dimensional Mesh model according to claim 1, wherein processing the remaining N-1 information features with an attention mechanism to obtain the weighted feature comprises the following steps:
taking 1 of the N-1 information features as a first multiplier vector;
performing vector multiplication on the other N-2 information features and normalizing the result to obtain a product vector;
applying the normalized exponential function softmax to the product vector to obtain a second multiplier vector;
multiplying the first multiplier vector by the second multiplier vector to obtain the weighted feature.
5. The semantic segmentation method of a live-action three-dimensional Mesh model according to claim 1, wherein diffusing the intermediate features and the geometric features M times using a residual network to obtain the features of all vertices of the live-action three-dimensional Mesh model comprises:
outputting a first vertex feature according to the geometric features and the intermediate features;
learning and training the geometric features and the intermediate features using a learned diffusion layer, and outputting a second vertex feature according to the training result;
computing the spatial gradient of each vertex of the live-action three-dimensional Mesh model according to the second vertex feature, and outputting a third vertex feature according to the spatial gradient;
performing multi-layer perceptron computation on the first, second, and third vertex features, and outputting the number of segmentation categories;
obtaining all vertex features of the first diffusion process according to the first vertex feature and the number of segmentation categories;
taking all vertex features of the first diffusion process as input features and repeating the steps from outputting the first vertex feature through outputting the number of segmentation categories, until all vertex features of an M-th diffusion process are obtained, M being a positive integer;
outputting all vertex features of the M-th diffusion process as the features of all vertices of the live-action three-dimensional Mesh model.
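One plausible reading of such a diffusion pass (which also accommodates the per-channel learned diffusion times of claim 6 below) is the spectral heat-diffusion formulation popularized by DiffusionNet, where each channel c is diffused for its own learned time t_c as Phi · diag(exp(-λ·t_c)) · Phi^T · M · x. A minimal sketch, with all names illustrative and the spatial-gradient step stubbed out:

    import torch
    import torch.nn as nn

    def spatial_gradient_features(x):
        # Placeholder: a real implementation derives per-vertex tangent-plane
        # gradients from the mesh geometry; identity is used here only so the
        # sketch runs end to end.
        return x

    class DiffusionBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.t = nn.Parameter(torch.rand(channels))   # per-channel diffusion times (claim 6)
            self.mlp = nn.Sequential(nn.Linear(3 * channels, channels),
                                     nn.ReLU(),
                                     nn.Linear(channels, channels))

        def forward(self, x, massvec, evals, evecs):
            x1 = x                                         # first vertex feature (direct path)
            coef = evecs.T @ (massvec[:, None] * x)        # project into the spectral basis
            decay = torch.exp(-evals[:, None] * self.t[None, :])
            x2 = evecs @ (decay * coef)                    # second: learned-diffusion output
            x3 = spatial_gradient_features(x2)             # third: spatial-gradient features
            out = self.mlp(torch.cat([x1, x2, x3], dim=-1))
            return x + out                                 # residual connection

Stacking M such blocks, each consuming the previous block's output, mirrors the claim's repeat-until-M structure.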
6. The semantic segmentation method of a live-action three-dimensional Mesh model according to claim 5, wherein learning and training the geometric features and the intermediate features using the learned diffusion layer comprises:
the learned diffusion layer trains the geometric features and the intermediate features through different feature channels; each feature corresponds to its own feature channel, whose diffusion learning time is set individually.
7. The semantic segmentation method of a live-action three-dimensional Mesh model according to claim 6, wherein a general acceleration method is adopted to accelerate the learning and training process of the learned diffusion layer.
8. The semantic segmentation method of a live-action three-dimensional Mesh model according to claim 7, wherein performing multi-layer perceptron computation on the first, second, and third vertex features and outputting the number of segmentation categories comprises:
performing the multi-layer perceptron computation on the first, second, and third vertex features using linear connections, and outputting the number of segmentation categories.
9. The semantic segmentation method of a live-action three-dimensional Mesh model according to claim 5, wherein, in the M diffusion processes, the range of each diffusion differs in size.
10. The semantic segmentation method of a live-action three-dimensional Mesh model according to claim 1, wherein the accuracy of the semantic segmentation method is verified by a loss function, the loss function being:
L = L_ce + L_reg + L_exp
where L_ce is the cross-entropy loss, L_reg is the regularization loss of the important parts, and L_exp is the exponential loss.
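The exact published formula is lost to text extraction, so the sum above and the sketch below assume an unweighted combination; the regularization and exponential terms are model-specific and are therefore passed in rather than invented:

    import torch
    import torch.nn.functional as F

    def total_loss(logits, labels, l_reg, l_exp):
        # Cross-entropy over the per-patch predictions plus the two extra
        # terms named in claim 10 (assumed to be combined by simple addition).
        l_ce = F.cross_entropy(logits, labels)
        return l_ce + l_reg + l_exp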
11. A computer-readable storage medium on which a program is stored, characterized in that the program, when executed by a processor, implements the steps of the live-action three-dimensional Mesh model semantic segmentation method according to any one of claims 1 to 10.
12. A semantic segmentation system for a live-action three-dimensional Mesh model, characterized by comprising a model acquisition component, a geometric feature computation component, a feature extraction component, an attention mechanism component, a diffusion component, and a classification output component;
the model acquisition component is used for acquiring a live-action three-dimensional Mesh model to be segmented, the live-action three-dimensional Mesh model consisting of triangular patches and vertices;
the geometric feature computation component is used for computing the geometric features of the live-action three-dimensional Mesh model;
the feature extraction component is used for performing N mutually independent feature extractions on the geometric features, with the input dimension of the linear connection being the number of vertices in the live-action three-dimensional Mesh model, to obtain N information features of the same dimension, N being a positive integer;
the attention mechanism component is used for taking one of the N information features as an original feature, processing the remaining N-1 information features with an attention mechanism to obtain a weighted feature, and summing the original feature and the weighted feature to obtain an intermediate feature; the attention mechanism weights the information features through a combination of vector operations;
the diffusion component is used for diffusing the intermediate features and the geometric features M times using a residual network to obtain the features of all vertices of the live-action three-dimensional Mesh model;
the classification output component is used for performing classification prediction on each triangular patch of the live-action three-dimensional Mesh model according to the features of all its vertices, and outputting a semantic segmentation result for each triangular patch of the live-action three-dimensional Mesh model.
13. The live-action three-dimensional Mesh model semantic segmentation system according to claim 12, wherein the diffusion component comprises M diffusion modules, each diffusion module comprising a direct computation module, a spatial diffusion module, a spatial gradient feature module, a multi-layer perceptron, and a feature output module;
the direct computation module of each diffusion module is used for outputting a first vertex feature according to the geometric features and the intermediate features;
the spatial diffusion module of each diffusion module is used for learning and training the geometric features and the intermediate features using a learned diffusion layer, and outputting a second vertex feature according to the training result;
the spatial gradient feature module of each diffusion module is used for computing the spatial gradient of each vertex of the live-action three-dimensional Mesh model according to the second vertex feature, and outputting a third vertex feature according to the spatial gradient;
the multi-layer perceptron of each diffusion module is used for performing multi-layer perceptron computation on the first, second, and third vertex features and outputting the number of segmentation categories;
the feature output module of each diffusion module is used for obtaining all vertex features of the first diffusion process according to the first vertex feature and the number of segmentation categories;
when the diffusion modules have obtained all vertex features of the M-th diffusion process, all vertex features of the M-th diffusion process are output as the features of all vertices of the live-action three-dimensional Mesh model.
CN202311258002.6A 2023-09-27 2023-09-27 Semantic segmentation method, medium and system for live-action three-dimensional Mesh model Active CN116993752B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311258002.6A CN116993752B (en) 2023-09-27 2023-09-27 Semantic segmentation method, medium and system for live-action three-dimensional Mesh model

Publications (2)

Publication Number Publication Date
CN116993752A (en) 2023-11-03
CN116993752B (en) 2024-01-09

Family

ID=88528654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311258002.6A Active CN116993752B (en) 2023-09-27 2023-09-27 Semantic segmentation method, medium and system for live-action three-dimensional Mesh model

Country Status (1)

Country Link
CN (1) CN116993752B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB9827012D0 (en) * 1998-04-28 1999-02-03 Daewoo Electronics Co Ltd Triangular mesh decimating method and apparatus by using a highpass filter
US20030056799A1 (en) * 2001-09-06 2003-03-27 Stewart Young Method and apparatus for segmentation of an object
US20060265198A1 (en) * 2005-03-01 2006-11-23 Satoshi Kanai Apparatus and a method of feature edge extraction from a triangular mesh model
CN101470894A (en) * 2007-12-26 2009-07-01 中国科学院自动化研究所 Stroke type interaction method for real-time partition of three-dimensional gridding model
CN103268635A (en) * 2013-05-15 2013-08-28 北京交通大学 Segmentation and semantic annotation method of geometry grid scene model
KR20170089752A (en) * 2016-01-27 2017-08-04 한국과학기술원 3-Dimensional Mesh Model watermarking Method Using Segmentation and Apparatus Therefor
KR20200080970A (en) * 2018-12-27 2020-07-07 포항공과대학교 산학협력단 Semantic segmentation method of 3D reconstructed model using incremental fusion of 2D semantic predictions
CN110163822A (en) * 2019-05-14 2019-08-23 武汉大学 The netted analyte detection and minimizing technology and system cut based on super-pixel segmentation and figure
US20210012567A1 (en) * 2019-07-08 2021-01-14 Kabushiki Kaisha Toshiba Computer vision method and system
WO2021118200A1 (en) * 2019-12-10 2021-06-17 한국전자기술연구원 3d model compression and decompression system and method based on 3d mesh segmentation
CN113077553A (en) * 2021-04-06 2021-07-06 华南理工大学 Three-dimensional model segmentation method based on surface attributes
CN115063554A (en) * 2022-06-02 2022-09-16 浙大宁波理工学院 Three-dimensional shape segmentation method based on voxel and grid representation mode fusion
CN116681895A (en) * 2023-06-15 2023-09-01 南京航空航天大学 Method, system, equipment and medium for segmenting airplane grid model component

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
INRIA SOPHIA ANTIPOLIS et al.: "Semantic segmentation of 3D textured meshes for urban scene analysis", Elsevier *
成浩维 et al.: "Semi-supervised learning-based 3D Mesh building facade extraction and semantic segmentation method", Journal of Zhengzhou University (Natural Science Edition) *
梁楚萍 et al.: "A survey of cluster analysis techniques in 3D mesh segmentation", Journal of Computer-Aided Design & Computer Graphics *

Similar Documents

Publication Publication Date Title
CN111738124B (en) Remote sensing image cloud detection method based on Gabor transformation and attention
Chen et al. Automatic building information model reconstruction in high-density urban areas: Augmenting multi-source data with architectural knowledge
CN110309842B (en) Object detection method and device based on convolutional neural network
CN107808138B Communication signal identification method based on Faster R-CNN
CN107301643B Salient object detection method based on robust sparse representation and Laplacian regularization terms
CN104866868A (en) Metal coin identification method based on deep neural network and apparatus thereof
CN112801169A (en) Camouflage target detection method based on improved YOLO algorithm
Oesterling et al. Visual analysis of high dimensional point clouds using topological landscapes
CN111046917B (en) Object-based enhanced target detection method based on deep neural network
CN113989662A (en) Remote sensing image fine-grained target identification method based on self-supervision mechanism
CN113988147B (en) Multi-label classification method and device for remote sensing image scene based on graph network, and multi-label retrieval method and device
CN111126385A (en) Deep learning intelligent identification method for deformable living body small target
CN112101189B (en) SAR image target detection method and test platform based on attention mechanism
CN113159232A (en) Three-dimensional target classification and segmentation method
CN110598746A (en) Adaptive scene classification method based on ODE solver
Ranganath et al. Estimating the fractal dimension of images using pixel range calculation technique
CN114219824A (en) Visible light-infrared target tracking method and system based on deep network
CN115797561A (en) Three-dimensional reconstruction method, device and readable storage medium
CN115311502A (en) Remote sensing image small sample scene classification method based on multi-scale double-flow architecture
CN112241676A (en) Method for automatically identifying terrain sundries
CN111739037A (en) Semantic segmentation method for indoor scene RGB-D image
CN115393601A (en) Three-dimensional target detection method based on point cloud data
CN107392211A Salient object detection method based on visual sparse cognition
CN116993752B (en) Semantic segmentation method, medium and system for live-action three-dimensional Mesh model
CN116958033A (en) Abnormality detection method, model training method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant