CN116958957A - Training method of multi-mode feature extraction network and three-dimensional feature representation method - Google Patents

Training method of multi-mode feature extraction network and three-dimensional feature representation method

Info

Publication number
CN116958957A
CN116958957A CN202310938930.0A
Authority
CN
China
Prior art keywords
feature vector
text
target
feature
point cloud
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310938930.0A
Other languages
Chinese (zh)
Inventor
王昊为
唐霁霁
张荣升
赵敏达
李林橙
赵增
吕唐杰
胡志鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202310938930.0A priority Critical patent/CN116958957A/en
Publication of CN116958957A publication Critical patent/CN116958957A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a training method of a multi-modal feature extraction network and a three-dimensional feature representation method. The training method comprises the following steps: acquiring a multi-modal training data set, wherein the multi-modal training data set comprises a three-dimensional model, a multi-angle rendering image set corresponding to the three-dimensional model and a category text tree; respectively extracting a first point cloud feature vector of the three-dimensional model, a first image feature vector of the multi-angle rendering image set and a first text feature vector of the category text tree through an initial feature extraction network; determining, by a cross-modal joint condition modeling network, a joint modal feature vector based on the first image feature vector and the first text feature vector; and training the initial feature extraction network based on the first point cloud feature vector, the first text feature vector and the joint modal feature vector to obtain a target feature extraction network. The invention can solve the problems of information degradation and insufficient synergy existing in the prior art.

Description

Training method of multi-mode feature extraction network and three-dimensional feature representation method
Technical Field
The invention relates to the technical field of neural networks, in particular to a training method of a multi-modal feature extraction network and a three-dimensional feature representation method.
Background
Characterization learning of three-dimensional models is becoming increasingly important in various practical applications such as augmented/virtual reality and autonomous driving. At present, conventional schemes generally adopt generative, self-supervised and similar approaches to acquire additional three-dimensional data and use it for neural network training, so that the trained neural network realizes the feature representation of a three-dimensional model; however, such schemes do not essentially improve the information content of the training data and cannot expand the upper limit of the three-dimensional feature representation capability.
To improve on this technical problem, the related art proposes three-dimensional characterization learning using a two-dimensional alignment strategy, but this technology has two important limitations: (1) aligning the three-dimensional representation with a single-view image and coarse text causes the loss of critical spatial and depth information, resulting in information degradation; (2) aligning the three-dimensional representation with the image features and the text features separately omits joint modeling of the visual and linguistic modalities, resulting in insufficient synergy between the image features and the text features and, ultimately, incomplete information utilization.
Disclosure of Invention
In view of the above, the present invention aims to provide a training method and a three-dimensional feature representation method for a multi-modal feature extraction network, which can solve the problems of information degradation, insufficient collaboration and the like existing in the prior art.
In a first aspect, an embodiment of the present invention provides a training method for a multi-modal feature extraction network, including:
acquiring a multi-modal training data set; the multi-modal training data set comprises a three-dimensional model, a multi-angle rendering image set corresponding to the three-dimensional model and a category text tree;
respectively extracting a first point cloud feature vector of the three-dimensional model, a first image feature vector of the multi-angle rendering image set and a first text feature vector of the category text tree through an initial feature extraction network;
determining, by a cross-modal joint condition modeling network, a joint modal feature vector based on the first image feature vector and the first text feature vector;
training the initial feature extraction network based on the first point cloud feature vector, the first text feature vector and the joint modal feature vector to obtain a target feature extraction network.
In a second aspect, an embodiment of the present invention further provides a three-dimensional feature representing method, including:
acquiring a target three-dimensional model to be processed, a target multi-angle rendering image set corresponding to the target three-dimensional model and a target category text;
extracting a second point cloud feature vector of the target three-dimensional model, a second image feature vector of the target multi-angle rendering image set and a second text feature vector of the target category text through a target feature extraction network obtained through pre-training; the target feature extraction network is trained by the training method of any multi-mode feature extraction network provided in the first aspect;
And taking one or more of the second point cloud feature vector, the second image feature vector and the second text feature vector as a three-dimensional feature representation of the target three-dimensional model.
In a third aspect, an embodiment of the present invention further provides a training device of a multi-modal feature extraction network, including:
the first data acquisition module is used for acquiring a multi-mode training data set; the multi-modal training data set comprises a three-dimensional model, a multi-angle rendering image set corresponding to the three-dimensional model and a category text tree;
the first feature extraction module is used for respectively extracting a first point cloud feature vector of the three-dimensional model, a first image feature vector of the multi-angle rendering image set and a first text feature vector of the category text tree through an initial feature extraction network;
a joint feature determination module configured to determine, by a cross-modal joint condition modeling network, a joint modal feature vector based on the first image feature vector and the first text feature vector;
and the network training module is used for training the initial feature extraction network based on the first point cloud feature vector, the first text feature vector and the joint mode feature vector to obtain a target feature extraction network.
In a fourth aspect, an embodiment of the present invention further provides a three-dimensional feature representing apparatus, including:
the second data acquisition module is used for acquiring a target three-dimensional model to be processed, a target multi-angle rendering image set corresponding to the target three-dimensional model and a target category text;
the second feature extraction module is used for extracting a second point cloud feature vector of the target three-dimensional model, a second image feature vector of the target multi-angle rendering image set and a second text feature vector of the target category text through a target feature extraction network obtained through pre-training; the target feature extraction network is trained by the training method of any multi-mode feature extraction network provided in the first aspect;
and the feature representation determining module is used for taking one or more of the second point cloud feature vector, the second image feature vector and the second text feature vector as a three-dimensional feature representation of the target three-dimensional model.
In a fifth aspect, an embodiment of the present invention further provides an electronic device, including a processor and a memory storing computer-executable instructions executable by the processor to implement the method of any one of the first and second aspects.
In a sixth aspect, embodiments of the present invention also provide a computer-readable storage medium storing computer-executable instructions that, when invoked and executed by a processor, cause the processor to implement the method of any one of the first and second aspects.
According to the training method and device for the multi-modal feature extraction network, firstly, a multi-modal training data set comprising a three-dimensional model, a multi-angle rendering image set corresponding to the three-dimensional model and a category text tree is obtained, a first point cloud feature vector of the three-dimensional model, a first image feature vector of the multi-angle rendering image set and a first text feature vector of the category text tree are respectively extracted through an initial feature extraction network, then a joint modal feature vector is determined based on the first image feature vector and the first text feature vector through a cross-modal joint condition modeling network, and finally the initial feature extraction network is trained based on the first point cloud feature vector, the first text feature vector and the joint modal feature vector, so that a target feature extraction network is obtained. The method enriches the representation of visual and language modes by introducing the multi-angle rendering image set and the category text tree, so that the upper limit of the characteristic representation capability of the three-dimensional model is improved, and the problem of information degradation is solved; in addition, after the multi-mode feature vector is extracted through the initial feature extraction network, a cross-mode joint condition modeling network is utilized, and the joint mode feature vector is determined based on the first image feature vector and the first text feature vector, so that language knowledge is integrated into the visual mode to model the joint mode, and the problem of insufficient synergy is solved; and finally, training the initial feature extraction network by using the first point cloud feature vector, the first text feature vector and the joint mode feature vector, so that the target feature extraction network obtained by training obtains unified characterization of point cloud, text and images when the feature extraction is performed.
The three-dimensional feature representation method and device provided by the embodiment of the invention are characterized in that a target three-dimensional model to be processed, a target multi-angle rendering image set corresponding to the target three-dimensional model and a target category text are firstly obtained; then, a target feature extraction network is obtained through training by a training method of the multi-mode feature extraction network, and a second point cloud feature vector of a target three-dimensional model, a second image feature vector of a target multi-angle rendering image set and a second text feature vector of a target category text are extracted; and finally, one or more of the second point cloud feature vector, the second image feature vector and the second text feature vector are used as three-dimensional feature representation of the target three-dimensional model. According to the method, the target feature extraction network obtained through training by the training method of the multi-modal feature extraction network is utilized to extract multi-modal features of the target three-dimensional model, so that the point cloud features have uniform semantic expression with the image features and the text features, and the feature representation of the target three-dimensional model is better realized.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
In order to make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a training method of a multi-modal feature extraction network according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a network structure according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a three-dimensional feature representation method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a target feature extraction network according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a partial image point cloud regression result according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a training device of a multi-modal feature extraction network according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a three-dimensional feature representation apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described in conjunction with the embodiments, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
At present, the characterization learning of three-dimensional models is becoming more and more important in various practical applications such as augmented/virtual reality and autonomous driving. However, unlike image-text pairs, which are available in massive quantities, three-dimensional data suffers from scarcity and insufficient category characterization, and these problems severely limit the development of three-dimensional understanding. Early work tended to focus only on the three-dimensional data itself, using generative and self-supervised approaches to acquire some additional three-dimensional data. These methods often use a mask-and-generate idea to enhance the expressive power of three-dimensional model features, that is, portions of the point cloud are hidden, and understanding of the overall model is completed by predicting these hidden points. However, these approaches do not essentially increase the information content of the training data and do not extend the upper limit of the three-dimensional feature representation capability.
Existing work aims to improve three-dimensional characterization by using a large-scale visual-language model (e.g., CLIP, Contrastive Language-Image Pre-training) to address the problem of insufficient data. The basic principle of these methods is to align three-dimensional features with the unified vision-language space, thereby gaining from the powerful zero-sample capability of the underlying model. These methods typically render images of a three-dimensional model from a specific angle, accompany them with a simple class label, and input them into CLIP. The three-dimensional features are then aligned with the vision-language space by means of contrast learning. This strategy of incorporating rich external information has been demonstrated to be effective in enhancing three-dimensional understanding capability and to exhibit good transferability, as demonstrated by the ULIP (Unified Representation of Language, Images and Point Clouds for 3D Understanding) and CG3D studies.
However, these methods mainly employ two-dimensional alignment strategies for three-dimensional characterization learning and fail to consider the unique characteristics of three-dimensional models, and thus face two important limitations: 1) Information degradation: aligning the three-dimensional representation with a single-view image and coarse text results in the loss of critical spatial and depth information. For example, a single front-view rendering of an aircraft lacks wing details, and the same holds for a back-view rendering. Furthermore, from a textual point of view, the generic term "aircraft" is not sufficient to distinguish a passenger aircraft from other types of aircraft, such as jets or bombers; 2) Insufficient synergy: these methods align the three-dimensional representation with the image features and the text features separately, ignoring joint modeling of the visual and linguistic modalities. This complicates the optimization process of the three-dimensional representation, affecting both the optimization direction and the optimization degree of the three-dimensional features, and finally results in incomplete information utilization.
Based on the above, embodiments of the present invention provide a training method for a multi-modal feature extraction network and a three-dimensional feature representation method, which can solve the problems of information degradation and insufficient synergy in the prior art.
For the convenience of understanding the present embodiment, first, a training method of a multi-modal feature extraction network disclosed in the present embodiment will be described in detail, referring to a flowchart of a training method of a multi-modal feature extraction network shown in fig. 1, the method mainly includes the following steps S102 to S108:
step S102, a multi-modal training data set is acquired.
The multi-modal training data set comprises a three-dimensional model, a multi-angle rendering image set corresponding to the three-dimensional model and a category text tree; the multi-angle rendering image set is a set of rendering images obtained by rendering the three-dimensional model according to a plurality of rendering angles, and the rendering images can comprise color rendering images and/or depth images; a category text tree may be understood as a hierarchical tree structure, which includes a plurality of parent category texts, each of which includes a plurality of child category texts.
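Purely for illustration, the following sketch shows one way the multi-modal training sample and the category text tree might be organized in Python; all names, paths and categories here are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical category text tree: each parent category text maps to its child category texts.
CATEGORY_TREE: Dict[str, List[str]] = {
    "airplane": ["jet", "bomber", "airliner"],
    "chair": ["armchair", "folding chair", "swivel chair"],
}

@dataclass
class MultiModalSample:
    point_cloud_path: str       # three-dimensional model stored as a point cloud
    rendered_views: List[str]   # multi-angle rendering image set (file paths)
    parent_category: str        # coarse-grained parent category text
    child_category: str         # fine-grained child category text

def lookup_parent(child: str) -> str:
    """Return the parent category text to which a given child category belongs."""
    for parent, children in CATEGORY_TREE.items():
        if child in children:
            return parent
    raise KeyError(child)
```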
Step S104, respectively extracting a first point cloud feature vector of the three-dimensional model, a first image feature vector of the multi-angle rendering image set and a first text feature vector of the category text tree through the initial feature extraction network.
The initial feature extraction network comprises a point cloud feature extraction sub-network and a structured multi-modal data organization sub-network (SMO for short). A point cloud backbone network (such as a 3D encoder) is arranged in the point cloud feature extraction sub-network; the SMO sub-network comprises an image unit and a text unit, an image encoder is arranged in the image unit, and a text encoder is arranged in the text unit.
In one embodiment, a first point cloud feature vector of a three-dimensional model may be extracted using a 3D encoder disposed within a point cloud extraction sub-network; in addition, the image unit is utilized to carry out proximity sampling on the multi-angle rendering image set so as to obtain a multi-angle rendering image subset, and a first image feature vector of each rendering image in the multi-angle rendering image subset is extracted through the image encoder; in addition, a text unit is utilized to determine a target parent category text and a target sub-category text to which the three-dimensional model belongs based on the category text tree, and a text encoder is utilized to extract first text features of the target sub-category text.
Step S106, determining a joint mode feature vector based on the first image feature vector and the first text feature vector through the cross-mode joint condition modeling network.
The cross-modal joint condition modeling network (JMA for short) is used to integrate language knowledge into the visual modality so as to model the joint modality.
In one embodiment, matrix multiplication can be performed on the first image feature vector and the first text feature vector through a JMA network to obtain a confusion matrix of probability distribution of the text relative to the first image feature vectors with different rendering angles, and then the confusion matrix is used for weighting the first image feature vectors with different rendering angles, so that a text-image cross-mode joint condition probability feature (simply referred to as joint mode feature vector) after text intervention can be obtained.
Step S108, training the initial feature extraction network based on the first point cloud feature vector, the first text feature vector and the joint mode feature vector to obtain a target feature extraction network.
In one embodiment, the training process includes two tasks: (1) determining a sub-loss value based on the feature vectors of any two of the first point cloud feature vector, the first text feature vector and the joint modal feature vector, and weighting the resulting sub-loss values to obtain a first target loss value; (2) clustering the point cloud data of the three-dimensional model by utilizing the target parent category, and further determining a second target loss value based on the clustered point cloud data. Further, the initial feature extraction network is trained based on the first target loss value and the second target loss value, and the required target feature extraction network is obtained when training stops.
According to the training method of the multi-mode feature extraction network, which is provided by the embodiment of the invention, the representation of the visual and language modes is enriched by introducing the multi-angle rendering image set and the category text tree, so that the upper limit of the feature representation capability of the three-dimensional model is improved, and the problem of information degradation is solved; in addition, after the multi-mode feature vector is extracted through the initial feature extraction network, a cross-mode joint condition modeling network is utilized, and the joint mode feature vector is determined based on the first image feature vector and the first text feature vector, so that language knowledge is integrated into the visual mode to model the joint mode, and the problem of insufficient synergy is solved; and finally, training the initial feature extraction network by using the first point cloud feature vector, the first text feature vector and the joint mode feature vector, so that the target feature extraction network obtained by training obtains unified characterization of point cloud, text and images when the feature extraction is performed.
The three-dimensional feature representation method (JM3D) based on multi-modal feature aggregation is a unified, simple and effective method that aims to strengthen the feature representation of a three-dimensional model through joint learning of data of different modalities, such as images and texts, so as to solve downstream problems such as three-dimensional model detection and image-to-three-dimensional-model regression. Previous methods often use only three-dimensional models and, limited by the small number of three-dimensional datasets, find it difficult to achieve considerable results. In consideration of these characteristics, the three-dimensional feature representation method based on multi-modal feature aggregation provides two innovative module designs, namely a structured multi-modal data organization sub-network and a cross-modal joint condition modeling network. The two modules start from the organization of data sources and from the modeling principle respectively, and obtain a better three-dimensional feature representation.
For easy understanding, the embodiment of the invention provides a network structure schematic diagram shown in fig. 2, fig. 2 illustrates an SMO sub-network, the SMO sub-network further comprises an image unit and a text unit, an image encoder is arranged in the image unit, and a text encoder is arranged in the text unit; FIG. 2 also illustrates a JMA network for matrix multiplication and point-by-point multiplication; fig. 2 also illustrates a 3D encoder for extracting point cloud feature vectors of a three-dimensional model.
Based on fig. 2, the embodiment of the invention provides a specific implementation manner of a training method of a multi-mode feature extraction network.
For the foregoing step S104, when the step of extracting the first point cloud feature vector, the first image feature vector, and the first text feature vector is performed, the following steps 1 to 2 may be referred to:
and step 1, extracting a first point cloud feature vector of the three-dimensional model through a point cloud feature extraction sub-network. In one embodiment, the step of extracting the first point cloud feature vector of the three-dimensional model may be performed as follows steps 1.1 to 1.3:
and 1.1, determining point cloud data corresponding to the three-dimensional model. In practical application, the embodiment of the invention selects a specific format using the point cloud as a three-dimensional model in consideration of the universality and the differentiability of the three-dimensional format. Wherein, the three-dimensional model is represented as a plurality of points discrete in space, each point is represented by a three-dimensional coordinate, and the point cloud data is represented as C.
And 1.2, carrying out data enhancement processing on the point cloud data to obtain enhanced point cloud data. In one embodiment, certain data enhancements are applied to the point cloud data, such as random rotation, random scaling, point random discarding, point cloud coordinate normalization, etc., and the expression of the data enhancements is as follows:
C′ = DataAugment(C);
wherein C is the point cloud data, C′ is the enhanced point cloud data, and DataAugment(·) denotes the data enhancement processing.
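As a hedged illustration of the data enhancement step, the sketch below applies random rotation, random scaling, random point dropping and coordinate normalization to a point cloud with NumPy; the parameter ranges are assumptions rather than values specified by the patent.

```python
import numpy as np

def data_augment(points: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """C' = DataAugment(C): random rotation, scaling, point dropping, normalization."""
    # Random rotation about the vertical (z) axis.
    theta = rng.uniform(0.0, 2.0 * np.pi)
    rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                    [np.sin(theta),  np.cos(theta), 0.0],
                    [0.0,            0.0,           1.0]])
    points = points @ rot.T
    # Random isotropic scaling (assumed range).
    points = points * rng.uniform(0.8, 1.2)
    # Randomly discard a fraction of the points (assumed 10%).
    keep = rng.random(points.shape[0]) > 0.1
    points = points[keep]
    # Normalize coordinates to the unit sphere.
    points = points - points.mean(axis=0)
    points = points / (np.linalg.norm(points, axis=1).max() + 1e-8)
    return points
```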
And 1.3, extracting a first point cloud feature vector of the enhanced point cloud data through the point cloud feature extraction sub-network. In one embodiment, the first point cloud feature vector of the enhanced point cloud data may be extracted through the point cloud backbone network. As shown in fig. 2, the 3D encoder may be used to extract the first point cloud feature vector h_C of the enhanced point cloud data.
Step 2, performing proximity sampling processing on the multi-angle rendering image set through a structured multi-mode data organization sub-network to obtain a multi-angle rendering image subset, and extracting a first image feature vector of the multi-angle rendering image subset; and determining a target parent category text and a target sub-category text to which the three-dimensional model belongs from the category text tree, and extracting a first text feature vector of the target sub-category text.
In order to facilitate understanding of step 2, embodiments of the present invention provide an implementation manner of extracting a first image feature vector and a first text feature vector through an SMO sub-network, respectively. See modes one to two below:
in one mode, a first image feature vector is extracted, see steps 2.1 to 2.4 below:
step 2.1, performing proximity sampling processing on the multi-angle rendering image set according to a preset angle threshold value to obtain a multi-angle rendering image subset; wherein, the difference value of rendering angles between two adjacent rendering images in the multi-angle rendering image subset is smaller than a preset angle threshold.
On the image module, embodiments of the present invention render multiple images (i.e., render images) from different angles for each three-dimensional model. In practical design, the embodiment of the invention performs rendering every 12 degrees, and 30 color rendering images can be obtained in total; meanwhile, the embodiment of the invention performs rendering of the depth image, and 30 corresponding depth images are obtained; the color-rendered image and the depth image are taken as a multi-angle-rendered image set.
In the actual training process, the embodiment of the invention designs a proximity sampling mode, namely, for a multi-angle rendering image set in each training process, the sampling result of the embodiment of the invention ensures that the rendering angle difference value of each rendering image is within a certain range (namely, the preset angle threshold value), namely:
[I_1, …, I_v] ⊆ C_I, |angle(I_i) − angle(I_j)| < ω;
wherein [I_1, …, I_v] is the multi-angle rendering image subset, C_I is the multi-angle rendering image set, I_i and I_j are two adjacent rendered images in the multi-angle rendering image subset, and ω is the preset angle threshold.
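A minimal sketch of the proximity sampling described above, assuming the views are rendered every 12 degrees; the subset size and the threshold ω used here are illustrative only.

```python
import random

def proximity_sample(view_angles, num_views: int = 4, omega: float = 48.0):
    """Sample a contiguous subset of rendering angles whose pairwise angular
    difference stays below the preset angle threshold omega (in degrees)."""
    angles = sorted(view_angles)                  # e.g. [0, 12, 24, ..., 348]
    start = random.randrange(len(angles))
    subset = [angles[(start + k) % len(angles)] for k in range(num_views)]
    # Adjacent sampled views are 12 degrees apart (modulo 360), so the angular
    # spread of the subset is (num_views - 1) * 12 degrees, below omega here.
    return subset

# Example: sample 4 neighbouring views out of the 30 rendered views.
views = proximity_sample(range(0, 360, 12))
```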
Step 2.2 extracts an initial image feature vector for each of the rendered images in the subset of multi-angle rendered images.
In one embodiment, after a certain number of rendered images are sampled during each training process, embodiments of the present invention use a pre-trained image encoder to extract the corresponding initial image feature vector for each rendered image.
And 2.3, determining an angle position code and a depth map position code corresponding to each rendering image according to the rendering angle of each rendering image in the multi-angle rendering image subset. Wherein the angular position code identifies that the initial image feature vector for each of the rendered images is from a different angle and the depth map position code identifies whether the initial image feature vector for each of the rendered images contains color information or depth information.
And 2.4, marking the angle position codes and the depth map position codes to the initial image feature vectors of each rendering image to obtain first image feature vectors of each rendering image.
In one embodiment, unique angular position codes and depth map position codes may be applied to the initial image feature vectors corresponding to the rendered images at different rendering angles to obtain the first image feature vector h_I of each rendered image. In practical applications, a matrix is randomly initialized, and a corresponding vector is selected from the matrix according to the rendering angle of the rendered image as the code to be added; for example, for the rendered image rendered from 0°, the first column vector of the matrix is added to the initial image feature vector of that rendered image. Specifically, the first image feature vector is as follows:
h_I^{i_v} = ĥ_I^{i_v} + ε_degree + ε_depth;
wherein h_I^{i_v} is the first image feature vector of rendered image I_{i_v}, ĥ_I^{i_v} is its initial image feature vector, ε_degree is the angular position code, and ε_depth is the depth map position code.
Further, after determining the first image feature vector of each rendered image, normalization may be performed on all the first image feature vectors for each rendering angle using a layer normalization method.
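The following PyTorch-style sketch illustrates, under stated assumptions, how the angular position code and the depth map position code might be added to the initial image feature vectors and then layer-normalized; the embedding sizes and the use of learned embedding tables are assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class ViewPositionEncoding(nn.Module):
    """Adds an angular position code and a depth map position code to each
    initial image feature vector, then applies layer normalization."""
    def __init__(self, num_angles: int = 30, feat_dim: int = 512):
        super().__init__()
        self.angle_embed = nn.Embedding(num_angles, feat_dim)  # epsilon_degree
        self.depth_embed = nn.Embedding(2, feat_dim)           # epsilon_depth: 0 = color, 1 = depth
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, feats: torch.Tensor, angle_idx: torch.Tensor,
                is_depth: torch.Tensor) -> torch.Tensor:
        # feats: (V, D) initial image features for V sampled views;
        # angle_idx, is_depth: (V,) integer indices for each view.
        h = feats + self.angle_embed(angle_idx) + self.depth_embed(is_depth)
        return self.norm(h)
```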
Mode two, extracting a first text feature vector:
on a text module, the embodiment of the invention designs a group of hierarchical tree classification methods for each three-dimensional model. For each three-dimensional model, it is first divided into a broad class, e.g., a coarser class text for aircraft, beds, etc., as a parent class. Then, for each parent category, it is again continued to be assigned a finer child category. The parent and child categories may be labeled T, respectively p ,T s
Therefore, through the structured multi-mode data organization module, the embodiment of the invention constructs a group of triplet data which is respectively composed of a three-dimensional model, a multi-angle rendering image set thereof and a category text tree thereof, and the method is specifically as follows:
S_i: ([I_i1, …, I_iv], [T_p, T_s], C_i);
wherein S_i is the triplet data group, [I_i1, …, I_iv] is the multi-angle rendering image set, [T_p, T_s] is the category text tree, and C_i is the three-dimensional model.
Likewise, the embodiment of the invention uses the text encoder in the pre-trained model to encode the sub-category text and obtain the first text feature vector h_T.
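For illustration only, the sketch below assembles one triplet S_i and encodes its sub-category text with a generic pre-trained text encoder passed in as a callable; the interfaces are assumptions and are not defined by the patent.

```python
from typing import Callable, List, Tuple
import numpy as np

# Hypothetical triplet: (multi-angle rendered views, [parent, child] texts, point cloud).
Triplet = Tuple[List[np.ndarray], Tuple[str, str], np.ndarray]

def build_triplet(point_cloud: np.ndarray,
                  rendered_views: List[np.ndarray],
                  parent_text: str,
                  child_text: str) -> Triplet:
    """Organize one training sample S_i as produced by the SMO sub-network."""
    return rendered_views, (parent_text, child_text), point_cloud

def encode_subcategory(text_encoder: Callable[[str], np.ndarray],
                       child_text: str) -> np.ndarray:
    """h_T: first text feature vector extracted from the sub-category text."""
    return text_encoder(child_text)
```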
Through the above steps 1 to 2, the feature representations of the three modalities, namely image, text and three dimensions, can be obtained:
(h_C^i, h_I^i, h_T^i);
wherein h_C^i, h_I^i and h_T^i are respectively the first point cloud feature vector of the i-th three-dimensional model C_i, the first image feature vector of its rendered images I_i, and the first text feature vector of its sub-category text T_i.
For the foregoing step S106, when the step of determining the joint modality feature vector is performed, considering that the object of the present invention may be abstracted to obtain a joint probability distribution of point cloud, image, text, it may be written as:
P(C, I, T) = P(C|I, T) · P(I, T);
since the joint probability distribution of image-text comes from a pre-trained graphic element (i.e., the image element and the text element), the above formula is equivalent to:
The formula can be expressed as a joint probability distribution of three modalities for any one three-dimensional model, which is equivalent to the distribution of all text multiplied by the probability distribution of all rendered images under the text condition. Based on this formula, the joint modality feature vector may be determined as follows steps a through b:
step a, performing matrix multiplication on the first image feature vector and the first text feature vector through the cross-modal joint condition modeling network to obtain a confusion matrix; wherein the confusion matrix is used to characterize the probability distribution of the target sub-category text relative to the first image feature vectors at different rendering angles.
In particular implementation, the first text feature vector h_T may be matrix-multiplied with the first image feature vectors h_I corresponding to the rendered images at different rendering angles to obtain the confusion matrix.
And b, weighting the first image feature vectors with different rendering angles by using a confusion matrix through a cross-modal joint condition modeling network to obtain joint modal feature vectors.
In specific implementation, the joint modal feature vector h_J may be determined according to the following formula:
h_J^i = M^i · h_I^i, with M^i = h_T^i · (h_I^i)^T;
wherein h_J^i is the joint modal feature vector of the i-th three-dimensional model C_i and M^i is the confusion matrix.
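The sketch below is one possible reading of the JMA computation described in steps a and b: the text feature and the per-view image features are matrix-multiplied into a view-wise distribution, which then weights the view features; the softmax normalization and the tensor shapes are assumptions.

```python
import torch

def joint_modal_feature(h_text: torch.Tensor, h_images: torch.Tensor) -> torch.Tensor:
    """Cross-modal joint condition modeling (JMA), illustrative sketch.

    h_text:   (D,)   first text feature vector of the sample
    h_images: (V, D) first image feature vectors for V rendering angles
    returns:  (D,)   joint modal feature vector h_J
    """
    # Confusion matrix: distribution of the text over the rendered views.
    logits = h_images @ h_text              # (V,)
    weights = torch.softmax(logits, dim=0)  # assumed normalization into probabilities
    # Weight the per-view image features by this distribution.
    return weights @ h_images               # (D,)
```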
For the steps S104 to S106, structured multi-mode data is constructed starting from the data and modeling principles respectively, and the characteristic intensity of the image-text data is supplemented. And then, constructing unified features of a joint mode through joint condition modeling, and enabling the three-dimensional features to have semantic representation unified with the graphic features through a contrast learning mode.
For the aforementioned step S108, when the step of training the initial feature extraction network is performed, reference may be made to (1) to (3) below:
(1) And combining the feature vectors of any two modes in the first point cloud feature vector, the first text feature vector and the joint mode feature vector to obtain a combined feature vector, and determining a first target loss value based on each combined feature vector by utilizing a contrast learning algorithm.
In practical application, because the three-dimensional feature representation method (JM3D) based on multi-modal feature aggregation is a cross-modal joint alignment method, the embodiment of the invention aligns the three-dimensional representation, image and text modalities in the training strategy. The embodiment of the invention selects contrast learning to pull together the abstract distances of different semantic spaces. The contrast learning method can be expressed abstractly as follows:
L_(M1, M2) = −(1/N) Σ_i log( exp(h_M1^i · h_M2^i / τ) / Σ_j exp(h_M1^i · h_M2^j / τ) );
wherein M1 and M2 represent arbitrary modalities, h_M^i is the feature vector of the i-th sample under modality M, τ is a temperature coefficient, and N is the number of training samples.
On this basis, the present embodiment provides an implementation of determining the first target loss value, see (1.1) to (1.2) below:
(1.1) determining a first sub-loss value based on the first point cloud feature vector and the first text feature vector using a contrast learning algorithm; and determining a second sub-loss value based on the first point cloud feature vector and the joint modality feature vector; and determining a third sub-penalty value based on the first text feature vector and the joint modality feature vector. In one embodiment, any two modalities may be selected for combination and semantic representations among different modalities are pulled in a manner that minimizes the distance between positive samples and maximizes the distance between negative samples.
In specific implementation, the first point cloud feature vector and the first text feature vector are substituted into the above formula to determine the sub-loss value between the point cloud and text modalities; similarly, the first point cloud feature vector and the joint modal feature vector are substituted into the above formula to determine the sub-loss value between the point cloud and joint modalities; and the first text feature vector and the joint modal feature vector are substituted into the above formula to determine the sub-loss value between the text and joint modalities.
(1.2) weighting the first sub-loss value, the second sub-loss value and the third sub-loss value to obtain the first target loss value. In a specific implementation, the first target loss value may be determined according to the following formula:
L_contrastive = λ_1 · L_1 + λ_2 · L_2 + λ_3 · L_3;
wherein L_contrastive is the first target loss value, L_1, L_2 and L_3 are respectively the first, second and third sub-loss values, and λ_1, λ_2 and λ_3 are the weight coefficients of the corresponding sub-loss values. From this formula, it can be seen that the final contrast learning loss comes from the alignment of the point cloud features with the text features, of the point cloud features with the joint features, and of the joint features with the text features.
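As an illustrative sketch under common assumptions (an InfoNCE-style symmetric loss with a temperature parameter), the pairwise contrastive loss and the weighted first target loss might be implemented as follows; the temperature value and the weight coefficients are placeholders, not values specified by the patent.

```python
import torch
import torch.nn.functional as F

def contrastive_pair_loss(h_a: torch.Tensor, h_b: torch.Tensor,
                          tau: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss between two batches of modality features (N, D)."""
    h_a = F.normalize(h_a, dim=-1)
    h_b = F.normalize(h_b, dim=-1)
    logits = h_a @ h_b.t() / tau
    targets = torch.arange(h_a.size(0), device=h_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def first_target_loss(h_c, h_t, h_j, lambdas=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """Weighted sum of the point-cloud/text, point-cloud/joint and text/joint sub-losses."""
    l1 = contrastive_pair_loss(h_c, h_t)   # point cloud vs. text
    l2 = contrastive_pair_loss(h_c, h_j)   # point cloud vs. joint modality
    l3 = contrastive_pair_loss(h_t, h_j)   # text vs. joint modality
    return lambdas[0] * l1 + lambdas[1] * l2 + lambdas[2] * l3
```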
(2) And clustering the point cloud data corresponding to the three-dimensional model based on the target parent class text, and determining a second target loss value based on the clustered point cloud data.
In one embodiment, in order to enable the point cloud features to complete clustering under the guidance of the coarse-grained parent category texts, the embodiment of the invention uses the parent category texts as labels to guide the point cloud feature vectors through a simple classification task, as follows:
T̂_p^i = θ(h_C^i);
wherein h_C^i is the first point cloud feature vector of the i-th three-dimensional model C_i, T_p^i is its parent category text (used as the label), and θ is the fully connected layer.
In specific implementation, point cloud features having the same parent category text are forced into the same set by this task, and the second target loss value L_closed is calculated according to the following formula:
L_closed = −(1/N) Σ_{i=1}^{N} log( Softmax(θ(h_C^i))_{T_p^i} );
where Softmax represents the normalization function and N is the number of training samples.
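A minimal sketch, assuming the parent categories are encoded as integer labels and θ is a single fully connected layer, of the second target loss value and the total training loss; the feature dimension and class count are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParentCategoryHead(nn.Module):
    """Fully connected layer theta that classifies point cloud features
    into coarse-grained parent categories."""
    def __init__(self, feat_dim: int = 512, num_parent_categories: int = 55):
        super().__init__()
        self.theta = nn.Linear(feat_dim, num_parent_categories)

    def forward(self, h_c: torch.Tensor) -> torch.Tensor:
        return self.theta(h_c)  # (N, num_parent_categories) logits

def second_target_loss(head: ParentCategoryHead, h_c: torch.Tensor,
                       parent_labels: torch.Tensor) -> torch.Tensor:
    """Softmax cross-entropy over N training samples (L_closed)."""
    return F.cross_entropy(head(h_c), parent_labels)

def total_loss(l_contrastive: torch.Tensor, l_closed: torch.Tensor) -> torch.Tensor:
    """Total training loss: sum of the first and second target loss values."""
    return l_contrastive + l_closed
```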
(3) And training the initial feature extraction network by utilizing the sum of the first target loss value and the second target loss value to obtain the target feature extraction network. In one embodiment, the sum of the first target loss value L_contrastive and the second target loss value L_closed is taken as the total loss value to train the initial feature extraction network, so as to obtain the target feature extraction network.
Further, during the testing process, different encoders are selected to be used according to the requirements of different tasks. For example, in the case of a zero sample three-dimensional detection task, the embodiment of the invention uses the trained three-dimensional features and text features to perform similarity calculation at the same time, and selects the text with the highest similarity as the category of the three-dimensional model.
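For the zero-sample detection example above, the following hedged sketch matches a trained point cloud feature against candidate category text features by cosine similarity; the interface is illustrative only.

```python
import torch
import torch.nn.functional as F

def zero_sample_classify(h_point: torch.Tensor, text_features: torch.Tensor,
                         category_names: list) -> str:
    """Pick the category text whose feature is most similar to the 3D feature.

    h_point:       (D,)   trained point cloud feature of one model
    text_features: (K, D) text features of the K candidate category texts
    """
    sims = F.cosine_similarity(h_point.unsqueeze(0), text_features, dim=-1)  # (K,)
    return category_names[int(sims.argmax())]
```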
In summary, the embodiment of the invention provides a multi-view joint-modality three-dimensional feature modeling method based on multi-modal feature aggregation, which is used to obtain a unified characterization of point cloud, text and images. In particular, the embodiment of the present invention provides a novel structured multi-modal data organization (SMO) to address the information degradation issue. This module enriches the representation of the visual and linguistic modalities by introducing continuous multi-view images and layered text, thereby increasing the upper limit of the three-dimensional model feature representation capability. In addition, the embodiment of the invention designs a cross-modal joint condition modeling module (JMA) to solve the problem of insufficient synergy, which models the joint modality by integrating language knowledge into the visual modality. The joint modal features, the text features and the point cloud features are then fed into the contrast learning method, so that the three-dimensional features are aligned with the text features and the image features respectively, thereby obtaining a unified characterization of point cloud, text and images. The method achieves the best effect in a plurality of downstream tasks and actual scenes.
On the basis of the foregoing embodiment, the embodiment of the present invention further provides a three-dimensional feature representation method, referring to a schematic flow chart of the three-dimensional feature representation method shown in fig. 3, the method mainly includes the following steps S302 to S306:
step S302, a target three-dimensional model to be processed, a target multi-angle rendering image set corresponding to the target three-dimensional model and a target category text are obtained.
Step S304, extracting a second point cloud feature vector of the target three-dimensional model, a second image feature vector of the target multi-angle rendering image set and a second text feature vector of the target category text through a target feature extraction network obtained through pre-training; the target feature extraction network is obtained by training with the training method of the multi-modal feature extraction network described above.
In one embodiment, referring to the schematic diagram of the structure of a target feature extraction network shown in fig. 4, the target feature extraction network mainly includes a 3D encoder, an image encoder, and a text encoder. Based on this, the second point cloud feature vector of the target three-dimensional model may be extracted by the 3D encoder in the target feature extraction network, the second image feature vector of the target multi-angle rendered image set may be extracted by the image encoder, and the second text feature vector of the target category text may be extracted by the text encoder.
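As an illustrative sketch only, inference with the target feature extraction network might look as follows, assuming the 3D encoder, image encoder and text encoder are available as callables; these interfaces are assumptions and not part of the patent.

```python
from typing import Callable, List, Optional
import torch

def three_dimensional_feature_representation(
        point_encoder: Callable[[torch.Tensor], torch.Tensor],
        image_encoder: Callable[[torch.Tensor], torch.Tensor],
        text_encoder: Callable[[str], torch.Tensor],
        point_cloud: torch.Tensor,
        rendered_views: torch.Tensor,
        category_text: Optional[str] = None) -> List[torch.Tensor]:
    """Extract one or more of the second point cloud / image / text feature vectors."""
    features = [point_encoder(point_cloud), image_encoder(rendered_views)]
    if category_text is not None:
        features.append(text_encoder(category_text))
    return features
```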
And step S306, one or more of the second point cloud feature vector, the second image feature vector and the second text feature vector are used as three-dimensional feature representation of the target three-dimensional model.
According to the three-dimensional feature representation method provided by the embodiment of the invention, the multi-modal feature of the target three-dimensional model is extracted by utilizing the target feature extraction network obtained by training the multi-modal feature extraction network by using the training method, so that the point cloud features have uniform semantic expression with the image features and the text features, and the feature representation of the target three-dimensional model is better realized.
To verify the feature representation effect of the target feature extraction network described above, embodiments of the present invention use ShapeNet55 as the pre-training dataset, which is a public subset of ShapeNet. ShapeNet55 includes 52500 CAD models with multiple texture maps and corresponding class annotations. These annotations total 55 base categories and 205 fine-grained subcategories, with a small portion of the models lacking subcategories. During training, embodiments of the present invention randomly sample different numbers of points from each CAD model to accommodate different networks. For testing, ModelNet40 and ScanObjectNN are used as the test datasets. ModelNet40 consists of synthetic 3D CAD models, including 9843 training samples and 2468 test samples spanning 40 classes; for testing, the embodiments of the present invention downsample the point cloud data to 1024 points. ScanObjectNN, unlike ModelNet40, is a 3D object dataset scanned from real scenes, comprising 2902 samples in 15 categories. It can be divided into two variants according to whether the background is contained: OBJ_ONLY and OBJ_BG; the former refers to clean objects and the latter includes background noise. Here, the embodiment of the present invention follows the ULIP protocol, using the pre-processed data, normalized and downsampled to 1024 points.
The embodiment of the invention adopts the zero-sample three-dimensional detection task as the reference task, uses accuracy to evaluate the actual effect of the model, and compares it with other existing methods, as shown in Table 1 below.
TABLE 1
Wherein Top-1 is defined as the accuracy of the resulting calculation using the highest probability in the candidate result set; top-5 is defined as the accuracy of the resulting calculation using the first five probabilities of the candidate result set being the greatest.
The zero-sample three-dimensional object classification results on ModelNet40 and ScanObjectNN are shown in Table 1. First, the method of the present embodiment is superior to the previous SOTA method ULIP [23] in top-1 accuracy on all 3D backbones, with improvements of varying degrees: top-1 accuracy is improved by 4.3% on the "all" dataset, by 12.3% on the "medium" dataset, and by 13.7% on the "difficult" dataset. This indicates the superiority of JM3D. The present embodiment also demonstrates the effectiveness of JM3D on ScanObjectNN: the JM3D+PointMLP method of the embodiments of the present invention outperforms ULIP by 2.9% in top-1 accuracy. In summary, the embodiments of the present invention achieve more impressive progress than the former SOTA method ULIP [46]. This shows that JM3D has good generalization performance and performs better in actual scan scenes.
In addition, see the schematic diagram of partial image point cloud regression results shown in fig. 5. Notably, JM3D imparts enhanced cross-modal capabilities to the underlying point cloud model. In addition to the quantitative analysis on the language modality performed above, the embodiment of the invention mainly shows qualitative results of the image interaction capability realized by JM3D. Embodiments of the present invention collect images from the real-world image dataset Caltech101 to retrieve 3D models in the ModelNet40 test set, a medium-scale dataset with over 2500 models across 40 categories. At the same time, some more challenging samples are constructed, which typically have unique perspectives, making them difficult for conventional models to identify. Fig. 5 shows the top three retrieved models. In fig. 5, the samples belong to the "airplane" and "notebook" categories, each further divided into two levels: a challenging level (top) and a simple level (bottom). Obviously, for a simple sample, all models show a reasonable level of retrieval. However, ULIP cannot identify an appropriate point cloud when the image is taken from an irregular perspective. In contrast, JM3D trained with two views can identify some correct results, and when the number of images in SMO is increased to 4, JM3D can successfully locate almost all appropriate models. These visualizations show that the model of the present embodiment has learned meaningful features across the visual and 3D modalities. Furthermore, this suggests that while increasing the number of views in SMO may have less impact on the performance of text-based representation, the alignment capability of the model in the image domain can be greatly improved.
In summary, the embodiment of the invention provides a conditional probability modeling method for aggregating multi-modal information to strengthen the feature representation of a three-dimensional model. The embodiment of the invention starts from the data and modeling principles respectively, and provides two networks for respectively strengthening multi-modal data and optimizing multi-modal joint modeling, namely a structured multi-modal data organization (SMO) sub-network and a cross-modal joint condition modeling (JMA) network. The two networks firstly construct structured multi-mode data, and the characteristic intensity of the image-text data is supplemented. And then, constructing unified features of a joint mode through joint condition modeling, and enabling the three-dimensional features to have semantic representation unified with the graphic features through a contrast learning mode.
The embodiment of the invention comprises the following steps: 1) SMO, multi-modal data organization enhancement: given a three-dimensional model, the structured multi-modal data organization module renders the model to obtain a plurality of rendered images with continuous angles; on the other hand, the corresponding parent and child categories of the three-dimensional model are organized as a classification tree. Then, a pre-trained image-text model extracts the features h_I and h_T of the rendered images and the category texts, respectively. Two position encodings are introduced and added to the corresponding image features in order to identify image features from different angles. 2) Three-dimensional point cloud feature extraction: a point cloud network extracts the corresponding feature h_C of the three-dimensional model. 3) JMA, cross-modal joint condition modeling feature: the cross-modal joint condition modeling module constructs a confusion matrix of the image features and the text features through matrix multiplication, and uses this matrix to perform a weighted summation of the image features to obtain the cross-modal joint condition feature h_J. 4) Training by contrast learning: finally, the three-dimensional feature h_C is aligned with the joint feature h_J and the text feature h_T respectively using the contrast learning method, minimizing the distance between the three to optimize the three-dimensional features. In the reasoning process, the trained three-dimensional features are used to adapt to different requirements, such as downstream tasks of three-dimensional detection, three-dimensional segmentation, three-dimensional classification, and image-to-three-dimensional-model regression.
Based on this, the training method of the multi-modal feature extraction network provided by the embodiment of the invention has at least the following characteristics:
1) The embodiment of the invention provides a structured multi-modal data organization (SMO) sub-network to solve the problem of information degradation, and constructs a continuous multi-view rendering image sequence and a hierarchical text tree. By enhancing visual and textual features, SMO compensates for the loss of 3D visual features, ensuring a more comprehensive characterization.
2) To address the problem of insufficient synergy, the embodiment of the invention designs a cross-modal joint condition modeling (JMA) network that combines the text and visual modalities to obtain a joint characterization. This approach avoids suboptimal optimization and promotes a closer coupling of the image-text pairs.
3) The JM3D method provided by the embodiment of the invention achieves state-of-the-art performance in various downstream tasks, especially zero-shot 3D classification. JM3D yields a 4.3% improvement on the ModelNet40 dataset and a 6.5% improvement for PointNet++ without introducing any additional complex structural design.
For the training method of the multi-modal feature extraction network provided in the foregoing embodiment, the embodiment of the present invention provides a training device of the multi-modal feature extraction network, referring to a schematic structural diagram of the training device of the multi-modal feature extraction network shown in fig. 6, the device mainly includes the following parts:
a first data acquisition module 602, configured to acquire a multimodal training data set; the multi-modal training data set comprises a three-dimensional model, a multi-angle rendering image set corresponding to the three-dimensional model and a category text tree;
The first feature extraction module 604 is configured to extract, through an initial feature extraction network, a first point cloud feature vector of the three-dimensional model, a first image feature vector of the multi-angle rendered image set, and a first text feature vector of the category text tree, respectively;
a joint feature determination module 606, configured to determine, through a cross-modal joint condition modeling network, a joint modal feature vector based on the first image feature vector and the first text feature vector;
the network training module 608 is configured to train the initial feature extraction network based on the first point cloud feature vector, the first text feature vector, and the joint modal feature vector, to obtain a target feature extraction network.
According to the training device of the multi-modal feature extraction network provided by the embodiment of the invention, the representations of the visual and language modalities are enriched by introducing the multi-angle rendering image set and the category text tree, so that the upper limit of the feature representation capability of the three-dimensional model is raised and the problem of information degradation is solved; in addition, after the multi-modal feature vectors are extracted through the initial feature extraction network, the cross-modal joint condition modeling network is used to determine the joint modal feature vector based on the first image feature vector and the first text feature vector, so that language knowledge is integrated into the visual modality to model the joint modality, and the problem of insufficient synergy is solved; finally, the initial feature extraction network is trained with the first point cloud feature vector, the first text feature vector and the joint modal feature vector, so that the trained target feature extraction network yields a unified characterization of point clouds, text and images when performing feature extraction.
In one embodiment, the initial feature extraction network includes a point cloud feature extraction sub-network and a structured multi-modal data organization sub-network; the first feature extraction module 604 is further configured to:
extracting a first point cloud feature vector of the three-dimensional model through a point cloud feature extraction sub-network;
performing proximity sampling processing on the multi-angle rendering image set through a structured multi-mode data organization sub-network to obtain a multi-angle rendering image subset, and extracting a first image feature vector of the multi-angle rendering image subset; and determining a target parent category text and a target sub-category text to which the three-dimensional model belongs from the category text tree, and extracting a first text feature vector of the target sub-category text.
In one implementation, the first feature extraction module 604 is further configured to:
determining point cloud data corresponding to the three-dimensional model;
performing data enhancement processing on the point cloud data to obtain enhanced point cloud data;
and extracting a first point cloud feature vector of the enhanced point cloud data through the point cloud feature extraction sub-network.
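The embodiment does not specify which data enhancement operations are applied to the point cloud data; the following sketch assumes common point cloud augmentations (random rotation, scaling and jitter) purely for illustration.

```python
import numpy as np

def augment_point_cloud(points, rng=None):
    """Illustrative data enhancement for an (N, 3) point cloud; the specific
    augmentations below are assumptions of this sketch, not mandated steps."""
    rng = rng or np.random.default_rng()
    theta = rng.uniform(0.0, 2.0 * np.pi)                  # random rotation about the z-axis
    rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                    [np.sin(theta),  np.cos(theta), 0.0],
                    [0.0, 0.0, 1.0]])
    points = points @ rot.T
    points = points * rng.uniform(0.8, 1.2)                # random uniform scaling
    points = points + rng.normal(0.0, 0.01, points.shape)  # small Gaussian jitter
    return points.astype(np.float32)
```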
In one implementation, the first feature extraction module 604 is further configured to:
according to a preset angle threshold, performing proximity sampling processing on the multi-angle rendering image set to obtain a multi-angle rendering image subset; wherein, the difference value of rendering angles between two adjacent rendering images in the multi-angle rendering image subset is smaller than a preset angle threshold;
Extracting an initial image feature vector of each rendered image in the multi-angle rendered image subset;
determining an angle position code and a depth map position code corresponding to each rendering image according to the rendering angle of each rendering image in the multi-angle rendering image subset;
and labeling the angle position code and the depth map position code to the initial image feature vector of each rendering image to obtain a first image feature vector of each rendering image.
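For illustration of the proximity sampling and position-encoding steps described above, a minimal sketch follows; the sinusoidal form of the angle position code, the binary depth-map flag and the concrete shapes are assumptions of this sketch, not details fixed by the embodiment.

```python
import torch

def sample_adjacent_views(angles, images, angle_threshold=30.0, k=4):
    """Proximity sampling: pick k renderings whose neighbouring rendering angles
    differ by less than the preset threshold (angles in degrees, images (M, C, H, W))."""
    order = torch.argsort(angles)
    angles, images = angles[order], images[order]
    start = torch.randint(0, angles.numel() - k + 1, (1,)).item()
    sub_angles, sub_images = angles[start:start + k], images[start:start + k]
    assert torch.all(sub_angles.diff() < angle_threshold), "renderings too far apart"
    return sub_angles, sub_images

def add_position_codes(view_feats, angles, is_depth_map, dim):
    """Attach an angle position code and a depth-map position code to each
    per-view feature vector of shape (V, dim)."""
    idx = torch.arange(dim // 2, dtype=torch.float32)
    freq = 1.0 / (10000.0 ** (2.0 * idx / dim))
    angle_code = torch.cat([torch.sin(angles[:, None] * freq),
                            torch.cos(angles[:, None] * freq)], dim=-1)  # (V, dim)
    depth_code = is_depth_map[:, None].float().expand(-1, dim)           # (V, dim)
    return view_feats + angle_code + depth_code
```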
In one implementation, the joint feature determination module 606 is further configured to:
performing matrix multiplication on the first image feature vector and the first text feature vector through a cross-modal joint condition modeling network to obtain a confusion matrix; the confusion matrix is used for representing probability distribution of the target subcategory text relative to the first image feature vectors of different rendering angles;
and weighting the first image feature vectors with different rendering angles by using a confusion matrix through a cross-modal joint condition modeling network to obtain joint modal feature vectors.
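Expanding step 3) of the earlier pipeline sketch, the following illustrative function makes the confusion matrix explicit as a probability distribution over the rendering angles; the softmax temperature is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def joint_modal_feature(h_i, h_t, tau=0.07):
    """h_i: (B, V, D) first image feature vectors at V rendering angles;
    h_t: (B, D) first text feature vectors of the target sub-category.
    Returns the (B, D) joint modal feature vectors."""
    # Confusion matrix: probability distribution of the sub-category text
    # over the first image feature vectors of the different rendering angles.
    confusion = torch.softmax(torch.einsum('bd,bvd->bv', h_t, h_i) / tau, dim=-1)  # (B, V)
    # Weighted summation of the per-view image features under that distribution.
    h_j = torch.einsum('bv,bvd->bd', confusion, h_i)
    return F.normalize(h_j, dim=-1)
```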
In one embodiment, the network training module 608 is further configured to:
combining the feature vectors of any two modes in the first point cloud feature vector, the first text feature vector and the joint mode feature vector to obtain a combined feature vector, and determining a first target loss value based on each combined feature vector by utilizing a contrast learning algorithm;
Clustering point cloud data corresponding to the three-dimensional model based on the target parent class text, and determining a second target loss value based on the clustered point cloud data;
and training the initial feature extraction network by utilizing the sum value of the first target loss value and the second target loss value to obtain a target feature extraction network.
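The embodiment does not prescribe a concrete form for the second target loss value; the sketch below assumes a supervised-contrastive style clustering term that pulls together point cloud features whose three-dimensional models share the same target parent category, and shows the sum of the two target loss values used for training.

```python
import torch
import torch.nn.functional as F

def second_target_loss(h_c, parent_ids, tau=0.07):
    """h_c: (B, D) first point cloud feature vectors; parent_ids: (B,) indices of the
    target parent category. Point clouds sharing a parent category are clustered."""
    h = F.normalize(h_c, dim=-1)
    sim = h @ h.T / tau                                                   # (B, B)
    self_mask = torch.eye(h.size(0), dtype=torch.bool, device=h.device)
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float('-inf')),
                                     dim=1, keepdim=True)
    positives = (parent_ids[:, None] == parent_ids[None, :]) & ~self_mask
    return -log_prob[positives].mean()

def total_loss(first_target_loss_value, second_target_loss_value):
    # The initial feature extraction network is trained with the sum of the two values.
    return first_target_loss_value + second_target_loss_value
```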
In one embodiment, the network training module 608 is further configured to:
determining a first sub-loss value based on the first point cloud feature vector and the first text feature vector by using a contrast learning algorithm; and determining a second sub-loss value based on the first point cloud feature vector and the joint modality feature vector; and determining a third sub-penalty value based on the first text feature vector and the joint modality feature vector;
and weighting the first sub-loss value, the second sub-loss value and the third sub-loss value to obtain a first target loss value.
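A compact illustration of weighting the three sub-loss values follows; the symmetric contrastive form of each sub-loss and the equal default weights are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    """Symmetric contrastive sub-loss between two batches of aligned feature vectors."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / tau
    target = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, target) + F.cross_entropy(logits.T, target))

def first_target_loss(h_c, h_t, h_j, weights=(1.0, 1.0, 1.0)):
    l_ct = info_nce(h_c, h_t)   # first sub-loss: point cloud vs. text
    l_cj = info_nce(h_c, h_j)   # second sub-loss: point cloud vs. joint modality
    l_tj = info_nce(h_t, h_j)   # third sub-loss: text vs. joint modality
    return weights[0] * l_ct + weights[1] * l_cj + weights[2] * l_tj
```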
For the three-dimensional feature representing method provided in the foregoing embodiment, the embodiment of the present invention provides a three-dimensional feature representing apparatus, referring to a schematic structural diagram of a three-dimensional feature representing apparatus shown in fig. 7, the apparatus mainly includes the following parts:
the second data obtaining module 702 is configured to obtain a target three-dimensional model to be processed, a target multi-angle rendering image set corresponding to the target three-dimensional model, and a target class text;
a second feature extraction module 704, configured to extract, through a target feature extraction network obtained by pre-training, a second point cloud feature vector of the target three-dimensional model, a second image feature vector of the target multi-angle rendering image set and a second text feature vector of the target category text; wherein the target feature extraction network is obtained through the above training method of the multi-modal feature extraction network;
the feature representation determining module 706 is configured to use one or more of the second point cloud feature vector, the second image feature vector, and the second text feature vector as a three-dimensional feature representation of the target three-dimensional model.
According to the three-dimensional feature representation device provided by the embodiment of the invention, the multi-modal features of the target three-dimensional model are extracted by utilizing the target feature extraction network obtained with the above training method of the multi-modal feature extraction network, so that the point cloud features share a unified semantic expression with the image features and the text features, and the feature representation of the target three-dimensional model is better realized. An illustrative inference-time sketch is given below.
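By way of illustration only, inference with the trained target feature extraction network may look like the following; the encoder names and the simple view pooling are assumptions of this sketch, and any one (or several) of the returned feature vectors can serve as the three-dimensional feature representation.

```python
import torch

@torch.no_grad()
def represent_3d_model(points, views, category_tokens,
                       point_encoder, image_encoder, text_encoder):
    """points: (B, N, 3); views: (B, V, C, H, W); category_tokens: (B, L)."""
    h_c = point_encoder(points)                                         # second point cloud feature vector
    h_i = image_encoder(views.flatten(0, 1)).view(views.size(0),
                                                  views.size(1), -1).mean(dim=1)  # view-pooled image feature
    h_t = text_encoder(category_tokens)                                 # second text feature vector
    return {"point_cloud": h_c, "image": h_i, "text": h_t}
```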
The device provided by the embodiment of the present invention has the same implementation principle and technical effects as those of the foregoing method embodiment, and for the sake of brevity, reference may be made to the corresponding content in the foregoing method embodiment where the device embodiment is not mentioned.
The embodiment of the invention provides electronic equipment, which comprises a processor and a storage device; the storage means has stored thereon a computer program which, when executed by the processor, performs:
a training method of a multi-modal feature extraction network, comprising:
acquiring a multi-modal training data set; the multi-modal training data set comprises a three-dimensional model, a multi-angle rendering image set corresponding to the three-dimensional model and a category text tree;
respectively extracting a first point cloud feature vector of the three-dimensional model, a first image feature vector of the multi-angle rendering image set and a first text feature vector of the category text tree through an initial feature extraction network;
determining, by the cross-modal joint condition modeling network, a joint modal feature vector based on the first image feature vector and the first text feature vector;
training the initial feature extraction network based on the first point cloud feature vector, the first text feature vector and the joint modal feature vector to obtain a target feature extraction network.
In one embodiment, the initial feature extraction network includes a point cloud feature extraction sub-network and a structured multi-modal data organization sub-network; extracting, by the initial feature extraction network, a first point cloud feature vector of the three-dimensional model, a first image feature vector of the multi-angle rendering image set, and a first text feature vector of the category text tree, respectively, including:
Extracting a first point cloud feature vector of the three-dimensional model through a point cloud feature extraction sub-network;
performing proximity sampling processing on the multi-angle rendering image set through a structured multi-mode data organization sub-network to obtain a multi-angle rendering image subset, and extracting a first image feature vector of the multi-angle rendering image subset; and determining a target parent category text and a target sub-category text to which the three-dimensional model belongs from the category text tree, and extracting a first text feature vector of the target sub-category text.
In one embodiment, extracting a first point cloud feature vector of a three-dimensional model through a point cloud feature extraction sub-network includes:
determining point cloud data corresponding to the three-dimensional model;
performing data enhancement processing on the point cloud data to obtain enhanced point cloud data;
and extracting a first point cloud feature vector of the enhanced point cloud data through the point cloud feature extraction sub-network.
In one embodiment, performing proximity sampling on the multi-angle rendering image set to obtain a multi-angle rendering image subset, extracting a first image feature vector of the multi-angle rendering image subset, including:
according to a preset angle threshold, performing proximity sampling processing on the multi-angle rendering image set to obtain a multi-angle rendering image subset; wherein, the difference value of rendering angles between two adjacent rendering images in the multi-angle rendering image subset is smaller than a preset angle threshold;
Extracting an initial image feature vector of each rendered image in the multi-angle rendered image subset;
determining an angle position code and a depth map position code corresponding to each rendering image according to the rendering angle of each rendering image in the multi-angle rendering image subset;
and labeling the angle position code and the depth map position code to the initial image feature vector of each rendering image to obtain a first image feature vector of each rendering image.
In one embodiment, determining, by a cross-modal joint condition modeling network, a joint modal feature vector based on a first image feature vector and a first text feature vector, includes:
performing matrix multiplication on the first image feature vector and the first text feature vector through a cross-modal joint condition modeling network to obtain a confusion matrix; the confusion matrix is used for representing probability distribution of the target subcategory text relative to the first image feature vectors of different rendering angles;
and weighting the first image feature vectors with different rendering angles by using a confusion matrix through a cross-modal joint condition modeling network to obtain joint modal feature vectors.
In one embodiment, training the initial feature extraction network based on the first point cloud feature vector, the first text feature vector, and the joint modal feature vector to obtain a target feature extraction network includes:
Combining the feature vectors of any two modes in the first point cloud feature vector, the first text feature vector and the joint mode feature vector to obtain a combined feature vector, and determining a first target loss value based on each combined feature vector by utilizing a contrast learning algorithm;
clustering point cloud data corresponding to the three-dimensional model based on the target parent class text, and determining a second target loss value based on the clustered point cloud data;
and training the initial feature extraction network by utilizing the sum value of the first target loss value and the second target loss value to obtain a target feature extraction network.
In one embodiment, combining feature vectors of any two modes of the first point cloud feature vector, the first text feature vector and the joint mode feature vector to obtain a combined feature vector, and determining a first target loss value based on each combined feature vector by using a contrast learning algorithm, including:
determining a first sub-loss value based on the first point cloud feature vector and the first text feature vector by using a contrast learning algorithm; and determining a second sub-loss value based on the first point cloud feature vector and the joint modality feature vector; and determining a third sub-penalty value based on the first text feature vector and the joint modality feature vector;
And weighting the first sub-loss value, the second sub-loss value and the third sub-loss value to obtain a first target loss value.
According to the electronic equipment provided by the embodiment of the invention, the representations of the visual and language modalities are enriched by introducing the multi-angle rendering image set and the category text tree, so that the upper limit of the feature representation capability of the three-dimensional model is raised and the problem of information degradation is solved; in addition, after the multi-modal feature vectors are extracted through the initial feature extraction network, the cross-modal joint condition modeling network is used to determine the joint modal feature vector based on the first image feature vector and the first text feature vector, so that language knowledge is integrated into the visual modality to model the joint modality, and the problem of insufficient synergy is solved; finally, the initial feature extraction network is trained with the first point cloud feature vector, the first text feature vector and the joint modal feature vector, so that the trained target feature extraction network yields a unified characterization of point clouds, text and images when performing feature extraction.
A three-dimensional feature representation method, comprising:
acquiring a target three-dimensional model to be processed, a target multi-angle rendering image set corresponding to the target three-dimensional model and a target category text;
extracting a second point cloud feature vector of the target three-dimensional model, a second image feature vector of the target multi-angle rendering image set and a second text feature vector of the target category text through a target feature extraction network obtained by pre-training; wherein the target feature extraction network is obtained through the above training method of the multi-modal feature extraction network;
and taking one or more of the second point cloud feature vector, the second image feature vector and the second text feature vector as a three-dimensional feature representation of the target three-dimensional model.
According to the electronic equipment provided by the embodiment of the invention, the multi-modal features of the target three-dimensional model are extracted by utilizing the target feature extraction network obtained with the above training method of the multi-modal feature extraction network, so that the point cloud features share a unified semantic expression with the image features and the text features, and the feature representation of the target three-dimensional model is better realized.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, where the electronic device 100 includes: a processor 80, a memory 81, a bus 82 and a communication interface 83, the processor 80, the communication interface 83 and the memory 81 being connected by the bus 82; the processor 80 is arranged to execute executable modules, such as computer programs, stored in the memory 81.
The memory 81 may include a high-speed Random Access Memory (RAM), and may further include a non-volatile memory, such as at least one magnetic disk memory. The communication connection between the system network element and at least one other network element is implemented via at least one communication interface 83 (which may be wired or wireless), and may use the Internet, a wide area network, a local area network, a metropolitan area network, etc.
Bus 82 may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus may be classified into an address bus, a data bus, a control bus, etc. For ease of illustration, only one bi-directional arrow is shown in fig. 8, but this does not mean that there is only one bus or only one type of bus.
The memory 81 is configured to store a program, and the processor 80 executes the program after receiving an execution instruction; the method executed by the flow-defined apparatus disclosed in any of the foregoing embodiments of the present invention may be applied to the processor 80 or implemented by the processor 80.
The processor 80 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuitry in hardware or instructions in software in processor 80. The processor 80 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but may also be a digital signal processor (Digital Signal Processing, DSP for short), application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), off-the-shelf programmable gate array (Field-Programmable Gate Array, FPGA for short), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The disclosed methods, steps, and logic blocks in the embodiments of the present invention may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory 81 and the processor 80 reads the information in the memory 81 and in combination with its hardware performs the steps of the method described above.
A computer program product of a readable storage medium according to an embodiment of the present invention includes a computer readable storage medium storing program code including instructions operable to perform:
a training method of a multi-modal feature extraction network, comprising:
acquiring a multi-modal training data set; the multi-modal training data set comprises a three-dimensional model, a multi-angle rendering image set corresponding to the three-dimensional model and a category text tree;
respectively extracting a first point cloud feature vector of the three-dimensional model, a first image feature vector of the multi-angle rendering image set and a first text feature vector of the category text tree through an initial feature extraction network;
determining, by the cross-modal joint condition modeling network, a joint modal feature vector based on the first image feature vector and the first text feature vector;
training the initial feature extraction network based on the first point cloud feature vector, the first text feature vector and the joint modal feature vector to obtain a target feature extraction network.
In one embodiment, the initial feature extraction network includes a point cloud feature extraction sub-network and a structured multi-modal data organization sub-network; extracting, by the initial feature extraction network, a first point cloud feature vector of the three-dimensional model, a first image feature vector of the multi-angle rendering image set, and a first text feature vector of the category text tree, respectively, including:
Extracting a first point cloud feature vector of the three-dimensional model through a point cloud feature extraction sub-network;
performing proximity sampling processing on the multi-angle rendering image set through a structured multi-mode data organization sub-network to obtain a multi-angle rendering image subset, and extracting a first image feature vector of the multi-angle rendering image subset; and determining a target parent category text and a target sub-category text to which the three-dimensional model belongs from the category text tree, and extracting a first text feature vector of the target sub-category text.
In one embodiment, extracting a first point cloud feature vector of a three-dimensional model through a point cloud feature extraction sub-network includes:
determining point cloud data corresponding to the three-dimensional model;
performing data enhancement processing on the point cloud data to obtain enhanced point cloud data;
and extracting a first point cloud feature vector of the enhanced point cloud data through the point cloud feature extraction sub-network.
In one embodiment, performing proximity sampling on the multi-angle rendering image set to obtain a multi-angle rendering image subset, extracting a first image feature vector of the multi-angle rendering image subset, including:
according to a preset angle threshold, performing proximity sampling processing on the multi-angle rendering image set to obtain a multi-angle rendering image subset; wherein, the difference value of rendering angles between two adjacent rendering images in the multi-angle rendering image subset is smaller than a preset angle threshold;
Extracting an initial image feature vector of each rendered image in the multi-angle rendered image subset;
determining an angle position code and a depth map position code corresponding to each rendering image according to the rendering angle of each rendering image in the multi-angle rendering image subset;
and labeling the angle position code and the depth map position code to the initial image feature vector of each rendering image to obtain a first image feature vector of each rendering image.
In one embodiment, determining, by a cross-modal joint condition modeling network, a joint modal feature vector based on a first image feature vector and a first text feature vector, includes:
performing matrix multiplication on the first image feature vector and the first text feature vector through a cross-modal joint condition modeling network to obtain a confusion matrix; the confusion matrix is used for representing probability distribution of the target subcategory text relative to the first image feature vectors of different rendering angles;
and weighting the first image feature vectors with different rendering angles by using a confusion matrix through a cross-modal joint condition modeling network to obtain joint modal feature vectors.
In one embodiment, training the initial feature extraction network based on the first point cloud feature vector, the first text feature vector, and the joint modal feature vector to obtain a target feature extraction network includes:
Combining the feature vectors of any two modes in the first point cloud feature vector, the first text feature vector and the joint mode feature vector to obtain a combined feature vector, and determining a first target loss value based on each combined feature vector by utilizing a contrast learning algorithm;
clustering point cloud data corresponding to the three-dimensional model based on the target parent class text, and determining a second target loss value based on the clustered point cloud data;
and training the initial feature extraction network by utilizing the sum value of the first target loss value and the second target loss value to obtain a target feature extraction network.
In one embodiment, combining feature vectors of any two modes of the first point cloud feature vector, the first text feature vector and the joint mode feature vector to obtain a combined feature vector, and determining a first target loss value based on each combined feature vector by using a contrast learning algorithm, including:
determining a first sub-loss value based on the first point cloud feature vector and the first text feature vector by using a contrast learning algorithm; and determining a second sub-loss value based on the first point cloud feature vector and the joint modality feature vector; and determining a third sub-penalty value based on the first text feature vector and the joint modality feature vector;
And weighting the first sub-loss value, the second sub-loss value and the third sub-loss value to obtain a first target loss value.
The readable storage medium provided by the embodiment of the invention enriches the representations of the visual and language modalities by introducing the multi-angle rendering image set and the category text tree, thereby raising the upper limit of the feature representation capability of the three-dimensional model and mitigating the problem of information degradation; in addition, after the multi-modal feature vectors are extracted through the initial feature extraction network, the cross-modal joint condition modeling network is used to determine the joint modal feature vector based on the first image feature vector and the first text feature vector, so that language knowledge is integrated into the visual modality to model the joint modality, and the problem of insufficient synergy is solved; finally, the initial feature extraction network is trained with the first point cloud feature vector, the first text feature vector and the joint modal feature vector, so that the trained target feature extraction network yields a unified characterization of point clouds, text and images when performing feature extraction.
A three-dimensional feature representation method, comprising:
acquiring a target three-dimensional model to be processed, a target multi-angle rendering image set corresponding to the target three-dimensional model and a target category text;
extracting a second point cloud feature vector of the target three-dimensional model, a second image feature vector of the target multi-angle rendering image set and a second text feature vector of the target category text through a target feature extraction network obtained by pre-training; wherein the target feature extraction network is obtained through the above training method of the multi-modal feature extraction network;
and taking one or more of the second point cloud feature vector, the second image feature vector and the second text feature vector as a three-dimensional feature representation of the target three-dimensional model.
The readable storage medium provided by the embodiment of the invention utilizes the target feature extraction network obtained with the above training method of the multi-modal feature extraction network to extract the multi-modal features of the target three-dimensional model, so that the point cloud features share a unified semantic expression with the image features and the text features, and the feature representation of the target three-dimensional model is better realized.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that: the above examples are only specific embodiments of the present invention, used to illustrate rather than limit the technical solution of the present invention, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing examples, any person skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions of some of their technical features within the technical scope disclosed by the present invention; such modifications, changes or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

1. A method for training a multi-modal feature extraction network, comprising:
acquiring a multi-modal training data set; the multi-modal training data set comprises a three-dimensional model, a multi-angle rendering image set corresponding to the three-dimensional model and a category text tree;
Respectively extracting a first point cloud feature vector of the three-dimensional model, a first image feature vector of the multi-angle rendering image set and a first text feature vector of the category text tree through an initial feature extraction network;
determining, by a cross-modal joint condition modeling network, a joint modal feature vector based on the first image feature vector and the first text feature vector;
training the initial feature extraction network based on the first point cloud feature vector, the first text feature vector and the joint modal feature vector to obtain a target feature extraction network.
2. The method of training a multimodal feature extraction network of claim 1 wherein the initial feature extraction network comprises a point cloud feature extraction sub-network and a structured multimodal data organization sub-network; extracting, by an initial feature extraction network, a first point cloud feature vector of the three-dimensional model, a first image feature vector of the multi-angle rendering image set, and a first text feature vector of the category text tree, respectively, including:
extracting a first point cloud feature vector of the three-dimensional model through the point cloud feature extraction sub-network;
Performing proximity sampling processing on the multi-angle rendering image set through the structured multi-mode data organization sub-network to obtain a multi-angle rendering image subset, and extracting a first image feature vector of the multi-angle rendering image subset; and determining a target parent category text and a target sub-category text to which the three-dimensional model belongs from the category text tree, and extracting a first text feature vector of the target sub-category text.
3. The training method of the multi-modal feature extraction network according to claim 2, wherein extracting, by the point cloud feature extraction sub-network, a first point cloud feature vector of the three-dimensional model includes:
determining point cloud data corresponding to the three-dimensional model;
performing data enhancement processing on the point cloud data to obtain enhanced point cloud data;
and extracting a first point cloud feature vector of the enhanced point cloud data through the point cloud feature extraction sub-network.
4. The training method of the multi-modal feature extraction network according to claim 2, wherein performing proximity sampling processing on the multi-angle rendered image set to obtain a multi-angle rendered image subset, extracting a first image feature vector of the multi-angle rendered image subset, includes:
According to a preset angle threshold, performing proximity sampling processing on the multi-angle rendering image set to obtain a multi-angle rendering image subset; wherein a rendering angle difference between two adjacent rendering images in the multi-angle rendering image subset is smaller than the preset angle threshold;
extracting an initial image feature vector of each of the rendered images in the multi-angle rendered image subset;
determining an angle position code and a depth map position code corresponding to each rendering image according to the rendering angle of each rendering image in the multi-angle rendering image subset;
and labeling the angle position codes and the depth map position codes to the initial image feature vector of each rendering image to obtain a first image feature vector of each rendering image.
5. The training method of a multimodal feature extraction network of claim 2, wherein determining, by a cross-modal joint condition modeling network, a joint modality feature vector based on the first image feature vector and the first text feature vector, comprises:
performing matrix multiplication on the first image feature vector and the first text feature vector through a cross-modal joint condition modeling network to obtain a confusion matrix; wherein the confusion matrix is used for representing probability distribution of the target subcategory text relative to the first image feature vector at different rendering angles;
And weighting the first image feature vectors with different rendering angles by using the confusion matrix through a cross-modal joint condition modeling network to obtain joint modal feature vectors.
6. The training method of the multi-modal feature extraction network according to claim 2, wherein training the initial feature extraction network based on the first point cloud feature vector, the first text feature vector, and the joint modal feature vector to obtain a target feature extraction network comprises:
combining the feature vectors of any two modes in the first point cloud feature vector, the first text feature vector and the joint mode feature vector to obtain combined feature vectors, and determining a first target loss value based on each combined feature vector by utilizing a contrast learning algorithm;
clustering point cloud data corresponding to the three-dimensional model based on the target parent class text, and determining a second target loss value based on the clustered point cloud data;
and training the initial feature extraction network by utilizing the sum value of the first target loss value and the second target loss value to obtain a target feature extraction network.
7. The training method of the multi-modal feature extraction network according to claim 6, wherein combining feature vectors of any two modalities of the first point cloud feature vector, the first text feature vector, and the joint modality feature vector to obtain a combined feature vector, and determining a first target loss value based on each of the combined feature vectors by using a contrast learning algorithm, includes:
determining a first sub-loss value based on the first point cloud feature vector and the first text feature vector by using a contrast learning algorithm; and determining a second sub-loss value based on the first point cloud feature vector and the joint modality feature vector; and determining a third sub-loss value based on the first text feature vector and the joint modality feature vector;
and weighting the first sub-loss value, the second sub-loss value and the third sub-loss value to obtain a first target loss value.
8. A method of three-dimensional feature representation, comprising:
acquiring a target three-dimensional model to be processed, a target multi-angle rendering image set corresponding to the target three-dimensional model and a target category text;
extracting a second point cloud feature vector of the target three-dimensional model, a second image feature vector of the target multi-angle rendering image set and a second text feature vector of the target category text through a target feature extraction network obtained through pre-training; wherein the target feature extraction network is trained by the multi-modal feature extraction network training method according to any one of claims 1-7;
And taking one or more of the second point cloud feature vector, the second image feature vector and the second text feature vector as a three-dimensional feature representation of the target three-dimensional model.
9. A training device for a multi-modal feature extraction network, comprising:
the first data acquisition module is used for acquiring a multi-mode training data set; the multi-modal training data set comprises a three-dimensional model, a multi-angle rendering image set corresponding to the three-dimensional model and a category text tree;
the first feature extraction module is used for respectively extracting a first point cloud feature vector of the three-dimensional model, a first image feature vector of the multi-angle rendering image set and a first text feature vector of the category text tree through an initial feature extraction network;
a joint feature determination module configured to determine, by a cross-modal joint condition modeling network, a joint modal feature vector based on the first image feature vector and the first text feature vector;
and the network training module is used for training the initial feature extraction network based on the first point cloud feature vector, the first text feature vector and the joint mode feature vector to obtain a target feature extraction network.
10. A three-dimensional feature representation apparatus, comprising:
the second data acquisition module is used for acquiring a target three-dimensional model to be processed, a target multi-angle rendering image set corresponding to the target three-dimensional model and a target category text;
the second feature extraction module is used for extracting a second point cloud feature vector of the target three-dimensional model, a second image feature vector of the target multi-angle rendering image set and a second text feature vector of the target category text through a target feature extraction network obtained through pre-training; wherein the target feature extraction network is trained by the multi-modal feature extraction network training method according to any one of claims 1-7;
and the feature representation determining module is used for taking one or more of the second point cloud feature vector, the second image feature vector and the second text feature vector as a three-dimensional feature representation of the target three-dimensional model.
11. An electronic device comprising a processor and a memory, the memory storing computer-executable instructions executable by the processor, the processor executing the computer-executable instructions to implement the method of any one of claims 1 to 8.
12. A computer readable storage medium storing computer executable instructions which, when invoked and executed by a processor, cause the processor to implement the method of any one of claims 1 to 8.
CN202310938930.0A 2023-07-27 2023-07-27 Training method of multi-mode feature extraction network and three-dimensional feature representation method Pending CN116958957A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310938930.0A CN116958957A (en) 2023-07-27 2023-07-27 Training method of multi-mode feature extraction network and three-dimensional feature representation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310938930.0A CN116958957A (en) 2023-07-27 2023-07-27 Training method of multi-mode feature extraction network and three-dimensional feature representation method

Publications (1)

Publication Number Publication Date
CN116958957A true CN116958957A (en) 2023-10-27

Family

ID=88456292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310938930.0A Pending CN116958957A (en) 2023-07-27 2023-07-27 Training method of multi-mode feature extraction network and three-dimensional feature representation method

Country Status (1)

Country Link
CN (1) CN116958957A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117541810A (en) * 2023-11-17 2024-02-09 粤港澳大湾区(广东)国创中心 Three-dimensional feature extraction method, three-dimensional feature extraction device, electronic equipment and readable storage medium
CN117473105A (en) * 2023-12-28 2024-01-30 浪潮电子信息产业股份有限公司 Three-dimensional content generation method based on multi-mode pre-training model and related components
CN117473105B (en) * 2023-12-28 2024-04-05 浪潮电子信息产业股份有限公司 Three-dimensional content generation method based on multi-mode pre-training model and related components
CN117853678A (en) * 2024-03-08 2024-04-09 陕西天润科技股份有限公司 Method for carrying out three-dimensional materialization transformation on geospatial data based on multi-source remote sensing
CN117853678B (en) * 2024-03-08 2024-05-17 陕西天润科技股份有限公司 Method for carrying out three-dimensional materialization transformation on geospatial data based on multi-source remote sensing

Similar Documents

Publication Publication Date Title
CN107291871B (en) Matching degree evaluation method, device and medium for multi-domain information based on artificial intelligence
JP4745758B2 (en) Spatial recognition and grouping of text and graphics
Torralba et al. Labelme: Online image annotation and applications
CN116958957A (en) Training method of multi-mode feature extraction network and three-dimensional feature representation method
Xu et al. Image search by concept map
CN111488931A (en) Article quality evaluation method, article recommendation method and corresponding devices
Sun et al. Monte Carlo convex hull model for classification of traditional Chinese paintings
Liu et al. Automated assembly of shredded pieces from multiple photos
CN112418216A (en) Method for detecting characters in complex natural scene image
Zhang et al. A survey on freehand sketch recognition and retrieval
Martinet et al. A relational vector space model using an advanced weighting scheme for image retrieval
CN113158895A (en) Bill identification method and device, electronic equipment and storage medium
Schnürer et al. Detection of pictorial map objects with convolutional neural networks
JP2017084349A (en) Memory with set operation function and method for set operation processing using the memory
Zhao et al. Learning best views of 3D shapes from sketch contour
Sun et al. Marine ship instance segmentation by deep neural networks using a global and local attention (GALA) mechanism
Kim et al. Category‐Specific Salient View Selection via Deep Convolutional Neural Networks
US20230036812A1 (en) Text Line Detection
CN114332599A (en) Image recognition method, image recognition device, computer equipment, storage medium and product
Dahikar et al. Sketch Captioning Using LSTM and BiLSTM
Sun et al. Contextual models for automatic building extraction in high resolution remote sensing image using object-based boosting method
JP2011118481A (en) Program and apparatus for processing image
Kim et al. Personness estimation for real-time human detection on mobile devices
CN109068173A (en) A kind of method for processing video frequency and video process apparatus
Lai et al. Pattern Recognition and Computer Vision: First Chinese Conference, PRCV 2018, Guangzhou, China, November 23-26, 2018, Proceedings, Part III

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination