WO2021128611A1

WO2021128611A1 - Motion prediction method based on deep neural network, and intelligent terminal

Info

Publication number: WO2021128611A1
Application number: PCT/CN2020/080091
Authority: WO
Inventors: 胡瑞珍; 黄惠; 闫子豪
Original assignee: 深圳大学
Priority date: 2019-12-27
Filing date: 2020-03-19
Publication date: 2021-07-01
Also published as: CN111080671A; CN111080671B

Abstract

Disclosed are a motion prediction method based on a deep neural network, and an intelligent terminal. The method comprises: training a deep neural network using a data set; inputting a three-dimensional point cloud into the deep neural network; outputting a first portion and a second portion of the three-dimensional point cloud by means of the deep neural network, wherein the first portion is used as a motion sub-unit, and the second portion is used as a reference portion of a motion unit; and completing network prediction according to the output of the three-dimensional point cloud, and outputting motion information, wherein the motion information comprises motion segmentation, a motion axis and a motion type. According to the present invention, prediction results of simultaneous motion, in a static state, and components of various hinged objects, which are not structured and are possibly partially scanned, are obtained, and the motion of object components can be predicted very accurately.

Description

A motion prediction method and intelligent terminal based on deep neural network

Technical field

The invention relates to the technical field of deep learning, in particular to a motion prediction method based on a deep neural network, an intelligent terminal and a storage medium.

Background technique

In recent years, computer graphics and related fields such as computer vision and robotics have focused on inferring the possible movement of three-dimensional objects and their components, because this problem is closely related to the understanding of the intuitive functionality and functionality of objects. The more difficult problem to solve is whether and how the machine can learn to predict component motion or component motion when given only a few static states of a three-dimensional object.

Existing methods have proposed to obtain and reconstruct according to the movement of the object, represent and understand the movement of the object, and even predict the movement of the part based on the stationary object. The motivation behind these works is to understand the movement of the object more comprehensively, which is helpful for graphics applications, such as animation. Object pose correction and reconstruction, and robot applications, such as the modeling of human-computer interaction in 3D scenes.

In the field of robotics, a lot of work has focused on the problem of functional visibility prediction. Their goal is to identify areas in objects that can perform specific interactions, such as grasping or pushing. Recently applied to deep neural networks to mark images of functional visibility labels, or physical simulations to obtain human utility closely related to functional visibility. A more general method of functional visibility analysis is based on the idea of human pose hypothesis, predicting the best human pose that fits a given scene context to assist in understanding the scene. Based on the interaction between people and objects, the posture assumption of the human body can also be used to predict the functional category of the object. Closely related to functional visibility and human pose analysis is activity recognition. One example is the detection of active areas in the input scene. These areas support specific types of human activities, such as eating or watching TV. Although functional visibility detection identifies areas that can perform a specific type of movement, such as turning or sliding; the predicted movement is only described by tags and is limited to human interaction. Therefore, they cannot represent the general motion of an object. The focus of more general methods of functional visibility analysis is to understand the actions that interact with specific objects or actions in a given scene at a high level. However, these methods cannot detect or model specific movements or parts related to these actions. movement.

In computer vision, methods have been proposed to infer the state of future objects based on the description of the current object. These methods implicitly predict the movement of the object in the image and the future movement. The general solution is to train Generative Adversarial Networks (GANs) on video data to generate subsequent frames of the input image. On the other hand, the video is decomposed into content and motion components, and then based on the selected content and motion, the decomposed content and motion components are used to create subsequent frames of the video.

The work of computer graphics also has the problem of inferring the motion of three-dimensional objects. The movement of the mechanical component is explained by predicting the possible movement of the mechanical component and the entire assembly from the geometric deployment of the component. For example, creating diagram animations from concept sketches. For more general shapes, interactive landscapes are introduced, which are motion representations of objects being used in a certain way, for example, a cup is used by people to drink water. This representation can then be used to classify motion into different types of interactions and also to predict the interactions supported by the object within a few seconds of its motion. For example, a structure called a motion tree is used to obtain the relative motion of objects in the scene. The structure tree is inferred from different instances of objects found in different geometric configurations. When a three-dimensional object segmented by a component is given, the possible motion and motion parameters of the object component are predicted based on the model learned from a small amount of static motion object data set of each object. This model effectively relates the geometry of the object to its possible movement. From two unsegmented functionally similar instances or objects with the same motion but in different motion states, the possible motion of the parts of the object is predicted. Although it is possible to infer the motion of objects in the scene, it is limited by the assumption that multiple instances of objects appear in the scene. The disadvantage of the data-driven method is that the object needs to be segmented well. Some shortcomings are that the input of the designed network requires a pair of objects with the same motion state but different rotation angles as input. When it is necessary to directly obtain functional predictions in a three-dimensional scene, for example, in robot navigation, it is unrealistic to expect either a pre-segmented object or a rotating object pair.

Therefore, the existing technology needs to be improved and developed.

Summary of the invention

In view of the above-mentioned defects of the prior art, the present invention provides a motion prediction method based on a deep neural network, an intelligent terminal and a storage medium.

The technical solutions adopted by the present invention to solve the technical problems are as follows:

A method for motion prediction based on a deep neural network, wherein the method for motion prediction based on a deep neural network includes:

Use data sets to train deep neural networks;

Inputting a three-dimensional point cloud to the deep neural network;

The deep neural network outputs the first part and the second part of the three-dimensional point cloud, using the first part as a motion subunit, and the second part as a reference part of the motion unit;

The network prediction is completed according to the output of the three-dimensional point cloud, and the motion information is output. The motion information includes the motion segmentation, the motion axis, and the motion type.

In the motion prediction method based on the deep neural network, the loss function used when training the deep neural network is:

Among them, D _t represents the displacement map, S represents the segmentation, M represents the fitting motion parameter, L _rec is the reconstruction error, L _disp is the displacement error, L _seg is the segmentation error, and L _mob is the regression error of the motion parameter;

Reconstruction error represents the degree of distortion of the shape, displacement error represents the accuracy of the moving part, segmentation error and regression error describe the correctness of the motion information, including the division of motion and immobility, the position, direction and type of motion of the motion axis.

In the motion prediction method based on the deep neural network, L _rec describes the geometric error between the predicted point cloud after motion and the real point cloud after motion;

Divide the point cloud P ₀ into a reference part and a motion part, after experiencing motion

Later, the reference part remains stationary, and the moving part is rigid. Among them, P _t-1 and P _t represent two adjacent point cloud frames, so L _rec is divided into two parts:

Is the error of the reference part,

Is the error of the moving part;

Is the sum of the squares of the error distance of each point:

among them,

Is the real position of point p;

The composition is:

Among them, L _shape is used to punish points that do not match the target shape, and L _density is the local point density of the predicted point cloud and the target point cloud.

Refers to the moving part of the point cloud of the t-th frame generated by the deep neural network,

Refers to the moving part of the point cloud of the correct t-th frame. gt is the abbreviation of ground truth, which means correct.

In the motion prediction method based on the deep neural network, the difference between the motion information and the target motion information is predicted by an error loss function; the motion type includes rotation motion and translation motion.

In the motion prediction method based on the deep neural network, for the rotation motion, the loss function is as follows:

Among them, dot means dot product,

Represents the displacement map of the point cloud p at the t-th frame, and d _gt is the direction of the correct axis of motion;

It describes whether the predicted displacement is perpendicular to the real axis of motion. The specific calculation formula is:

It is the deviation of the rotation angle of each point, and the rotation angle of all points is the same. The specific calculation formula is:

Among them, σ is a constant, proj(p) represents the distance between the point p and the projection point that projects the point p onto the correct axis of motion,

It is required that each point is the same distance from the real axis of rotation before and after rotation, and the circularity of its motion is restricted. The specific calculation formula is:

In the motion prediction method based on the deep neural network, for translational motion, the loss function is as follows:

It describes whether the predicted displacement is parallel to the real axis of motion. The specific calculation formula is:

It is required that the distance moved by each point is the same, and the variance is 0. The specific calculation formula is:

In the motion prediction method based on a deep neural network, the motion information loss function is:

Among them, d, x and t are the direction of the motion axis, the position of the motion axis and the type of motion, respectively, d ^gt is the correct direction of the motion axis, x ^gt is the correct position of the motion axis, t ^gt is the correct type of motion, and H is the cross entropy .

In the motion prediction method based on a deep neural network, the number of points in the three-dimensional point cloud is 1024.

An intelligent terminal, wherein the intelligent terminal includes the above-mentioned deep neural network-based motion prediction system, and further includes: a memory, a processor, and a memory based on the memory and capable of running on the processor. A motion prediction program based on a deep neural network, which implements the steps of the above-mentioned deep neural network-based motion prediction method when the motion prediction program based on the deep neural network is executed by the processor.

A storage medium, wherein the storage medium stores a motion prediction program based on a deep neural network, and when the motion prediction program based on the deep neural network is executed by a processor, it realizes the above-mentioned motion prediction method based on the deep neural network step.

The present invention uses a data set to train a deep neural network; inputs a three-dimensional point cloud to the deep neural network; the deep neural network outputs the first part and the second part of the three-dimensional point cloud, and uses the first part as a motion subunit The second part is used as the reference part of the motion unit; the network prediction is completed according to the output of the three-dimensional point cloud, and the motion information is output. The motion information includes the motion segmentation, the motion axis, and the motion type. The present invention realizes the prediction result of simultaneous movement and components of various hinged objects that are unstructured and may be partially scanned in a static state, and can predict the movement of the object components very accurately.

Description of the drawings

Fig. 1 is a flowchart of a preferred embodiment of a motion prediction method based on a deep neural network of the present invention;

2 is a schematic diagram of the deep neural network learning a deep prediction model from a training set in a preferred embodiment of the motion prediction method based on a deep neural network of the present invention, and the training set covers various motions of different objects;

3 is a schematic diagram of the structure of a long and short-term memory network in a preferred embodiment of a motion prediction method based on a deep neural network of the present invention;

FIG. 4 is a schematic diagram of the movement type in the preferred embodiment of the motion prediction method based on the deep neural network of the present invention as a rotation movement; FIG.

5 is a schematic diagram of the movement type in the preferred embodiment of the motion prediction method based on the deep neural network of the present invention is translational movement;

FIG. 6 is a schematic diagram of a result set of motion and component prediction in different motions of various shapes of complete and partial scans in the preferred embodiment of the motion prediction method based on the deep neural network of the present invention;

7 is a schematic diagram of predicting the parallel movement of the desk in the preferred embodiment of the motion prediction method based on the deep neural network of the present invention;

FIG. 8 is a schematic diagram of the architecture of the reference prediction network "BaseNet" in the preferred embodiment of the motion prediction method based on the deep neural network of the present invention;

9 is a schematic diagram of the visual comparison between MAPP-NET and BaseNet in the preferred embodiment of the motion prediction method based on the deep neural network of the present invention;

FIG. 10 is a schematic diagram of the visualization comparison of the prediction obtained _{without the reconstruction loss item L rec} or the displacement loss item L _disp in the preferred embodiment of the motion prediction method based on the deep neural network of the present invention;

FIG. 11 is a schematic diagram of a visual comparison of motion parameters and segmentation results that are not obtained through network prediction in a preferred embodiment of a motion prediction method based on a deep neural network of the present invention;

FIG. 12 is a schematic diagram of the operating environment of the preferred embodiment of the smart terminal of the present invention.

Detailed ways

In order to make the objectives, technical solutions and advantages of the present invention clearer and clearer, the present invention will be further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, but not used to limit the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.

The motion prediction method based on a deep neural network according to a preferred embodiment of the present invention, as shown in FIG. 1, is a motion prediction method based on a deep neural network, wherein the motion prediction method based on a deep neural network includes the following steps :

Step S10, use the data set to train the deep neural network;

Step S20, input the three-dimensional point cloud to the deep neural network;

Step S30: The deep neural network outputs the first part and the second part of the three-dimensional point cloud, using the first part as a motion subunit, and the second part as a reference part of the motion unit;

Step S40: Complete network prediction according to the output of the three-dimensional point cloud, and output motion information, where the motion information includes motion segmentation, motion axis, and motion type.

The present invention introduces a learning-based method, which simultaneously predicts a single unsegmented point cloud, which may be a partially scanned rotating part of a three-dimensional object, and their movement. The deep neural network of the present invention regards the input three-dimensional object as a motion unit, and outputs two parts of the point cloud. One part is used as the motion sub-unit and the other part is used as the reference part of the motion unit. This is applied iteratively. The part obtained by the network proposed by the invention can predict finer component movement, thereby obtaining the prediction of hierarchical movement and object segmentation based on movement, as shown in FIG. 2. MAPP-NET (Deep Neural Network) learns a deep prediction model from a training set that covers various movements of different objects. Although the problem of motion prediction and segmentation from a single configuration is inherently ill-posed, the learning-based method of the present invention can gather rich clues, such as the geometry of parts and their contexts from training data, and thus speculate Three-dimensional objects that have not been seen.

The core point of the point cloud motion prediction can be regarded as the prediction point pair and the displacement field that changes over time. It allows the network to process unstructured low-level input and take advantage of the instantaneous characteristics of motion. Specifically, The MAPP-NET of the present invention is implemented by a cyclic neural network, its input is a point cloud, and then the displacement of each point in the subsequent frame is predicted, and the input point cloud of each subsequent frame is the reference point. The architecture of the network is composed of encoder-decoder pairs and crossed with Long Short-Term Memory (LSTM), which also predicts the displacement field of the input point cloud; the present invention also adds additional layers to the network. Infer the motion parameters based on the motion segmentation and the predicted displacement field. Therefore, given a point cloud, MAPP-NET not only infers the motion type and motion parameters (such as rotation axis) of the geometric transformation of the point, but also predicts the segmentation of their rotatable part based on the predicted motion state.

The purpose of the present invention is to segment the movable part of a given three-dimensional object, determine the type of object motion, and generate a motion sequence of the next few frames of the object. The object is represented by a single, unsegmented point cloud. The present invention uses a deep neural network to pre-train on a data set to achieve the above goals. Therefore, the main technical problem of the present invention is how to design the network structure and loss function to accomplish the above tasks.

The input of the present invention is a three-dimensional point cloud with 1024 points. It is assumed that the point cloud has only one motion unit, that is, the points of the point cloud are either fixed or belong to the same motion. The output is a point cloud sequence. Each point cloud in the sequence has 1024 points and corresponds to the points in the input point cloud one by one. At the same time, the network also predicts and outputs the motion segmentation S, the motion axis (d, x), and the motion type t. The motion axis information includes the axis direction d and the position of a point x on the axis, which are collectively referred to as motion information M=(t, d, x).

The core of the network is to use a cyclic neural network to predict the displacement of the point in the point cloud, and the displacement is the representation of the movement. Recurrent neural network is used because this kind of network has a good effect in processing sequence data. More specifically, the present invention uses a long and short-term memory network, and uses the network structure collection abstraction layer SA and the feature transfer layer FP in PointNet++. Figure 3 illustrates the structure of the network in detail. The input point cloud P ₀ enters the cyclic neural network after passing through a collective abstraction layer. It contains several sub-networks. The sub-networks are composed of a feature transfer layer and a fully connected layer. The network outputs the motion prediction of a certain frame, that is, the displacement D. After adding it to the input point cloud, the point cloud P of several frames after the movement is obtained. With these point clouds and displacements, segmentation and motion information can be analyzed after passing through some layers. After passing several frames of displacement information into a fully connected layer, the segmentation of the point cloud can be obtained. Motion information can also be obtained separately by a similar method, but the incoming information is several frames of point cloud information after motion instead of displacement, and because of the overall consideration, it is necessary to add a collection abstraction layer before the fully connected layer. The reason why point cloud is used instead of displacement is because it is found in the experiment that the former can have higher accuracy. The specific structure can be seen in Figure 3;

Network training and loss function;

In order to train the above-mentioned multiple output network, the present invention designs the following loss function:

Reconstruction loss function, L _rec describes the geometric error between the predicted point cloud after motion and the real point cloud after motion;

Is the error of the reference part,

Is the error of the moving part;

Is the sum of the squares of the error distance of each point:

among them,

Is the real position of point p;

The composition is:

The difference between the motion information and the target motion information is predicted by the error loss function; the motion type includes rotation motion and translation motion.

For the displacement loss function (error loss function), this displacement loss function can measure the difference between the predicted motion information and the target motion information. As mentioned earlier, this is also for the motion part of the point cloud. As there are different types of exercise, there are also different forms. The present invention only considers two types of motions, rotation and translation.

For rotational motion, see Figure 4, the loss function is as follows:

Among them, dot means dot product,

Represents the displacement map of the point cloud p at the t-th frame, d _gt is the direction of the correct axis of motion;

For translational motion, see Figure 5, the loss function is as follows:

Split loss function

It is a multinomial logistic regression cross entropy (softmax cross entropy) of predicted segmentation and true segmentation.

The motion information loss function is:

Among them, d, x, and t are the direction of the motion axis, the position of the motion axis, and the type of motion, respectively, d ^gt is the correct motion axis direction, x ^gt is the correct motion axis position, t ^gt is the correct motion type, and H is the cross entropy .

The present invention completes the prediction of the future motion of the object by introducing a new cyclic neural network structure and several novel loss functions, which include the point cloud state at several moments in the future, the segmentation of the motion part, the motion type and the motion parameters.

Further, the present invention demonstrates the use of MAPP-NET to obtain the motility prediction, and evaluates the different components of the method. The present invention uses the loss function defined by the following formula (1) and the Adam random optimization sub-training network. In the experiment of the present invention, the motion unit data set is used. The present invention samples the visible surface of the unit to create a point cloud, which is called a "full scan". The present invention divides the data set into training/testing units according to the division ratio of 90/10. The present invention also obtains a set of partial scans from the test set for additional evaluation.

Figure 6 shows an example of motion prediction on the test unit for complete and partial scans. For each example, the predicted frame of the first 5 frames of each input point cloud is shown, and the predicted transformation axis, reference part and moving part are drawn. It can be observed how MAPP-NET predicts the correct part movement and generates the corresponding movement sequence for different objects with different types of motion. For example, the method of the present invention accurately predicts the rotational motion of the shape of different axis directions and positions, including horizontal and vertical axis directions, such as the flip phone shown in the first row (left) and the second row (left) shows Rotating flash drive device (U disk). The method of the present invention also accurately predicts the position of the axle, such as the luggage case shown in the fourth row (left) and the stacker example shown in the second row (right).

It can also be seen that for translational movements, such as the movement of the fifth row (right) drawer, MAPP-NET can predict the correct direction of its opening by translation, although the data only shows the front surface of the drawer, without the internal structure; because the object is wrapped The reference part is too large. A similar result was found for the handle of the drawer in the third row (left), but a different type of movement was predicted. Furthermore, we can find from the examples shown in the fifth row (left) and the last row (right) that for those input point clouds that are already close to the end of the frame, the method of the present invention has learned to stop generating new points after finding the stop state of motion. This shows that the method can predict the range of motion.

In addition, MAPP-NET can also predict the movement of multiple parts of the same object. Given an object with more than one moving part, the method of the present invention can either predict multiple motions iteratively, as shown in Fig. 2; or predict the motions of different components at the same time, especially components of different motion types. This is feasible because the present invention trains a single network to predict all the different types of motion, such as translation and rotation. As in the example of simultaneous motion in Fig. 7, the present invention shows all 5 consecutive frames of the predicted segmentation. The moving parts of the generated frame (red) are shown in a lighter color when they are closer to the input frame.

Regarding the mobility predicted by MAPP-NET in the test set, the present invention performs quantitative evaluation by measuring the motion parameters and segmentation errors, because the present invention has benchmarks that can be used. Specifically, for each test unit, the present invention uses two metrics to calculate the error of the predicted transformation axis M = (d, x) compared with the reference axis M ^gt = (d ^gt , x ^gt ). The first metric explains the error of the predicted axis direction:

E _angle = arccos(|dot(d/||d|| ₂ , d ^gt /||d ^gt || ₂ )|);

It simply represents the angle of deviation between the prediction and the reference axis, in the range [0, π/2]. The second measurement method calculates the error of the shaft position:

E _dist =min(||x-π(x)|| ₂ , 1);

π(x) projects the point x to the reference axis of motion determined ^{by M gt} =(d ^gt , x ^{gt ).} Because all shapes are regularized into one unit body, the present invention truncates the maximum distance to 1. Note that the translation does not have a defined axis position. So for translation, only the axis direction is calculated. When the classification is wrong, the transformation type error E _{seg is} set to 1; otherwise, it is 0. The segmentation error E _seg only measures the percentage of points designated as error labels.

Then, calculate the mean of each error of the two data sets: full and partial scans. The error of the method of the present invention can be seen in Table 1: It can be observed that all errors are relatively low, indicating that the accuracy of the predicted motion is very high; in addition, the method of the present invention achieves comparable results for both complete and partial scans. , Which shows the robustness of the method of the present invention.

Table 1: Motion prediction error of the method of the present invention and BaseNet

Compared with BaseNet, in order to show the advantage of using MAPP-NET to generate displacement maps before predicting all motion-related parameters, the present invention compares with the benchmark and calls it "BaseNet". BaseNet takes the point cloud P ₀ as input, and uses a standard network architecture to directly estimate the segmentation S and the motion parameter M. The network consists of an encoder/decoder pair and a fully connected layer, as shown in Figure 8. The loss function of BaseNet is:

L(S, M)=L _seg (S)+L _motion (M);

Two loss function terms defined by formula (1) are used.

Table 1 shows the comparison between MAPP-NET and BaseNet between complete and partial scans. It can be found that the segmentation error E _seg and motion type error E _{type of} BaseNet are equivalent to the method of the present invention, but its axis direction error E _angle and axis position The error E _dist is at least 5% higher than that of the present invention. The main reason for the difference in results may be that segmentation and classification tasks are simpler than motion prediction. Network architectures like PointNet++ have shown that good results can be achieved on those two tasks, but for motion prediction, a single input frame may lead to ambiguity in speculation.

In the deep learning framework of the present invention, the present invention uses a recurrent neural network to generate a sequence of multiple frames describing motion, which more restricts inference. As a result, the prediction of motion parameters is more accurate.

Figure 9 shows a visual comparison between the method of the present invention and BaseNet on some examples. Because BaseNet does not generate motion frames, it shows the axis of segmentation and prediction on the input point cloud. However, for the method of the present invention, the predicted segmentation and the axis of 5 consecutive frames are shown together. The moving parts of the generated frame are expressed in lighter colors when they are closer to the input frame. For the translation and rotation of complete and partial scans, BaseNet is more likely to predict the wrong type of motion, resulting in prediction errors of complex shapes. For example, for the keyboard drawer under the desk, the direction of the sliding motion is incorrectly predicted.

In order to further verify the loss function of the present invention, three ablation research experiments were performed on the complete scan.

The importance of L _rec and L _disp. In order to show _{the importance of L rec} and L _disp , these two items are the loss function items of the predicted displacement map D _t or the point cloud P _t compared with the benchmark. The result of the method of the present invention is compared with that without adding any of these two items. Compare the results of the items. The second and third rows of Table 2 show the error values obtained in this experiment, compared to the sixth row using the complete loss function of the present invention. Compared with the complete version of the loss function of the present invention, removing one of L _rec and L _disp increases the error, and more importantly, as shown in Figure 10, the intermediate prediction sequence compares the results obtained by using the complete loss function , Its quality is worse.

方法method	E _angle E _angle	E _dist E _dist	E _seg E _seg	E _type E _type
w/oL _rec w/oL _rec	0.3290.329	0.0190.019	0.0480.048	0.0950.095
w/oL _disp w/oL _disp	0.2180.218	0.1660.166	0.0550.055	0.0710.071
w/oL _mob w/oL _mob	0.5130.513	0.5650.565	0.1080.108	0.0680.068

w/oL _mob和L _seg w/oL _mob and L _seg	0.6230.623	0.4990.499	0.2130.213	0.0650.065
结果result	0.2090.209	0.1530.153	0.0380.038	0.0470.047

Table 2: Comparing the complete MAPP-NET and the ablation experiment of the method of removing a certain loss function term, pay attention to the importance of all terms of the loss function to get the lowest error (last row).

There is no reconstruction loss term L _rec , although thanks to the displacement loss term L _disp , the motion of the moving part looks reasonable, and points (especially those on the reference part) are more likely to move to unexpected positions.

On the other hand, when the displacement loss term L _{disp is} removed, the movement of the points of the moving part becomes inconsistent, which causes the distortion of the moving part. In contrast, the complete method of the present invention can predict an accurate and smooth movement of the moving part and can also keep the reference part unchanged.

The importance of L _mob and L _seg. In the second ablation experiment, verify _{the usage of the motion loss term L mob} and the segmentation loss term L _seg , by combining the complete network of the present invention with the predicted displacement map to infer the motion parameter M and segmentation S instead of passing through additional layers of the network To predict their methods for comparison. Specifically, the network of the present invention generates a point cloud motion sequence P _t _{from the displacement map D t} , which can be directly used to fit the motion parameter M; however, for the segmentation S, the present invention can filter some points, depending on whether they are Move more than the appropriate threshold θ of the displacement map P _t , thereby dividing the points into moving and stationary (reference) points.

In the experiment, the threshold θ=0.01 is used to determine the division. In order to fit the motion axes of each pair of adjacent frames, the optimal rigid transformation matrix is calculated, which has the smallest mean variance for transforming one frame to the next, and extracts: axis direction of translation, axis direction of rotation, and position. For the evaluation, calculated translational error E _angle in the axial direction, the axial direction of the rotation axis position and the error E _angle error E _dist. Finally, calculate the mean error of all adjacent frames of all test sequences. The fourth and fifth rows of Table 2 show the error values of this experiment.

This motion fitting method is very sensitive to noise and causes large errors; however, the prediction obtained by using the complete network of the present invention is more stable and provides better results. The comparison between the result of the motion parameter fitting and the result of the present invention is shown in FIG. 11, it can be seen that there is no motion loss term L _mob and segmentation loss term L _seg , and some outliers will cause large errors in axis fitting. However, for the example without the motion loss term L _mob , although the segmentation seems to be correct, the noise of the displacement of different points can also cause large errors in the axis fitting. For example, in the wheel shown in the second row, most of the points do not move except for the lower part of the object, which causes the position of the fitted axis to deviate from the center of the wheel.

Define the importance of _{L rec} and L _mob based on P _t. In addition, because the network of the present invention provides the displacement map D _t and the point cloud P _t as intermediate outputs, all loss terms can be defined _{based on either D t} and P _t except for the displacement loss L _disp. Therefore, a third ablation experiment was performed to show the definition of the reconstruction term L _rec and the motion loss term L _mob _{based on P t.} As the method of the present invention does, this definition is better than the definition on D _t . Table 3 shows this point proved by this experiment. The main reason for this result is that the displacement map D _t is defined between two adjacent point cloud frames P _t-1 and P _t. Therefore, the error defined on D _t _{affects the generation of P t} and also affects D _t+1 . If only the reconstruction loss item L _rec _{is independently measured on each D t} , then the accumulated error cannot be accurately taken into consideration in the learning process. Conversely, P _t is obtained by applying all previous displacement maps

To the input point cloud P ₀ . Therefore, by defining the reconstruction loss term L _rec _{on each P t} , the loss term provides a more global limit in the error of the generated sequence. The definition of the motion loss term L _mob is also applied with similar parameters.

方法method	E _angle E _angle	E _dist E _dist	E _seg E _seg	E _type E _type
L _mobon D _t L _mob on D _t	0.4410.441	0.2140.214	0.1700.170	0.0730.073
L _recon D _t L _rec on D _t	0.3500.350	0.2030.203	0.1450.145	0.0740.074
both on D _t both on D _t	0.5690.569	0.3310.331	0.3080.308	0.0950.095
结果result	0.2090.209	0.1530.153	0.0380.038	0.0470.047

Table 3: Use D _t instead of P _{t to} define the comparison of reconstruction loss term L _rec and motion loss term L _mob. The last line corresponds to the method of the present invention to define both loss terms on P _t , and obtain the lowest error

As emphasized by the experiment, the method of the present invention exhibits high accuracy in predicting the motion of an object with a single moving part. Therefore, the method of the present invention can be regarded as a good basic module for predicting the motion of an object in more situations. For example, FIG. 2 and FIG. 7 show the potential of the method of the present invention for detecting multiple moving parts in an object, including movement occurring in a parallel manner or movement in a hierarchical sequence. However, for this more complex task, further experiments are needed to quantitatively evaluate the method of the present invention, and it may be necessary to construct a data set of objects with multiple movable parts and their known motion parameters and segmentation. In addition, the current data set of the present invention assumes that the shape is a meaningful orientation and the data set is relatively small, consisting of 276 motion units. Another more direct improvement method that can be applied to more complex scenes is to enhance the data set of the present invention, by applying random transformation to the motion unit, so that the network of the present invention can operate in a pose-invariant manner, or use partial scanning To train the network to improve its anti-interference ability.

Another direction of future work is to use the motion predicted by the method of the present invention to synthesize the motion of the input shape. As part of this larger motion synthesis problem, an interesting sub-problem is learning how to complement the geometry of an object. The geometry of the object may be lost when motion occurs. For example, a drawer drawn from a cabinet should show its interior. , If the shape is scanned or the interior is not modeled, the interior will be lost. One possible method is to learn how to synthesize the missing geometry from the predicted motion and existing part geometry. This method requires at least to build a training set in the form of pre-segmented objects and model all its internal details.

The present invention introduces a loss function composed of a reconstruction loss function and a displacement loss function, which ensures that the shape of the object is maintained and the motion is accurately predicted. The reconstruction loss measures the extent to which the shape of the object is maintained during the movement, while the displacement loss measures the extent to which the displacement field describes the movement. Show that: Compared with other alternative methods, this loss function can bring the most accurate prediction. The use of the Recurrent Neural Network (RNN) architecture allows the present invention not only to predict the subsequent frames of motion, but also enables the present invention to determine when the motion stops, in addition to inferring the motion parameters, it can also infer the range of the predicted motion, such as: how far the door can be opened .

The present invention shows that MAPP-NET can predict the movement of object parts very accurately. These objects are a variety of objects with different types of motion (including rotation and translation transformation), which can be a complete point cloud of a 3D object or a partial scan. the result of. In addition, the rationality of the method of the present invention was verified and compared with the benchmark method. Finally, the present invention shows the preliminary results, that is, the network proposed by the present invention has the potential to segment objects composed of multiple moving parts in a hierarchical manner, and predict the movement of multiple components at the same time.

Technical effect:

(1) The present invention classifies the functional visibility analysis problem as segmenting the input geometry and marking the motion type and parameters of each segmentation; therefore, the deep neural network proposed by the present invention learns from pre-segmentation and three-dimensional shapes of known motions , And then perform segmentation and prediction.

(2) The deep neural network MAPP-NET of the present invention predicts the movement of a part from a three-dimensional point cloud shape, but does not need to segment the shape; the present invention trains a deep learning model to simultaneously segment the input shape and predict its parts The movement is thus achieved.

(3) The network of the present invention is trained on the motion unit data set with the reference segmentation and motion parameters identified; once the training is completed, it can be used to predict the motion of a single unsegmented point cloud representing a static state of the object .

(4) The present invention introduces a loss function composed of a reconstruction loss function and a displacement loss function, which ensures that the shape of the object is maintained while also accurately predicting the movement; the reconstruction loss measures the degree to which the shape of the object is maintained during the movement. The displacement loss measures the extent to which the displacement field describes the motion; it shows that compared with other alternative methods, this loss function can bring the most accurate prediction.

(5) The use of the Recurrent Neural Network (RNN) architecture allows the present invention not only to predict the subsequent frames of motion, but also enables the present invention to determine when the motion stops, in addition to inferring the motion parameters, it can also infer the range of the predicted motion, such as: gate energy How big is it?

(6) The present invention shows that MAPP-NET can predict the movement of object parts very accurately. These objects are a variety of objects with different motion types (including rotation and translation transformation), which can be a complete point cloud of 3D objects or It is the result of partial scanning.

(7) The present invention shows preliminary results. The network proposed by the present invention has the potential to segment objects composed of multiple moving parts in a hierarchical manner, and predict the movement of multiple components at the same time.

Further, as shown in FIG. 12, based on the above-mentioned deep neural network-based motion prediction method, the present invention also provides an intelligent terminal correspondingly. The intelligent terminal includes a processor 10, a memory 20 and a display 30. FIG. 7 only shows part of the components of the smart terminal, but it should be understood that it is not required to implement all the shown components, and more or fewer components may be implemented instead.

In some embodiments, the memory 20 may be an internal storage unit of the smart terminal, such as a hard disk or a memory of the smart terminal. In other embodiments, the memory 20 may also be an external storage device of the smart terminal, such as a plug-in hard disk equipped on the smart terminal, a smart media card (SMC), and a secure digital (Secure Digital). Digital, SD) card, flash card, etc. Further, the memory 20 may also include both an internal storage unit of the smart terminal and an external storage device. The memory 20 is used to store application software and various types of data installed on the smart terminal, such as the program code of the installed smart terminal. The memory 20 can also be used to temporarily store data that has been output or will be output. In an embodiment, a motion prediction program 40 based on a deep neural network is stored in the memory 20, and the motion prediction program 40 based on a deep neural network can be executed by the processor 10, so as to realize the motion based on the deep neural network in this application. method of prediction.

The processor 10 may be a central processing unit (CPU), microprocessor or other data processing chip in some embodiments, and is used to run the program code or process data stored in the memory 20, for example Perform the motion prediction method based on the deep neural network and so on.

In some embodiments, the display 30 may be an LED display, a liquid crystal display, a touch liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, and the like. The display 30 is used for displaying information on the smart terminal and for displaying a visualized user interface. The components 10-30 of the smart terminal communicate with each other via a system bus.

In an embodiment, when the processor 10 executes the motion prediction program 40 based on the deep neural network in the memory 20, the following steps are implemented:

Use data sets to train deep neural networks;

Inputting a three-dimensional point cloud to the deep neural network;

The present invention also provides a storage medium, wherein the storage medium stores a motion prediction program based on a deep neural network, and the motion prediction program based on the deep neural network is executed by a processor to realize the motion based on the deep neural network. The steps of the prediction method; the details are as described above.

In summary, the present invention provides a motion prediction method and smart terminal based on a deep neural network, the method includes: training a deep neural network using a data set; inputting a three-dimensional point cloud to the deep neural network; The deep neural network outputs the first part and the second part of the 3D point cloud, using the first part as the motion subunit, and the second part as the reference part of the motion unit; complete network prediction based on the output of the 3D point cloud , Output motion information, the motion information includes motion segmentation, motion axis and motion type. The present invention realizes the prediction result of simultaneous movement and components of various hinged objects that are unstructured and may be partially scanned in a static state, and can predict the movement of the object components very accurately.

Of course, those of ordinary skill in the art can understand that all or part of the processes in the methods of the above-mentioned embodiments can be implemented by instructing relevant hardware (such as a processor, a controller, etc.) through a computer program, and the program can be stored in a computer program. In a computer-readable storage medium, the program may include the processes of the foregoing method embodiments when executed. The storage medium mentioned may be a memory, a magnetic disk, an optical disk, and the like.

It should be understood that the application of the present invention is not limited to the above examples. For those of ordinary skill in the art, improvements or changes can be made based on the above description, and all these improvements and changes should fall within the protection scope of the appended claims of the present invention.

Claims

A method for motion prediction based on a deep neural network, characterized in that the method for motion prediction based on a deep neural network includes:

Use data sets to train deep neural networks;

Inputting a three-dimensional point cloud to the deep neural network;

The deep neural network outputs the first part and the second part of the three-dimensional point cloud, using the first part as a motion subunit, and the second part as a reference part of the motion unit;

The network prediction is completed according to the output of the three-dimensional point cloud, and the motion information is output. The motion information includes the motion segmentation, the motion axis, and the motion type.
The motion prediction method based on a deep neural network according to claim 1, wherein the loss function used when training the deep neural network is:

Among them, D t represents the displacement map, S represents the segmentation, M represents the fitting motion parameter, L rec is the reconstruction error, L disp is the displacement error, L seg is the segmentation error, and L mob is the regression error of the motion parameter;

Reconstruction error represents the degree of distortion of the shape, displacement error represents the accuracy of the moving part, segmentation error and regression error describe the correctness of the motion information, including the division of motion and immobility, the position, direction and type of motion of the motion axis.
The motion prediction method based on a deep neural network according to claim 2, wherein L rec describes the geometric error between the predicted point cloud after motion and the real point cloud after motion;

Divide the point cloud P 0 into a reference part and a motion part, after experiencing motion
Later, the reference part remains stationary, and the moving part is rigid. Among them, P t-1 and P t represent two adjacent point cloud frames, so L rec is divided into two parts:

Is the error of the reference part,
Is the error of the moving part;

Is the sum of the squares of the error distance of each point:

among them,
Is the real position of point p;

The composition is:

Among them, L shape is used to punish points that do not match the target shape, and L density is the local point density of the predicted point cloud and the target point cloud.
Refers to the moving part of the point cloud of the t-th frame generated by the deep neural network,
Refers to the moving part of the point cloud of the correct t-th frame. gt is the abbreviation of ground truth, which means correct.
The motion prediction method based on a deep neural network according to claim 3, wherein the difference between the motion information and the target motion information is predicted by an error loss function; the motion type includes rotation motion and translation motion.
The motion prediction method based on deep neural network according to claim 4, characterized in that, for rotating motion, the loss function is as follows:

It describes whether the predicted displacement is perpendicular to the real axis of motion. The specific calculation formula is:

Among them, dot means dot product,
Represents the displacement map of the point cloud p at the t-th frame, d gt is the direction of the correct axis of motion;
It is the deviation of the rotation angle of each point, and the rotation angle of all points is the same. The specific calculation formula is:

Among them, σ is a constant, proj(p) represents the distance between the point p and the projection point that projects the point p onto the correct axis of motion,

It is required that each point is the same distance from the real axis of rotation before and after rotation, and the circularity of its motion is restricted. The specific calculation formula is:
The motion prediction method based on a deep neural network according to claim 5, wherein for translational motion, the loss function is as follows:

It describes whether the predicted displacement is parallel to the real axis of motion. The specific calculation formula is:

It is required that the distance moved by each point is the same, and the variance is 0. The specific calculation formula is:
The motion prediction method based on a deep neural network according to claim 6, wherein the motion information loss function is:

Among them, d, x and t are the direction of the motion axis, the position of the motion axis and the type of motion, respectively, d gt is the correct direction of the motion axis, x gt is the correct position of the motion axis, t gt is the correct type of motion, and H is the cross entropy .
The motion prediction method based on a deep neural network according to claim 1, wherein the number of points in the three-dimensional point cloud is 1024.
An intelligent terminal, characterized in that, the intelligent terminal includes: a memory, a processor, and a motion prediction program based on a deep neural network that is stored in the memory and can run on the processor, and the deep neural network-based motion prediction program When the motion prediction program of the network is executed by the processor, the steps of the motion prediction method based on a deep neural network according to any one of claims 1-8 are realized.
A storage medium, characterized in that the storage medium stores a motion prediction program based on a deep neural network, and when the motion prediction program based on a deep neural network is executed by a processor, it is implemented as described in any one of claims 1-8. The steps of the motion prediction method based on deep neural network are described.