CN114241013B - Object anchoring method, anchoring system and storage medium - Google Patents

Object anchoring method, anchoring system and storage medium

Info

Publication number
CN114241013B
CN114241013B (application CN202210173770.0A)
Authority
CN
China
Prior art keywords
pose
model
neural network
training
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210173770.0A
Other languages
Chinese (zh)
Other versions
CN114241013A (en)
Inventor
张旭
毛文涛
邓伯胜
于天慧
蔡宝军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yingchuang Information Technology Co ltd
Original Assignee
Beijing Yingchuang Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yingchuang Information Technology Co ltd filed Critical Beijing Yingchuang Information Technology Co ltd
Priority to CN202210173770.0A priority Critical patent/CN114241013B/en
Publication of CN114241013A publication Critical patent/CN114241013A/en
Application granted granted Critical
Publication of CN114241013B publication Critical patent/CN114241013B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75Determining position or orientation of objects or cameras using feature-based methods involving models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application provides an object anchoring method, an anchoring system and a storage medium, wherein the object anchoring method comprises the following steps: training according to the acquired image sequence containing the object of interest to obtain a three-dimensional model of the object of interest and a six-degree-of-freedom pose estimation neural network model for estimating the pose of the object; and performing pose estimation on the object of interest according to the three-dimensional model and the six-degree-of-freedom pose estimation neural network model to obtain the pose of the object of interest, and superposing virtual information on the object of interest according to the pose to realize the rendering of the object of interest. The method and the device can solve the problems that inaccuracy, illumination and environment have a great influence on the algorithm during user-defined object recognition and 3D tracking, thereby realizing a method for information gain and display of user-defined objects on the mobile terminal, wherein the displayed information corresponds to the 3D position and attitude of the object.

Description

Object anchoring method, anchoring system and storage medium
Technical Field
The application belongs to the technical field of image recognition, and particularly relates to an object anchoring method, an anchoring system and a storage medium.
Background
Common object recognition and 3D position and posture tracking deep learning algorithms require a large amount of manually labeled data, and it is difficult to guarantee accuracy for user-defined object training under various complex illumination conditions and environments. In the prior art, feature engineering methods relying on features such as SIFT and SURF are used; although these features have a certain robustness to illumination and background, they remain sensitive to more complex illumination and backgrounds, and tracking easily fails. Many existing methods also require the user to give an initial pose and to provide an accurate 3D model, and cannot track objects for which no 3D model is available.
Disclosure of Invention
To overcome, at least to some extent, the problems in the related art, the present application provides an object anchoring method, an anchoring system, and a storage medium.
According to a first aspect of embodiments herein, there is provided a method of anchoring an object, comprising the steps of:
training according to the acquired image sequence containing the interested object to obtain a three-dimensional model of the interested object and a six-degree-of-freedom pose estimation neural network model for estimating the pose of the object;
and performing pose estimation on the interested object according to the three-dimensional model of the interested object and the six-degree-of-freedom pose estimation neural network model for object pose estimation to obtain the pose of the interested object, and superposing virtual information on the interested object according to the pose to realize the rendering of the interested object.
In the above object anchoring method, the modeling is completed based on deep learning or computer vision in the process of obtaining the three-dimensional model of the object of interest through training according to the obtained image sequence containing the object of interest.
Further, the process of completing modeling based on deep learning is as follows:
extracting the characteristics of each frame of image, and estimating the camera initialization pose corresponding to each frame of image;
acquiring a mask of each frame of image by utilizing a pre-trained significance segmentation network;
model training and inference are performed to obtain a mesh of the model.
Further, the process of performing model training and inference is as follows:
randomly sampling K pixel points on the image I, the position coordinates of each pixel point being p;
converting the position coordinates p of each pixel point into imaging plane coordinates x by using the camera intrinsic parameters;
inputting the imaging plane coordinates and the optimized camera pose into a neural network F_c, and extracting the inter-frame color difference feature ΔC; adding the inter-frame color difference feature ΔC to the original image to compensate the color difference between frames, wherein ΔC is the output of the neural network F_c for the sampled pixels;
inputting the camera initialization pose corresponding to the image into a neural network F_T to obtain the optimized pose T';
obtaining the optimized initial camera position o according to the optimized pose T', wherein the optimized initial camera position is o = trans(T'), and trans is a function that takes the position coordinates of a pose;
emitting, from the optimized initial camera position o, a ray in a direction w passing through the pixel position coordinates, wherein the ray direction is w = (x − o) / ‖x − o‖;
sampling M points along the direction w, the coordinates of these M points being x_1, …, x_M;
predicting, with a deep learning network F_s, the probability that each of the M points lies on the surface of the implicit equation (i.e., the implicit function TSDF);
wherein a point is judged to be on the surface of the implicit equation when its predicted value satisfies the threshold ε, the surface point being taken at the smallest such index m along the ray (argmin over m);
feeding the points predicted to be on the surface of the implicit equation into a neural renderer R to obtain the predicted RGB color values ĉ;
calculating the squared loss of the pixel differences according to the predicted ĉ values and the colors of the K sampled pixel points;
wherein the squared loss of the pixel differences L is
L = λ1·L_rgb + λ2·(L_bg + L_fg) + λ3·L_edge,
in the formula, λ1, λ2 and λ3 represent coefficients, L_rgb represents the difference of the image pixels, L_bg and L_fg represent the difference of the background mask and the difference of the foreground mask, whose sum forms the mask term, and L_edge represents the difference of the edges;
wherein the difference of the image pixels is L_rgb = Σ_{p∈P} ‖ĉ_p − c_p‖², in which P denotes all of the selected k points, ĉ_p the predicted color and c_p the acquired color of pixel point p;
the difference of the background mask is L_bg = Σ_{p∈P_out} BCE(m̂_p, 0), in which P_out denotes the points, among all of the selected k points, that lie outside the mask, and m̂_p denotes the estimated mask value of point p;
the difference of the foreground mask is L_fg = Σ_{p∈P_in} BCE(m̂_p, 1), in which BCE represents the binary cross-entropy loss and P_in denotes the points, among all of the selected k points, that lie inside the mask;
the difference of the edges L_edge applies an additional loss to the sampled points lying on the boundary ∂M of the mask, in which ∂M represents the boundary of the mask;
when the model performs inference, 3D points are input into the combined model of the neural network F_c, the deep learning network F_s and the neural network F_T; the combined model is used to obtain the points lying on the surface, and a mesh is formed from these points.
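To make the training procedure above concrete, a minimal PyTorch sketch of one optimization step is given below. It is an illustration only: the module names pose_net, sdf_net and renderer, the depth range and the surface test are assumptions, the inter-frame color compensation network is omitted, and the patent does not disclose concrete architectures or hyperparameters.

```python
import torch

def train_step(image, mask, K_intr, T_init, pose_net, sdf_net, renderer,
               n_pix=1024, n_samples=64, eps=1e-2, lam=(1.0, 0.5)):
    """One illustrative optimization step of the implicit-surface modeling stage."""
    H, W, _ = image.shape
    # 1. Randomly sample pixel points and lift them to the imaging plane with the intrinsics.
    px = torch.randint(0, W, (n_pix,))
    py = torch.randint(0, H, (n_pix,))
    pix = torch.stack([px, py, torch.ones_like(px)], dim=-1).float()
    x_plane = (torch.linalg.inv(K_intr) @ pix.T).T

    # 2. Refine the initialization pose and take the camera position from it.
    T_opt = pose_net(T_init)          # optimized 4x4 pose
    origin = T_opt[:3, 3]             # optimized camera position

    # 3. Cast a ray through every sampled pixel and sample M points along it.
    dirs = torch.nn.functional.normalize(x_plane @ T_opt[:3, :3].T, dim=-1)
    t_vals = torch.linspace(0.1, 4.0, n_samples)          # assumed depth range
    pts = origin + dirs[:, None, :] * t_vals[None, :, None]

    # 4. Query the implicit (TSDF-like) network and keep the first near-surface sample.
    sdf = sdf_net(pts.reshape(-1, 3)).reshape(n_pix, n_samples)
    first = (sdf.abs() < eps).float().argmax(dim=-1)       # smallest index passing the test
    surf = pts[torch.arange(n_pix), first]

    # 5. Render predicted colors and compare them with the observed pixels and mask.
    c_hat = renderer(surf)                                  # predicted RGB values
    c_gt = image[py, px]
    m_hat = torch.sigmoid(-sdf.min(dim=-1).values)          # crude foreground score (assumption)
    l_rgb = ((c_hat - c_gt) ** 2).sum(-1).mean()
    l_mask = torch.nn.functional.binary_cross_entropy(m_hat, mask[py, px].float())
    return lam[0] * l_rgb + lam[1] * l_mask                 # edge weighting term omitted
```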
Further, the process of completing the modeling based on the computer vision is as follows:
performing feature extraction and matching by adopting a visual algorithm or a deep learning algorithm;
estimating the pose of the camera;
segmenting the salient objects in the image sequence;
reconstructing a dense point cloud;
using the reconstructed dense point cloud as the input of mesh generation, and reconstructing the mesh of the object by using a reconstruction algorithm;
finding the texture coordinates corresponding to the mesh vertices according to the camera poses and their corresponding images, to obtain the texture map of the mesh;
and obtaining a three-dimensional model from the mesh of the object and the texture map of the mesh.
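For the computer-vision branch, the meshing step can be sketched with an off-the-shelf library. The snippet below uses Open3D Poisson reconstruction; the file names, normal-estimation radius and Poisson depth are illustrative assumptions, and the dense point cloud is assumed to come from the preceding multi-view reconstruction.

```python
import numpy as np
import open3d as o3d

# Load the dense point cloud produced by the multi-view reconstruction step
# (the file name is an illustrative assumption).
pcd = o3d.io.read_point_cloud("dense_points.ply")
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.05, max_nn=30))

# Poisson surface reconstruction turns the dense point cloud into a mesh.
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
    pcd, depth=9)

# Remove low-density vertices that Poisson tends to hallucinate far from the data.
keep = np.asarray(densities) > np.quantile(np.asarray(densities), 0.05)
mesh.remove_vertices_by_mask(~keep)

o3d.io.write_triangle_mesh("object_mesh.ply", mesh)
```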
In the object anchoring method, the specific process of training the six-degree-of-freedom pose estimation neural network model for object pose estimation according to the acquired image sequence containing the object of interest comprises the following steps:
obtaining a synthetic data set by adopting a PBR rendering method according to the three-dimensional model and the preset scene model of the object; the synthetic dataset includes synthetic training data;
obtaining a real data set by adopting a model reprojection segmentation algorithm according to the camera pose and the object pose; the real dataset comprises real training data;
and training the six-degree-of-freedom pose estimation neural network based on deep learning by utilizing the synthetic training data and the real training data to obtain a six-degree-of-freedom pose estimation neural network model.
Further, the specific process of obtaining the synthetic data set by using the PBR rendering method according to the three-dimensional model and the preset scene model of the object is as follows:
reading a three-dimensional model and a preset scene model of an object;
carrying out object pose randomization, rendering camera pose randomization, material randomization and illumination randomization by adopting a PBR rendering method to obtain a series of image sequences and corresponding annotation labels; the annotation labels are the category, the position and the six-degree-of-freedom pose.
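A possible shape of the randomization loop is sketched below. The render(...) callback stands in for whatever PBR renderer is driven (e.g., Blender or Unity through its scripting interface) and is a hypothetical placeholder, as are the parameter ranges.

```python
import random
import numpy as np

def random_pose(t_range=1.0):
    """Uniformly random pose: random unit quaternion plus bounded translation."""
    q = np.random.normal(size=4)
    q /= np.linalg.norm(q)                       # assumption: the renderer accepts quaternions
    t = np.random.uniform(-t_range, t_range, size=3)
    return q, t

def generate_synthetic_set(render, object_model, scene_model, n_images=10000):
    """Domain-randomized PBR data generation; 'render' is a hypothetical callback."""
    dataset = []
    for _ in range(n_images):
        sample = {
            "object_pose": random_pose(),
            "camera_pose": random_pose(),
            "material": {"roughness": random.uniform(0.1, 0.9),
                         "metallic": random.uniform(0.0, 1.0)},
            "lighting": {"intensity": random.uniform(100, 2000),
                         "direction": list(np.random.normal(size=3))},
        }
        image, label = render(object_model, scene_model, **sample)
        # label carries the category, 2D position and six-degree-of-freedom pose.
        dataset.append((image, label))
    return dataset
```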
Further, the specific process of obtaining the real data set by using the model re-projection segmentation algorithm according to the camera pose and the object pose is as follows:
acquiring an image sequence, a camera pose and an object pose, and segmenting an object in a real image;
synthesizing the real data with discrete poses into data with denser and continuous poses, so as to obtain real images and their corresponding annotation labels; the annotation labels are the category, the position and the six-degree-of-freedom pose.
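A minimal sketch of reprojection-based automatic labeling is shown below using OpenCV. It assumes the model vertices, the camera intrinsics and the object-to-camera pose are already known; the convex-hull mask is a simplification of a full mesh rasterization, and the exact segmentation used by the patent may differ.

```python
import cv2
import numpy as np

def reproject_mask(vertices, R_obj2cam, t_obj2cam, K, dist, image_shape):
    """Project the known 3D model into the frame and rasterize a foreground mask."""
    rvec, _ = cv2.Rodrigues(R_obj2cam)
    pts2d, _ = cv2.projectPoints(vertices.astype(np.float64), rvec,
                                 t_obj2cam.astype(np.float64), K, dist)
    pts2d = pts2d.reshape(-1, 2).astype(np.int32)

    mask = np.zeros(image_shape[:2], dtype=np.uint8)
    hull = cv2.convexHull(pts2d)          # convex approximation of the silhouette
    cv2.fillConvexPoly(mask, hull, 255)
    return mask

# Each real frame then yields (image, mask, class, six-DoF pose) as a training sample,
# where the pose label is simply (R_obj2cam, t_obj2cam).
```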
Furthermore, the specific process of training the six-degree-of-freedom pose estimation neural network based on deep learning by using the synthetic training data and the real training data to obtain the six-degree-of-freedom pose estimation neural network model is as follows:
inputting an image, the 2D coordinates of a plurality of feature points extracted from the object, the 3D coordinates corresponding to the feature points, and an image mask;
training the six-degree-of-freedom pose estimation neural network by adopting the following loss function to obtain a six-degree-of-freedom pose estimation neural network model;
the loss function needed when training the six-degree-of-freedom pose estimation neural network is as follows:
L = λ1·L_cls + λ2·L_box + λ3·L_2D + λ4·L_3D + λ5·L_mask + λ6·L_proj
in the formula, L denotes the loss; λ1 to λ6 all denote coefficients; L_cls denotes the classification loss, L_box denotes the bounding box loss, L_2D denotes the 2D loss, L_3D denotes the 3D loss, L_mask denotes the mask loss, and L_proj denotes the projection loss;
wherein the classification loss L_cls is computed over the classification information of the i-th detection anchor point and the information of the j-th background feature, using the detection anchor points, the background anchor points, the category truth values and the features proposed by the neural network;
the bounding box loss L_box is computed from the coordinate features of the i-th detection anchor point and the coordinate truth values of the detection box;
the 2D loss L_2D is computed from the 2D coordinate features, the truth values of the 2D feature points of the object, and the feature points and mask predicted by the neural network;
the 3D loss L_3D is computed from the 3D coordinate features, the truth values of the 3D feature points of the object, and the feature points and mask predicted by the neural network;
the mask loss L_mask is computed from the i-th foreground feature and the j-th background feature, fg denoting the foreground and bg denoting the background;
the projection loss L_proj is computed from the difference between the 3D features projected to 2D and the 2D truth values, together with the feature points and mask predicted by the neural network.
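The weighted combination of the six loss terms can be sketched as follows. The individual terms are simplified stand-ins (cross-entropy, smooth L1, L1 and binary cross-entropy) chosen for illustration and are not the exact per-term formulas of the patent; the 3D key points are assumed to be expressed in camera coordinates.

```python
import torch
import torch.nn.functional as F

def pose_net_loss(pred, gt, lam=(1.0, 1.0, 1.0, 1.0, 1.0, 1.0)):
    """Illustrative combined loss: classification, box, 2D, 3D, mask and projection terms."""
    l_cls = F.cross_entropy(pred["cls_logits"], gt["cls"])
    l_box = F.smooth_l1_loss(pred["boxes"], gt["boxes"])
    l_2d = F.l1_loss(pred["kpts_2d"], gt["kpts_2d"])
    l_3d = F.l1_loss(pred["kpts_3d"], gt["kpts_3d"])
    l_mask = F.binary_cross_entropy_with_logits(pred["mask_logits"], gt["mask"])

    # Projection consistency: project the predicted 3D key points (camera coordinates)
    # with the intrinsics K and compare them against the 2D ground truth.
    proj = pred["kpts_3d"] @ gt["K"].T
    proj = proj[..., :2] / proj[..., 2:].clamp(min=1e-6)
    l_proj = F.l1_loss(proj, gt["kpts_2d"])

    terms = (l_cls, l_box, l_2d, l_3d, l_mask, l_proj)
    return sum(w * t for w, t in zip(lam, terms))
```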
In the object anchoring method, the rendering of the interested object is realized through a mobile terminal or through the mixing of the mobile terminal and a cloud server;
the process realized by the mobile terminal is as follows:
before tracking is started, accessing a cloud server, downloading an object model, a deep learning model and a feature database of a user, and then performing other calculations on a mobile terminal;
the mobile terminal reads camera data from the equipment, and the object pose is obtained through the detection or recognition neural network and the six-degree-of-freedom pose estimation neural network;
rendering the content to be rendered according to the pose of the object;
the process of realizing the mixing of the mobile terminal and the cloud server is as follows:
inputting an image sequence in the mobile terminal, and performing significance detection on each frame of image;
uploading the significance detection area to a cloud server for retrieval to obtain information of the object and a deep learning model related to the information, and loading the information to the mobile terminal;
estimating the pose of the object at the mobile terminal to obtain the pose of the object;
and rendering the content to be rendered according to the pose of the object.
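A schematic of the hybrid mobile/cloud loop is given below. The endpoint URL, the payload fields and the local helpers saliency_detect, estimate_pose and render_overlay are hypothetical placeholders; only the ordering of the steps reflects the flow described above.

```python
import requests  # hypothetical cloud access; any HTTP client would do

def anchor_loop(camera, saliency_detect, estimate_pose, render_overlay,
                cloud_url="https://cloud.example.com/retrieve"):
    model_cache = {}
    for frame in camera:                       # image sequence from the device camera
        roi = saliency_detect(frame)           # per-frame saliency detection
        if roi is None:
            continue
        key = roi.tobytes()[:64]               # crude cache key (illustrative only)
        if key not in model_cache:
            # Upload the salient region, retrieve the object info and its deep learning model.
            resp = requests.post(cloud_url, files={"roi": roi.tobytes()})
            model_cache[key] = resp.json()     # e.g. {"info": ..., "model": ...}
        obj = model_cache[key]
        pose = estimate_pose(frame, obj["model"])      # on-device 6-DoF pose estimation
        render_overlay(frame, pose, obj["info"])       # anchor virtual content to the pose
```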
According to a second aspect of the embodiments of the present application, there is also provided an object anchoring system, which includes a cloud training unit and an object pose calculation and rendering unit;
the cloud training unit is used for training according to the acquired image sequence containing the interested object to obtain a three-dimensional model of the interested object and a six-degree-of-freedom pose estimation neural network model for estimating the pose of the object;
the object pose calculation and rendering unit is used for estimating the pose of the interested object according to the three-dimensional model of the interested object and the six-degree-of-freedom pose estimation neural network model for estimating the pose of the interested object, and superposing virtual information on the interested object to realize the rendering of the interested object;
the cloud training unit comprises a modeling unit, a synthetic training data generating unit, a real training data generating unit and a training algorithm unit;
the modeling unit is used for training according to the acquired image sequence containing the interested object to obtain a three-dimensional model of the interested object;
the synthetic training data generation unit is used for obtaining a synthetic data set according to a three-dimensional model of an object and a preset scene model, and the synthetic data set comprises synthetic training data;
the real training data generation unit is used for obtaining a real data set according to the camera pose and the object pose, and the real data set comprises real training data;
and the training algorithm unit is used for training the six-degree-of-freedom pose estimation neural network based on deep learning according to the synthetic training data and the real training data to obtain a six-degree-of-freedom pose estimation neural network model.
According to a third aspect of embodiments of the present application, there is also provided a storage medium having an executable program stored thereon, which when called, performs the steps in the object anchoring method described in any one of the above.
According to the above embodiments of the present application, at least the following advantages are obtained: in the object anchoring method, the model that performs recognition and 3D position and posture tracking from 2D images is trained with both synthesized data and real data, which can solve the problem that inaccuracy, illumination, environment and the like have a great influence on the algorithm during user-defined object recognition and 3D tracking, thereby realizing a method for obtaining and displaying user-defined object information on the mobile terminal, wherein the displayed information corresponds to the 3D position and posture of the object.
According to the object anchoring method, combining modeling-and-rendering-based synthetic data with automatically labeled real data solves the problem that manual labeling is labor-intensive and slow, improves the efficiency and accuracy of model training, and makes tracking with a deep learning model of a user-defined object feasible; tracking initialization can be automatic and has low sensitivity to illumination, environment and the like.
According to the object anchoring method, the end cloud combined framework is adopted, so that large-scale object recognition and 3D position and posture tracking of the mobile terminal are possible.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the scope of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification of the application, illustrate embodiments of the application and together with the description, serve to explain the principles of the application.
Fig. 1 is a flowchart of an object anchoring method according to an embodiment of the present disclosure.
Fig. 2 is a block diagram of an object anchoring system according to an embodiment of the present invention.
Fig. 3 is a block diagram of a structure of a cloud-end training unit in an object anchoring system according to an embodiment of the present disclosure.
Fig. 4 is a block diagram illustrating a structure of a deep learning-based modeling unit in an object anchoring system according to an embodiment of the present disclosure.
Fig. 5 is a schematic diagram of a modeling process of a modeling unit based on computer vision in an object anchoring system according to an embodiment of the present application.
Fig. 6 is a block diagram illustrating a structure of a synthesized training data generating unit in an object anchoring system according to an embodiment of the present disclosure.
Fig. 7 is a flowchart illustrating a processing of a PBR rendering unit in an object anchoring system according to an embodiment of the present disclosure.
Fig. 8 is a flowchart illustrating a process of a composite image reality migration unit in an object anchoring system according to an embodiment of the present application.
Fig. 9 is a block diagram illustrating a structure of a real training data generating unit in an object anchoring system according to an embodiment of the present disclosure.
Fig. 10 is a flowchart of an implementation of an object pose calculation and rendering unit in an object anchoring system by a mobile terminal according to an embodiment of the present disclosure.
Fig. 11 is a flowchart of an implementation of an object pose calculation and rendering unit in an object anchoring system by mixing a mobile terminal and a cloud server according to an embodiment of the present disclosure.
Description of reference numerals:
1. a cloud training unit;
11. a modeling unit;
12. a synthetic training data generating unit; 121. a PBR rendering unit; 122. a composite image reality migration unit;
13. a real training data generating unit; 131. a model reprojection segmentation algorithm unit; 132. an inter-frame data synthesis unit;
14. a training algorithm unit;
2. and an object pose calculating and rendering unit.
Detailed Description
For the purpose of promoting a clear understanding of the objects, aspects and advantages of the embodiments of the present application, reference will now be made to the accompanying drawings and detailed description, wherein like reference numerals refer to like elements throughout.
The illustrative embodiments and descriptions of the present application are provided to explain the present application and not to limit the present application. Additionally, the same or similar numbered elements/components used in the drawings and the embodiments are used to represent the same or similar parts.
As used herein, "first," "second," …, etc., are not specifically intended to mean in a sequential or chronological order, nor are they intended to limit the application, but merely to distinguish between elements or operations described in the same technical language.
As used herein, the terms "comprising," "including," "having," "containing," and the like are open-ended terms that mean including, but not limited to.
As used herein, "and/or" includes any and all combinations of the described items.
References to "plurality" herein include "two" and "more than two"; reference to "multiple sets" herein includes "two sets" and "more than two sets".
Certain words used to describe the present application are discussed below or elsewhere in this specification to provide additional guidance to those skilled in the art in describing the present application.
As shown in fig. 1, an object anchoring method provided in an embodiment of the present application includes the following steps:
and S1, training according to the acquired image sequence containing the interested object to obtain a three-dimensional model of the interested object and a six-degree-of-freedom pose estimation neural network model for estimating the pose of the object.
S2, performing pose estimation on the interested object according to the three-dimensional model of the interested object and the six-degree-of-freedom pose estimation neural network model for object pose estimation to obtain the pose of the interested object, and overlaying virtual information on the interested object according to the pose to realize the rendering of the interested object.
In the step S1, in the process of training the obtained stereoscopic model of the object of interest according to the obtained image sequence containing the object of interest, the modeling may be completed based on deep learning, or may be completed based on computer vision.
When modeling is completed based on deep learning, the specific process is as follows:
s111, extracting features and initializing camera pose estimation;
the features of each frame image I_i are extracted, and the camera initialization pose T_i corresponding to each frame image is estimated.
S112, segmenting the salient object;
the mask M_i of each frame image I_i is obtained by using the pre-trained saliency segmentation network.
S113, model training and inference;
the goal of model training is to obtain a mesh of the model.
K pixel points are randomly sampled on the image I, the position coordinates of each pixel point being p. The position coordinates p of each pixel point are converted into imaging plane coordinates x by using the camera intrinsic parameters.

The imaging plane coordinates and the optimized camera pose are input into a neural network F_c, and the inter-frame color difference feature ΔC given by formula (1) is extracted; ΔC is added to the original image to compensate the color difference between frames.

The camera initialization pose T_init corresponding to the image is input into the neural network F_T, and the more accurate optimized pose T' given by formula (2) is obtained. The optimized camera pose is characterized by the rotation angles about the x-axis, the y-axis and the z-axis together with the initial position of the camera.

The optimized initial camera position o is obtained according to the optimized pose T':

o = trans(T')    (3)

In formula (3), trans is a function that takes the position coordinates of a pose.

A ray is emitted from the optimized initial camera position o in the direction w that passes through the pixel position coordinates x, wherein the ray direction is

w = (x − o) / ‖x − o‖    (4)

M points are sampled along the direction w, the coordinates of these M points being x_1, …, x_M. A deep learning network F_s is used to predict the probability that each of the M points lies on the surface of the implicit equation (i.e., the implicit function TSDF).

The judgment condition for a point predicted to be on the surface of the implicit equation (formula (5)) is that its predicted value satisfies the threshold ε, the surface point being taken at the smallest index m (argmin over m) along the ray for which this holds. A point satisfying formula (5) can be predicted as a point on the surface of the implicit equation.

The points x_s predicted to be on the surface of the implicit equation are fed into a neural renderer R to obtain the predicted RGB color values

ĉ = R(x_s)    (6)

The squared loss of the pixel differences is calculated according to the predicted ĉ values and the acquired colors of the K sampled pixel points, so that the shape of the mesh becomes closer to the mesh of the object in the image.

The squared loss of the pixel differences L is

L = λ1·L_rgb + λ2·(L_bg + L_fg) + λ3·L_edge    (7)

In formula (7), λ1, λ2 and λ3 all represent coefficients; λ1 may be 1, λ2 may be 0.5 and λ3 may be 1. L_rgb represents the difference of the image pixels, L_bg + L_fg represents the sum of the difference of the background mask and the difference of the foreground mask, and L_edge represents the difference of the edges.

In formula (7), the difference of the image pixels is

L_rgb = Σ_{p∈P} ‖ĉ_p − c_p‖²    (8)

In formula (8), P denotes all of the selected k points, ĉ_p the predicted color and c_p the acquired color of pixel point p.

The difference of the background mask is

L_bg = Σ_{p∈P_out} BCE(m̂_p, 0)    (9)

In formula (9), P_out denotes the points, among all of the selected k points, that lie outside the mask, and m̂_p denotes the estimated mask value of point p. The physical meaning of formula (9) is: for points not on the object, the estimated background mask value should be as close to 0 as possible.

The difference of the foreground mask is

L_fg = Σ_{p∈P_in} BCE(m̂_p, 1)    (10)

The physical meaning of formula (10) is: for points on the object, the estimated foreground mask value should be as close to 1 as possible. In formulae (9) and (10), BCE represents the binary cross-entropy loss and P_in denotes the points, among all of the selected k points, that lie inside the mask.

The difference of the edges in formula (7), L_edge (formula (11)), applies an additional loss to the sampled points lying on the boundary ∂M of the mask so as to increase their weight, wherein ∂M represents the boundary of the mask.

When the model performs inference, 3D points are input into the combined model of the neural network F_c, the deep learning network F_s and the neural network F_T; the combined model is used to obtain the points that lie on its surface, and a mesh is formed from these points.
When modeling is completed based on computer vision, the specific process is as follows:
s121, extracting and matching features by adopting a visual algorithm or a deep learning algorithm;
and extracting features from the input image sequence, matching the features, and taking the matched features as input of camera pose estimation.
The input image sequence may be a color image or a grayscale image. The algorithm for extracting and matching the features can be SIFT, HAAR, ORB and other traditional visual algorithms, and can also be a deep learning algorithm.
S122, estimating the pose of the camera;
The matched features are taken as observations, and the camera pose is estimated by using SFM (structure from motion, an offline algorithm for three-dimensional reconstruction from a collection of unordered pictures).
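For the feature matching and camera pose estimation steps, a minimal two-view sketch using OpenCV is shown below (ORB features and essential-matrix decomposition); a full SFM pipeline would extend this incrementally over the whole image sequence and recover metric scale separately.

```python
import cv2
import numpy as np

def two_view_pose(img1, img2, K):
    """Estimate the relative camera pose between two frames from matched ORB features."""
    orb = cv2.ORB_create(nfeatures=2000)
    k1, d1 = orb.detectAndCompute(img1, None)
    k2, d2 = orb.detectAndCompute(img2, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(d1, d2), key=lambda m: m.distance)

    pts1 = np.float32([k1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([k2[m.trainIdx].pt for m in matches])

    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t    # rotation and (unit-scale) translation of camera 2 w.r.t. camera 1
```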
S123, segmenting the salient objects in the image sequence;
and taking the camera pose as a priori, and segmenting the salient objects in the image sequence by using a salient object segmentation algorithm to serve as the input of point cloud reconstruction.
S124, reconstructing the dense point cloud;
A 3D point cloud of the feature points is generated according to the camera poses and the feature points, and a dense point cloud is obtained by using a block matching algorithm.
And S125, using the reconstructed dense point cloud as the input of mesh generation, and reconstructing the mesh of the object by using a reconstruction algorithm such as Poisson reconstruction.
And S126, finding the texture coordinates corresponding to the mesh vertices according to the camera poses and their corresponding images, to obtain the texture map of the mesh.
And S127, obtaining a three-dimensional model from the mesh of the object and the texture map of the mesh.
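Step S126 amounts to projecting each mesh vertex into the image of a camera that observes it. A simplified single-camera sketch is given below; the best-view selection and seam handling of a full texturing pipeline are omitted.

```python
import numpy as np

def vertex_uv(vertices, R, t, K, image_size):
    """Project mesh vertices with a known camera pose and normalize to UV texture coordinates."""
    w, h = image_size
    cam_pts = vertices @ R.T + t            # world -> camera coordinates
    pix = cam_pts @ K.T
    pix = pix[:, :2] / pix[:, 2:3]          # perspective division -> pixel coordinates
    uv = np.stack([pix[:, 0] / w, 1.0 - pix[:, 1] / h], axis=1)   # image -> UV convention
    visible = cam_pts[:, 2] > 0             # keep only vertices in front of the camera
    return uv, visible
```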
In the step S1, the specific process of training the six-degree-of-freedom pose estimation neural network model for object pose estimation according to the acquired image sequence including the object of interest includes:
and obtaining a synthetic data set by adopting a PBR rendering method according to the three-dimensional model and the preset scene model of the object. Wherein the synthetic data set includes synthetic training data.
And obtaining a real data set by adopting a model reprojection segmentation algorithm according to the camera pose and the object pose. Wherein the real dataset comprises real training data.
And training the six-degree-of-freedom pose estimation neural network based on deep learning by utilizing the synthetic training data and the real training data to obtain a six-degree-of-freedom pose estimation neural network model.
In a specific embodiment, according to the three-dimensional model and the preset scene model of the object, the specific process of obtaining the synthetic data set by using the PBR rendering method includes:
reading a three-dimensional model and a preset scene model of an object;
and (4) carrying out object pose randomization, rendering camera pose randomization, material randomization and illumination randomization by adopting a PBR rendering method to obtain a series of image sequences and corresponding labeling labels. The label can be a category, a position, a pose with six degrees of freedom, and the like.
The specific process of obtaining the synthetic data set by adopting the PBR rendering method according to the three-dimensional model and the preset scene model of the object further comprises the following steps:
reading a three-dimensional model, a real image or a PBR image, and performing preprocessing such as background removal on the image; synthetic images at different angles and their corresponding annotation labels are then generated by a deep learning network such as a GAN (Generative Adversarial Network) or NeRF (Neural Radiance Fields). The labels can be the category, the position, the six-degree-of-freedom pose, and the like.
In a specific embodiment, the specific process of obtaining the real data set by using the model reprojection segmentation algorithm according to the camera pose and the object pose is as follows:
acquiring an image sequence, a camera pose and an object pose, and segmenting an object in a real image;
and synthesizing the real data with the discrete poses into data with more dense and continuous poses, and further obtaining a real image and a corresponding label thereof. The label can be a category, a position, a pose with six degrees of freedom, and the like.
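The densification of discrete real poses can be illustrated with simple pose interpolation. The sketch below interpolates only the pose labels (SciPy SLERP for rotation, linear interpolation for translation) and assumes a separate synthesis step produces the corresponding intermediate images.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def densify_poses(times, rotations_quat, translations, n_out=100):
    """Interpolate sparse object poses into a dense, continuous pose trajectory."""
    rots = Rotation.from_quat(np.asarray(rotations_quat))   # (N, 4) xyzw quaternions
    slerp = Slerp(times, rots)                               # spherical interpolation of rotations
    t_new = np.linspace(times[0], times[-1], n_out)

    dense_rots = slerp(t_new).as_quat()
    # Linear interpolation of the translation component.
    trans = np.asarray(translations)
    dense_trans = np.stack([np.interp(t_new, times, trans[:, k]) for k in range(3)], axis=1)
    return t_new, dense_rots, dense_trans
```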
In a specific embodiment, the specific process of training the six-degree-of-freedom pose estimation neural network based on deep learning by using the synthetic training data and the real training data to obtain the six-degree-of-freedom pose estimation neural network model is as follows:
the method comprises the steps of inputting an image, 2D coordinates of a plurality of characteristic points extracted from an object, 3D coordinates corresponding to the characteristic points and an image mask.
And training the six-degree-of-freedom pose estimation neural network by adopting the following loss function to obtain a six-degree-of-freedom pose estimation neural network model.
The loss function needed when training the six-degree-of-freedom pose estimation neural network is as follows:
L = λ1·L_cls + λ2·L_box + λ3·L_2D + λ4·L_3D + λ5·L_mask + λ6·L_proj    (12)

In formula (12), L denotes the loss; λ1 to λ6 all denote coefficients; L_cls denotes the classification loss, L_box denotes the bounding box loss, L_2D denotes the 2D loss, L_3D denotes the 3D loss, L_mask denotes the mask loss, and L_proj denotes the projection loss.

Specifically, the classification loss L_cls (formula (13)) is computed over the classification information of the i-th detection anchor point and the information of the j-th background feature, using the detection anchor points, the background anchor points, the category truth values and the features proposed by the neural network.

The bounding box loss L_box (formula (14)) is computed from the coordinate features of the i-th detection anchor point and the coordinate truth values of the detection box.

The 2D loss L_2D (formula (15)) is computed from the 2D coordinate features, the truth values of the 2D feature points of the object, and the feature points and mask predicted by the neural network.

The 3D loss L_3D (formula (16)) is computed from the 3D coordinate features, the truth values of the 3D feature points of the object, and the feature points and mask predicted by the neural network.

The mask loss L_mask (formula (17)) is computed from the i-th foreground feature and the j-th background feature, where fg denotes the foreground and bg denotes the background.

The projection loss L_proj (formula (18)) is computed from the difference between the 3D features projected to 2D and the 2D truth values, together with the feature points and mask predicted by the neural network.
In the step S2, the pose calculation and rendering of the object of interest may be implemented by the mobile terminal, or may be implemented by mixing the mobile terminal and the cloud server.
The mode of realizing the pose calculation and the rendering of the object of interest through the mobile terminal is suitable for the case where the user has few custom models. Before tracking starts, the cloud server is accessed only once; after the object model, the deep learning model, the feature database and the like of the user are downloaded, all other calculations are carried out on the mobile terminal. The mobile terminal reads camera data from the device, obtains the object pose through the detection or recognition neural network and the six-degree-of-freedom pose estimation neural network, and then renders the content to be rendered according to the pose.
The mode of realizing the pose calculation and rendering of the interested object by mixing the mobile terminal and the cloud server is suitable for the condition that more user-defined models are available, and is a general object pose tracking solution. In the tracking process, the cloud server needs to be accessed and resources downloaded one or more times. The mobile terminal inputs an image sequence and outputs an object pose and a rendered image.
The main flow of the mode is as follows: inputting an image sequence in the mobile terminal, performing significance detection on each frame of image, uploading a significance detection area to the cloud server for retrieval, obtaining object information and a depth learning model related to the object information, loading the object information and the depth learning model to the mobile terminal for pose estimation, then obtaining an object pose, and rendering content to be rendered according to the pose.
The object anchoring method provided by the application adopts an unsupervised deep learning modeling approach: only a small number of feature points are needed to compute the initial camera posture before modeling can be carried out, and no feature points on the object itself are required, so that a pure-color object or an object with little texture can also be modeled.
According to the object anchoring method, the model that performs recognition and 3D position and posture tracking from 2D images is trained with both synthesized data and real data, which can solve the problem that inaccuracy, illumination, environment and the like have a great influence on the algorithm during user-defined object recognition and 3D tracking, thereby realizing a method for obtaining and displaying user-defined object information on the mobile terminal, wherein the displayed information corresponds to the 3D position and posture of the object.
According to the object anchoring method, combining modeling-and-rendering-based synthetic data with automatically labeled real data solves the problem that manual labeling is labor-intensive and slow, improves the efficiency and accuracy of model training, and makes tracking with a deep learning model of a user-defined object feasible; tracking initialization can be automatic and has low sensitivity to illumination, environment and the like.
According to the object anchoring method, the end cloud combined framework is adopted, so that large-scale object recognition and 3D position and posture tracking of the mobile terminal are possible.
Based on the object anchoring method provided by the application, the application also provides an object anchoring system provided by the application.
Fig. 2 is a schematic structural diagram of an object anchoring system according to an embodiment of the present application.
As shown in fig. 2, the object anchoring system provided in the embodiment of the present application includes a cloud training unit 1 and an object pose calculation and rendering unit 2. The cloud training unit 1 is used for obtaining a three-dimensional model of an interested object and a six-degree-of-freedom pose estimation neural network model for object posture estimation through training according to an acquired image sequence containing the interested object. The object pose calculation and rendering unit 2 is configured to perform pose estimation on the object of interest according to the three-dimensional model of the object of interest and the six-degree-of-freedom pose estimation neural network model for object pose estimation, and superimpose virtual information on the object of interest to implement rendering of the object of interest.
In the present embodiment, as shown in fig. 3, the cloud training unit 1 includes a modeling unit 11, a synthetic training data generating unit 12, a real training data generating unit 13, and a training algorithm unit 14.
The modeling unit 11 is configured to train a three-dimensional model of the object of interest according to the acquired image sequence including the object of interest.
The synthetic training data generating unit 12 is configured to obtain a synthetic data set according to the three-dimensional model of the object and the preset scene model, where the synthetic data set includes synthetic training data.
The real training data generating unit 13 is configured to obtain a real data set according to the camera pose and the object pose, where the real data set includes real training data.
The training algorithm unit 14 is configured to train the deep-learning-based six-degree-of-freedom pose estimation neural network according to the synthetic training data and the real training data, so as to obtain a six-degree-of-freedom pose estimation neural network model.
In a specific embodiment, the modeling unit 11 comprises a deep learning based modeling unit and a computer vision based modeling unit.
As shown in fig. 4, the input of the deep learning based modeling unit is a sequence of images, and its output is a deep learning model. Multiple images are input into the deep learning model for inference to obtain the mesh and textures.
The modeling process of the deep learning based modeling unit is the same as the content of the above steps S111-S113, and is not repeated here.
As shown in fig. 5, the input of the computer vision based modeling unit is a sequence of images, the output of which is a modeled stereo model.
The modeling process of the computer vision-based modeling unit is the same as that of the above steps S121 to S127, and is not repeated here.
In the above-described embodiment, as shown in fig. 6 and 7, the synthetic training data generating unit 12 includes a PBR (Physically Based Rendering) rendering unit 121. The PBR rendering unit 121 reads the stereo model of the object and the preset scene model by using a rendering framework such as Blender or Unity, and performs object pose randomization, rendering camera pose randomization, material randomization and illumination randomization to obtain a series of image sequences and corresponding annotation labels. The labels can be the category, the position, the six-degree-of-freedom pose, and the like.
As shown in fig. 6 and 8, the synthetic training data generating unit 12 further includes a synthetic image reality migration unit 122. The synthetic image reality migration unit 122 reads the stereo model, a real image or a PBR image, performs preprocessing such as background removal on the image, and then generates synthetic images at different angles and their corresponding annotation labels through a deep learning network such as a GAN (Generative Adversarial Network) or NeRF (Neural Radiance Fields). The labels can be the category, the position, the six-degree-of-freedom pose, and the like.
In the above embodiment, as shown in fig. 9, the real training data generating unit 13 includes the model reprojection segmentation algorithm unit 131. The model re-projection segmentation algorithm unit 131 obtains the image sequence, the camera pose and the object pose, and segments the object in the real image.
The real training data generating unit 13 further includes an inter-frame data synthesizing unit 132, which is configured to synthesize the real data with discrete poses into data with more dense and continuous poses, so as to obtain a real image and its corresponding label. The label can be a category, a position, a pose with six degrees of freedom, and the like.
In the above embodiment, the training algorithm unit 14 trains the six-degree-of-freedom pose estimation neural network based on deep learning according to the synthetic training data and the real training data.
The six-degree-of-freedom pose estimation neural network is trained in an end-to-end manner, so that object detection and six-degree-of-freedom pose estimation are completed by a single network. The inputs of the network are an image, the 2D coordinates of a plurality of feature points extracted from the object, the 3D coordinates corresponding to the feature points, and an image mask. The network structure is shown in figure 9: a first-stage neural network outputs the detection box, and a second-stage neural network computes the 2D and 3D key points of the object. The cross-entropy of the mask is mainly used to remove the interference of background features, the 2D key points are regressed as Gaussian heatmaps, the 3D key points are normalized to the range 0-1 based on the initial posture of the object, and the projection error is used to guarantee the consistency of the 2D and 3D key points.
The loss functions required for training the six-degree-of-freedom pose estimation neural network are the same as formulae (12) to (18) above and are not described again here.
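The Gaussian-heatmap regression targets for the 2D key points mentioned above can be generated as follows; the heatmap resolution and sigma are illustrative assumptions.

```python
import numpy as np

def keypoint_heatmap(kpt_xy, heatmap_size=(64, 64), sigma=2.0):
    """Render one 2D key point as a Gaussian heatmap target (values in [0, 1])."""
    w, h = heatmap_size
    xs = np.arange(w)[None, :]                # (1, w)
    ys = np.arange(h)[:, None]                # (h, 1)
    cx, cy = kpt_xy                           # key point already scaled to heatmap coordinates
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))   # peak of 1 at the key point location

# Example: a target stack for N key points is np.stack([keypoint_heatmap(k) for k in kpts]).
```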
In the above embodiments, the object pose calculation and rendering unit 2 may be implemented by a mobile terminal, or may be implemented by mixing the mobile terminal and a cloud server.
As shown in fig. 10, the mode in which the object pose calculation and rendering unit 2 is implemented by the mobile terminal is suitable for the case where the user has few custom models. Before tracking starts, the cloud server is accessed only once; after the object model, the deep learning model, the feature database and the like of the user are downloaded, all other calculations are carried out on the mobile terminal. The mobile terminal reads camera data from the device, obtains the object pose through the detection or recognition neural network and the six-degree-of-freedom pose estimation neural network, and then renders the content to be rendered according to the pose.
As shown in fig. 11, the mode of the object pose calculation and rendering unit 2 implemented by mixing the mobile terminal and the cloud server is suitable for the case of many user-defined models of users, and is a solution for tracking the object pose in general. In the tracking process, the cloud server needs to be accessed and resources downloaded one or more times. The mobile terminal inputs an image sequence and outputs an object pose and a rendered image.
The main flow of the mode is as follows: inputting an image sequence in the mobile terminal, performing significance detection on each frame of image, uploading a significance detection area to the cloud server for retrieval, obtaining object information and a depth learning model related to the object information, loading the object information and the depth learning model to the mobile terminal for pose estimation, then obtaining an object pose, and rendering content to be rendered according to the pose.
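For illustration only, a minimal sketch of such a client-side loop; the callables saliency_detect, cloud_retrieve, estimate_pose and render_overlay are placeholders injected by the caller, since the patent does not specify concrete interfaces for these components.

def hybrid_tracking_loop(frames, saliency_detect, cloud_retrieve, estimate_pose, render_overlay):
    # Per frame: find the salient region, fetch assets from the cloud the first time an
    # object is seen, then estimate its 6-DoF pose and render the overlay on the device.
    asset_cache = {}
    for frame in frames:
        object_id, region = saliency_detect(frame)
        if object_id not in asset_cache:
            asset_cache[object_id] = cloud_retrieve(region)   # one round trip per new object
        assets = asset_cache[object_id]                       # object info + deep learning model
        pose = estimate_pose(frame, assets)                   # on-device pose estimation
        yield render_overlay(frame, pose, assets)             # rendered frame for display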
It should be noted that the object anchoring system provided in the above embodiment is illustrated only by the division into the above program modules; in practical applications, the processing may be distributed to different program modules as needed, that is, the internal structure of the object anchoring system may be divided into different program modules to complete all or part of the processing described above. In addition, the object anchoring system and the object anchoring method provided by the above embodiments belong to the same concept; their specific implementation processes are described in detail in the method embodiments and are not repeated here.
In an exemplary embodiment, the present application further provides a storage medium, which is a computer-readable storage medium, for example a memory storing a computer program, the computer program being executable by a processor to perform the steps of the foregoing object anchoring method.
The embodiments of the present application described above may be implemented in hardware, in software code, or in a combination of both. For example, the embodiments of the present application may be implemented as program code that executes the above-described method in a digital signal processor. The present application may also relate to various functions performed by a computer processor, a digital signal processor, a microprocessor, or a field programmable gate array. The processor described above may be configured in accordance with the present application to perform particular tasks by executing machine-readable software code or firmware code that defines the particular methods disclosed herein. The software code or firmware code may be developed in different programming languages and in different formats or forms, and may be compiled for different target platforms. Different code styles, types, and languages of software code, and other forms of configuration code that perform tasks according to the present application, do not depart from the spirit and scope of the present application.
The foregoing is merely an illustrative embodiment of the present application, and any equivalent changes and modifications made by those skilled in the art without departing from the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (8)

1. A method of anchoring an object, comprising the steps of:
training according to the acquired image sequence containing the object of interest to obtain a three-dimensional model of the object of interest and a six-degree-of-freedom pose estimation neural network model for estimating the object pose; in the process of training to obtain the three-dimensional model of the object of interest from the acquired image sequence containing the object of interest, the modeling is completed based on deep learning or on computer vision, and the process of completing the modeling based on deep learning is as follows:
extracting features from each frame of image, and estimating the camera initialization pose corresponding to each frame of image;
acquiring a mask for each frame of image by using a pre-trained saliency segmentation network;
performing model training and inference to obtain the mesh of the model;
the process of completing the modeling based on computer vision is as follows:
performing feature extraction and matching using a computer vision algorithm or a deep learning algorithm;
estimating the pose of the camera;
segmenting salient objects in the image sequence;
reconstructing a dense point cloud;
using the reconstructed dense point cloud as the input of mesh generation, and reconstructing the mesh of the object with a reconstruction algorithm;
finding the texture coordinates corresponding to the mesh vertices according to the camera pose and its corresponding image, to obtain the texture map of the mesh;
obtaining the three-dimensional model from the mesh of the object and the texture map of the mesh;
the specific process of training the six-degree-of-freedom pose estimation neural network model for object pose estimation according to the acquired image sequence containing the object of interest comprises the following steps:
obtaining a synthetic dataset by a PBR rendering method according to the three-dimensional model of the object and a preset scene model; the synthetic dataset includes synthetic training data;
obtaining a real dataset by a model reprojection segmentation algorithm according to the camera pose and the object pose; the real dataset comprises real training data;
training a six-degree-of-freedom pose estimation neural network based on deep learning by utilizing the synthetic training data and the real training data to obtain a six-degree-of-freedom pose estimation neural network model;
and performing pose estimation on the object of interest according to the three-dimensional model of the object of interest and the six-degree-of-freedom pose estimation neural network model for object pose estimation to obtain the pose of the object of interest, and superimposing virtual information on the object of interest according to the pose to realize the rendering of the object of interest.
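For illustration only, a minimal sketch (assuming SciPy, and an axis-angle pose representation, which the claim does not prescribe) of turning an estimated six-degree-of-freedom pose into the 4x4 matrix a renderer would use to superimpose virtual information on the object of interest.

import numpy as np
from scipy.spatial.transform import Rotation

def pose_to_modelview(rotvec, translation):
    # rotvec: (3,) axis-angle rotation; translation: (3,) position, both from the pose estimate.
    M = np.eye(4)
    M[:3, :3] = Rotation.from_rotvec(rotvec).as_matrix()
    M[:3, 3] = translation
    return M    # model-view matrix that keeps the virtual content anchored to the object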
2. The object anchoring method according to claim 1, wherein the process of performing model training and inference is:
randomly acquiring K pixel points on the image, and converting the position coordinates of each pixel point into imaging plane coordinates by using the camera intrinsic parameters;
inputting the imaging plane coordinates and the optimized camera pose into a neural network and extracting the inter-frame color difference feature; adding the inter-frame color difference feature to the original image to compensate for the color difference between frames; the inter-frame color difference feature is given by [formula not reproduced], in which one quantity represents the image ground-truth value;
inputting the camera initialization pose corresponding to the image into a neural network to obtain the optimized pose, the optimized pose being given by [formula not reproduced];
obtaining the optimized initial camera position from the optimized pose, the optimized initial camera position being given by [formula not reproduced], in which T is a function that takes the position coordinates;
emitting, from the optimized initial camera position, a ray whose direction w passes through the position coordinates of the pixel point, the direction w being given by [formula not reproduced];
sampling M points along the direction w;
predicting, with a deep learning network, the probability that each of the M points lies on the surface of the implicit equation;
wherein the condition for judging that a point is predicted to lie on the surface of the implicit equation is [formula not reproduced], in which one quantity denotes the points predicted to lie on the surface, another denotes a threshold value, and the minimum index m satisfying the condition is taken;
feeding the points predicted to lie on the surface of the implicit equation into a neural renderer R to obtain the predicted RGB color values, the predicted RGB color values being given by [formula not reproduced];
calculating, from the predicted color values and the colors of the K acquired pixel points, the squared loss of the pixel differences;
wherein the squared loss L of the pixel differences is given by [formula not reproduced], a weighted sum whose coefficients multiply, respectively, the image pixel difference, the sum of the background mask difference and the foreground mask difference, and the edge difference;
the image pixel difference is given by [formula not reproduced], in which P denotes all of the selected K points and one quantity denotes the predicted color value;
the background mask difference is given by [formula not reproduced], evaluated over the points outside the mask among all of the selected K points;
the foreground mask difference is given by [formula not reproduced], in which BCE denotes the binary cross-entropy loss, evaluated over the points inside the mask among all of the selected K points;
the edge difference is given by [formula not reproduced], in which one quantity denotes the boundary of the mask;
at model inference, 3D points are input into the combined model formed by the above neural networks and the deep learning network; the combined model is used to obtain the points lying on its surface, and a mesh is formed from these points.
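For illustration only, a minimal sketch (assuming NumPy) of the ray-sampling step described in claim 2: M points are sampled along the ray from the optimized camera position through a pixel, and the first point whose predicted surface probability exceeds a threshold is kept. The callable occupancy_fn, the sampling range and the threshold tau are assumptions standing in for the deep learning network and the claim's own formulas, which are not reproduced here.

import numpy as np

def first_surface_point(camera_origin, pixel_point, occupancy_fn, M=64, near=0.1, far=4.0, tau=0.5):
    w = pixel_point - camera_origin
    w = w / np.linalg.norm(w)                        # unit ray direction through the pixel
    depths = np.linspace(near, far, M)
    samples = camera_origin + depths[:, None] * w    # (M, 3) points sampled along the ray
    probs = np.asarray(occupancy_fn(samples))        # predicted probability of lying on the surface
    hits = np.nonzero(probs >= tau)[0]
    return samples[hits.min()] if hits.size else None    # smallest index m over the threshold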
3. The object anchoring method according to claim 1, wherein the specific process of obtaining the synthetic dataset by the PBR rendering method according to the three-dimensional model of the object and the preset scene model is:
reading the three-dimensional model of the object and the preset scene model;
randomizing the object pose, the rendering camera pose, the materials and the illumination with the PBR rendering method to obtain a series of image sequences and their corresponding annotation labels; the labels are the category, the position and the six-degree-of-freedom pose.
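For illustration only, a minimal sketch (assuming SciPy) of drawing one randomized object pose, camera pose and illumination value together with the label that would accompany the rendered image; the value ranges and the function name are assumptions, and the PBR rendering call itself is omitted.

import numpy as np
from scipy.spatial.transform import Rotation

def sample_render_setup(rng, category):
    obj_R = Rotation.random()                         # random object orientation
    obj_t = rng.uniform(-0.5, 0.5, size=3)            # random object position
    cam_R = Rotation.random()                         # random rendering camera orientation
    cam_t = rng.uniform(1.0, 3.0, size=3)             # random camera position
    light = rng.uniform(100.0, 1000.0)                # random illumination intensity
    label = {
        "category": category,
        "position": obj_t.tolist(),
        "pose_6dof": np.concatenate([obj_R.as_rotvec(), obj_t]).tolist(),
    }
    setup = {"object_pose": (obj_R, obj_t), "camera_pose": (cam_R, cam_t), "light": light}
    return setup, label

# usage: setup, label = sample_render_setup(np.random.default_rng(0), "cup")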
4. The object anchoring method according to claim 1, wherein the specific process of obtaining the real dataset by using the model reprojection segmentation algorithm according to the camera pose and the object pose is as follows:
acquiring an image sequence, a camera pose and an object pose, and segmenting an object in a real image;
synthesizing the real data with discrete poses into data with dense and continuous poses, thereby obtaining real images and their corresponding annotation labels; the labels are the category, the position and the six-degree-of-freedom pose.
5. The object anchoring method according to claim 1, wherein the training of the six-degree-of-freedom pose estimation neural network based on deep learning by using the synthetic training data and the real training data to obtain the six-degree-of-freedom pose estimation neural network model comprises:
inputting the 2D coordinates of a plurality of feature points extracted from the image and the object, the 3D coordinates corresponding to the feature points, and an image mask;
training the six-degree-of-freedom pose estimation neural network by adopting the following loss function to obtain a six-degree-of-freedom pose estimation neural network model;
the loss function needed when training the six-degree-of-freedom pose estimation neural network is as follows:
the loss is given by [formula not reproduced], a weighted sum in which the coefficients multiply, respectively, the classification loss, the bounding box loss, the 2D loss, the 3D loss, the mask loss and the projection loss;
wherein the classification loss is given by [formula not reproduced], in which the quantities denote the classification feature of the i-th detection anchor, the j-th background feature, the anchors, the background anchors, the ground-truth category, and the features proposed by the neural network;
the bounding box loss is given by [formula not reproduced], in which the quantities denote the coordinate feature of the i-th detection anchor and the ground-truth coordinates of the detection box;
the 2D loss is given by [formula not reproduced], in which the quantities denote the 2D coordinate features and the ground-truth 2D feature points of the object;
the 3D loss is given by [formula not reproduced], in which the quantities denote the 3D coordinate features and the ground-truth 3D feature points of the object;
the mask loss is given by [formula not reproduced], in which the quantities denote the i-th foreground feature and the j-th background feature, fg denoting the foreground and bg denoting the background;
the projection loss is given by [formula not reproduced], in which one quantity is the difference between the 3D features projected into 2D and the 2D ground truth, and the others are the feature points and the mask predicted by the neural network.
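For illustration only, a composite loss of the kind described in claim 5 is commonly written as the weighted sum below; the weights \lambda_1 to \lambda_6 and the per-term forms shown for the 2D and 3D losses are assumptions, not the patent's own equations, which are not reproduced here.

L = \lambda_1 L_{cls} + \lambda_2 L_{box} + \lambda_3 L_{2D} + \lambda_4 L_{3D} + \lambda_5 L_{mask} + \lambda_6 L_{proj},
\qquad L_{2D} = \sum_i \lVert h_i - h_i^{*} \rVert^2, \qquad L_{3D} = \sum_i \lVert g_i - g_i^{*} \rVert^2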
6. The object anchoring method according to claim 1, wherein the rendering of the object of interest is implemented by a mobile terminal alone or by a mobile terminal together with a cloud server;
the process implemented by the mobile terminal alone is as follows:
before tracking starts, accessing the cloud server, downloading the user's object model, deep learning model and feature database, and then performing the other computations on the mobile terminal;
the mobile terminal reads camera data from the device, and the object pose is obtained through the detection or recognition neural network and the six-degree-of-freedom pose estimation neural network;
rendering the content to be rendered according to the pose of the object;
the process implemented by the mobile terminal together with the cloud server is as follows:
inputting an image sequence on the mobile terminal, and performing saliency detection on each frame of image;
uploading the saliency detection region to the cloud server for retrieval to obtain the object information and the deep learning model associated with it, and loading them onto the mobile terminal;
estimating the pose of the object on the mobile terminal to obtain the object pose;
and rendering the content to be rendered according to the pose of the object.
7. An object anchoring system is characterized by comprising a cloud training unit and an object pose calculation and rendering unit;
the cloud training unit is used for training according to the acquired image sequence containing the object of interest to obtain a three-dimensional model of the object of interest and a six-degree-of-freedom pose estimation neural network model for estimating the object pose;
the object pose calculation and rendering unit is used for estimating the pose of the object of interest according to the three-dimensional model of the object of interest and the six-degree-of-freedom pose estimation neural network model, and superimposing virtual information on the object of interest to realize the rendering of the object of interest;
the cloud training unit comprises a modeling unit, a synthetic training data generating unit, a real training data generating unit and a training algorithm unit;
the modeling unit is used for training according to the acquired image sequence containing the object of interest to obtain a three-dimensional model of the object of interest;
the synthetic training data generation unit is used for obtaining a synthetic data set according to a three-dimensional model of an object and a preset scene model, and the synthetic data set comprises synthetic training data;
the real training data generation unit is used for obtaining a real data set according to the camera pose and the object pose, and the real data set comprises real training data;
and the training algorithm unit is used for training the six-degree-of-freedom pose estimation neural network based on deep learning according to the synthetic training data and the real training data to obtain a six-degree-of-freedom pose estimation neural network model.
8. A storage medium having stored thereon an executable program which, when invoked, performs the steps in the object anchoring method according to any one of claims 1 to 6.
CN202210173770.0A 2022-02-25 2022-02-25 Object anchoring method, anchoring system and storage medium Active CN114241013B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210173770.0A CN114241013B (en) 2022-02-25 2022-02-25 Object anchoring method, anchoring system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210173770.0A CN114241013B (en) 2022-02-25 2022-02-25 Object anchoring method, anchoring system and storage medium

Publications (2)

Publication Number Publication Date
CN114241013A (en) 2022-03-25
CN114241013B (en) 2022-05-10

Family

ID=80748105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210173770.0A Active CN114241013B (en) 2022-02-25 2022-02-25 Object anchoring method, anchoring system and storage medium

Country Status (1)

Country Link
CN (1) CN114241013B (en)




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant